This project covers the ingestion, transformation, storage, and analytics of Spotify data. Twenty of my favorite artists were picked, and their data is pulled via the Spotify API in Python. The data includes artist info, albums, and songs, and is stored in a SQLite database. Views have been built on top to make it useful for analytics. Finally, visualizations have been developed using the Seaborn library in Python.
Tools Used: Python, SQLite3, Spotipy, Pandas, Seaborn
The ETL pipeline performs the following steps to build the database (a rough sketch follows the list):
- Spotify data is extracted via the Spotify API. Data is retrieved in JSON format and converted into Pandas dataframes, which I refer to as raw dataframes.
- The raw dataframes are cleaned and formatted to fit the schema requirements. The Pandas library is used for these transformations.
- The newly cleaned dataframes are inserted into their respective tables in the SQLite database.
- Views that join and aggregate these tables have also been built.
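Below is a minimal sketch of the extract, transform, and load flow. It assumes Spotify credentials are set in the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables; the artist IDs, column selection, and table name are illustrative, not the project's actual schema.

```python
import sqlite3

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Hypothetical stand-in for the list of 20 favorite artists.
ARTIST_IDS = ["0TnOYISbd1XYRBk9myaseg"]

# Extract: pull artist JSON from the Spotify API into a raw dataframe.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())
raw_artists = pd.DataFrame([sp.artist(artist_id) for artist_id in ARTIST_IDS])

# Transform: keep only the columns the schema expects and flatten nested fields.
artists = raw_artists[["id", "name", "popularity", "followers"]].copy()
artists["followers"] = artists["followers"].apply(lambda f: f["total"])
artists = artists.rename(columns={"id": "artist_id"})

# Load: insert the cleaned dataframe into its SQLite table.
with sqlite3.connect("spotify.db") as conn:
    artists.to_sql("artists", conn, if_exists="append", index=False)
```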
`sql_queries.py`
- Contains all SQL queries used in this project.

`create_database.py`
- Builds the SQLite database along with all tables and views. Calls and executes the queries in `sql_queries.py`.

`etl.py`
- Runs the entire ETL pipeline described above.

`spotify.db`
- The SQLite database.

`visualization.ipynb`
- The notebook used for building the visualizations.

`visualization.pdf`
- Presents the visualizations built in `visualization.ipynb`.

`run.py`
- Runs the necessary files to complete the project via `subprocess`. It first executes `create_database.py`, followed by `etl.py`.
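As a rough illustration of how `create_database.py` and `sql_queries.py` fit together: the query module holds the DDL as strings, and the build script executes them in order. The statements below are hypothetical stand-ins, not the project's actual schema or views.

```python
import sqlite3

# sql_queries.py would define statements like these (illustrative only):
CREATE_ARTISTS_TABLE = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id  TEXT PRIMARY KEY,
    name       TEXT,
    popularity INTEGER,
    followers  INTEGER
);
"""

CREATE_ALBUMS_TABLE = """
CREATE TABLE IF NOT EXISTS albums (
    album_id  TEXT PRIMARY KEY,
    artist_id TEXT REFERENCES artists (artist_id),
    title     TEXT
);
"""

# A view joining and aggregating the base tables.
CREATE_ALBUM_COUNTS_VIEW = """
CREATE VIEW IF NOT EXISTS artist_album_counts AS
SELECT ar.name, COUNT(al.album_id) AS album_count
FROM artists ar
JOIN albums al ON al.artist_id = ar.artist_id
GROUP BY ar.name;
"""

# create_database.py would then execute each statement in turn.
with sqlite3.connect("spotify.db") as conn:
    for statement in (CREATE_ARTISTS_TABLE, CREATE_ALBUMS_TABLE,
                      CREATE_ALBUM_COUNTS_VIEW):
        conn.execute(statement)
```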
```
$ python run.py
```

or

```
$ python create_database.py
$ python etl.py
```
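Under the hood, `run.py` does roughly the following (a minimal sketch; the actual script may differ):

```python
import subprocess
import sys

# Build the database first, then run the ETL pipeline against it.
# check=True aborts the run if either script exits with an error.
for script in ("create_database.py", "etl.py"):
    subprocess.run([sys.executable, script], check=True)
```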
- Create separate functions for more comprehensive validation checks.
- Verify and enforce the schema should the Spotify API change, and alert if it does.
- Create staging tables for storing raw, unprocessed data.
- Include RAW_JSON columns to give analysts/scientists the option to parse the data themselves.
- Include LOAD_ID or LOAD_TS columns (a sketch of the staging-table idea follows this list).
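One possible shape for such a staging table, with a RAW_JSON column and a load timestamp. All names here are hypothetical; a LOAD_ID column could replace or accompany load_ts.

```python
import json
import sqlite3
from datetime import datetime, timezone

CREATE_STAGING_ARTISTS = """
CREATE TABLE IF NOT EXISTS staging_artists (
    artist_id TEXT,
    raw_json  TEXT,  -- unparsed API response, for self-serve parsing
    load_ts   TEXT   -- UTC timestamp of the load that inserted the row
);
"""

def stage_artist(conn, payload):
    """Insert one raw API response into the staging table (hypothetical helper)."""
    conn.execute(
        "INSERT INTO staging_artists (artist_id, raw_json, load_ts) VALUES (?, ?, ?)",
        (payload["id"], json.dumps(payload), datetime.now(timezone.utc).isoformat()),
    )

# Example with a fabricated payload, for illustration only.
with sqlite3.connect("spotify.db") as conn:
    conn.execute(CREATE_STAGING_ARTISTS)
    stage_artist(conn, {"id": "123abc", "name": "Example Artist"})
```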