Miles Martinez Submission

Project Summary

This project covers the ingestion, transformation, storage, and analysis of Spotify data. I picked 20 of my favorite artists and pulled their data from the Spotify API using Python. The data includes artist info, albums, and songs, and is stored in a SQLite database. Views are built on top of the tables to make the data easier to use for analytics. Finally, visualizations are built with the Seaborn library in Python.

Tools Used: Python, SQLite3, Spotipy, Pandas, Seaborn

Design

ETL Pipeline

The ETL pipeline performs the following three steps to build the database:

Spotify data is extracted via the Spotify API. Data is retrieved in JSON format and converted into Pandas dataframes, which I refer to as raw dataframes.
The raw dataframes are cleaned and formatted to fit the schema requirements. The Pandas library is used for these transformations.
The cleaned dataframes are inserted into their respective tables in the SQLite database.
Views that join and aggregate these tables have also been built.
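
The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual code in etl.py: the sample records stand in for what a Spotipy call would return, and the table, column, and view names (artist, album, v_artist_album_counts) are assumptions made up for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical records shaped loosely like Spotify API payloads.
# A real run would fetch these through Spotipy instead.
RAW_ARTISTS = [
    {"id": "a1", "name": " Artist One ", "popularity": 80,
     "followers": {"total": 1000}, "genres": ["indie", "rock"]},
    {"id": "a2", "name": "Artist Two", "popularity": 65,
     "followers": {"total": 250}, "genres": []},
]
RAW_ALBUMS = [
    {"id": "b1", "artist_id": "a1", "name": " First Album ", "total_tracks": 10},
    {"id": "b2", "artist_id": "a1", "name": "Second Album", "total_tracks": 12},
    {"id": "b2", "artist_id": "a1", "name": "Second Album", "total_tracks": 12},
]


def extract(records):
    """Extract: turn JSON-like records into a raw dataframe."""
    return pd.DataFrame(records)


def transform_artists(raw_df):
    """Transform: flatten nested fields and keep one row per artist."""
    clean = pd.DataFrame({
        "artist_id": raw_df["id"],
        "artist_name": raw_df["name"].str.strip(),
        "popularity": raw_df["popularity"].astype(int),
        "followers": raw_df["followers"].map(lambda f: f["total"]),
        "genre": raw_df["genres"].map(lambda g: g[0] if g else None),
    })
    return clean.drop_duplicates(subset="artist_id")


def transform_albums(raw_df):
    """Transform: rename columns to match the schema and drop duplicates."""
    clean = raw_df.rename(columns={"id": "album_id", "name": "album_name"})
    clean["album_name"] = clean["album_name"].str.strip()
    return clean.drop_duplicates(subset="album_id")


def load(conn, artists, albums):
    """Load: create tables and a view, then append the cleaned rows."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS artist (
            artist_id TEXT PRIMARY KEY, artist_name TEXT NOT NULL,
            popularity INTEGER, followers INTEGER, genre TEXT);
        CREATE TABLE IF NOT EXISTS album (
            album_id TEXT PRIMARY KEY,
            artist_id TEXT REFERENCES artist(artist_id),
            album_name TEXT, total_tracks INTEGER);
        CREATE VIEW IF NOT EXISTS v_artist_album_counts AS
            SELECT a.artist_name, COUNT(b.album_id) AS album_count
            FROM artist a LEFT JOIN album b ON b.artist_id = a.artist_id
            GROUP BY a.artist_id, a.artist_name;
    """)
    artists.to_sql("artist", conn, if_exists="append", index=False)
    albums.to_sql("album", conn, if_exists="append", index=False)
    conn.commit()


def run_etl(conn):
    """Run extract, transform, and load against the given connection."""
    artists = transform_artists(extract(RAW_ARTISTS))
    albums = transform_albums(extract(RAW_ALBUMS))
    load(conn, artists, albums)
```

Calling run_etl(sqlite3.connect(":memory:")) builds a throwaway database, which is handy for trying the flow without touching spotify.db.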

Files

sql_queries.py - Contains all SQL queries used in this project.
create_database.py - Builds the SQLite database along with all tables and views. Calls and executes the queries in sql_queries.py.
etl.py - Runs the entire ETL pipeline described above.
spotify.db - The SQLite database.
visualization.ipynb - The notebook used for building the visualizations.
visualization.pdf - Presents the visualizations built in visualization.ipynb.
run.py - Runs the necessary files in order via subprocess to complete the project. It first executes create_database.py, followed by etl.py.
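
For reference, running scripts in order via subprocess can look roughly like the sketch below. It is an illustration of the approach, not the contents of run.py; the run_scripts helper and its cwd parameter are hypothetical names introduced for the example.

```python
import subprocess
import sys


def run_scripts(scripts, cwd="."):
    """Run each script in order with the current Python interpreter.

    check=True raises CalledProcessError on a non-zero exit code,
    so a failing script stops the ones after it from running.
    """
    completed = []
    for script in scripts:
        subprocess.run([sys.executable, str(script)], cwd=cwd, check=True)
        completed.append(script)
    return completed


# Example call (assumes both files exist in the working directory):
# run_scripts(["create_database.py", "etl.py"])
```

Using sys.executable instead of a bare "python" keeps the child scripts on the same interpreter and virtual environment as the caller.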

How to Run

$ python run.py

or

$ python create_database.py
$ python etl.py

Ways to improve

Create separate functions for more comprehensive validation checks.
Verify and enforce the schema in case the Spotify API changes, and alert when it does.
Create staging tables for storing raw, unprocessed data.
Include RAW_JSON columns to give analysts/scientists the option to parse the data themselves.
Include LOAD_ID or LOAD_TS columns.
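
Several of these ideas could be combined along the lines of the sketch below. None of this exists in the project yet: the artist_staging table, the EXPECTED_ARTIST_FIELDS mapping, and both function names are assumptions made up for illustration, and the field list is only a guess at what a schema check might cover.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema: the fields and types the pipeline would rely on.
EXPECTED_ARTIST_FIELDS = {"id": str, "name": str, "popularity": int}


def validate_record(record, expected=EXPECTED_ARTIST_FIELDS):
    """Return a list of schema problems; an empty list means the record is fine."""
    problems = []
    for field, field_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], field_type):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return problems


def load_raw(conn, records):
    """Validate records, then stage the raw JSON with LOAD_ID and LOAD_TS."""
    for record in records:
        problems = validate_record(record)
        if problems:
            # Raising acts as the alert here; a real pipeline might log or notify.
            raise ValueError(f"schema check failed: {problems}")

    load_id = uuid.uuid4().hex
    load_ts = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """CREATE TABLE IF NOT EXISTS artist_staging (
               raw_json TEXT NOT NULL,
               load_id  TEXT NOT NULL,
               load_ts  TEXT NOT NULL)"""
    )
    conn.executemany(
        "INSERT INTO artist_staging (raw_json, load_id, load_ts) VALUES (?, ?, ?)",
        [(json.dumps(record), load_id, load_ts) for record in records],
    )
    conn.commit()
    return load_id
```

Validating every record before inserting anything means a schema change fails the whole batch, which keeps partial loads out of the staging table.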

MilesMartinez/spotify_data_engineer_project
