This repo provides utilities for data ingestion and database management.
By decoupling data processing/ingestion from the backend repo:
- the backend code can be deployed by anyone, anywhere, without major code modifications or forks
- the complexity of managing the data available on different instances is reduced
A project is a way to group data together. Usually there is one project per city, e.g. "po8klasie warszawa", "po8klasie gdynia".
A data source is a single atomic unit of data ingestion.
An environment is a single deployment/instance of the backend app/db.
The chosen environment impacts:
- db credentials (different for each environment)
- the intermediate files' directory (each environment has a separate directory; this can be overridden)
- data source configs (different data processing params can be specified per environment)
Data processing can thus be parametrized to behave differently based on the environment.
You provide the environment-name-to-config-module mapping when invoking DataManager.
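For example, a mapping with two environments might look like this (the second entry and its module path are hypothetical; only 'local' appears in the example script later in this README):

config_modules = {
    'local': 'po8klasie_data_management_example.config.local',
    # hypothetical second environment; module path is made up
    'production': 'po8klasie_data_management_example.config.production',
}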
An environment config contains:
- db connection string
- project configs
- data sources' configs
- intermediate files path
Sometimes files are too large to, e.g., query an API and put them directly into the db. Intermediate files are introduced as a kind of cache layer between pulling data from a source and putting it into the db. Having this kind of local cache also comes in handy when playing with the backend locally.
I'm thinking about using S3 or GCP buckets instead of the local file system for storing intermediate files, but for now IMHO there's no need for that.
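To illustrate, with INTERMEDIATE_FILES_DIR set as in the example config below, the cache might look something like this (the per-data-source layout shown here is purely hypothetical):

data/intermediate_files/local/
    rspo/                         # files pulled from the RSPO API
    osm_public_transport_info/    # files pulled from OSM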
Ultimately, data ingestion for each data source should look like this:
- creating intermediate files (pulling data from an API/elsewhere)
- creating records (from intermediate files)
Data sources which rely on rarely changing inputs (spreadsheets processed by pandas etc.) or manually edited data may skip the first step. (You should still commit the processed files to the repo.)
Intermediate files should not be committed to the data-manager repo.
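For illustration only, a custom data source following this two-step shape could look roughly like the sketch below. The data source API is unstable (see the warning in the example script), so the class shape, method names, and signatures here are assumptions, not the actual interface:

# Hypothetical sketch of a custom data source - the real base class and
# method signatures in po8klasie_data_sources may differ (unstable API).
import json
import urllib.request
from pathlib import Path


class MyCityDataSource:
    # the id is assumed here; uppercased, it would select MY_CITY_DATA_SOURCE_CONFIG
    id = "my_city"

    def create_intermediate_files(self, intermediate_files_dir: Path, config: dict):
        # step 1: pull data from an API and cache it as an intermediate file
        with urllib.request.urlopen("https://example.com/api/schools") as resp:
            data = json.load(resp)
        (intermediate_files_dir / "schools.json").write_text(json.dumps(data))

    def create_records(self, intermediate_files_dir: Path, session, config: dict):
        # step 2: turn the cached file into db records
        data = json.loads((intermediate_files_dir / "schools.json").read_text())
        for school in data:
            ...  # map school to an ORM model and add it via the db session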
Using poetry (recommended):
poetry add git+https://github.com/po8klasie/po8klasie-data-sources
Using pip:
pip install git+https://github.com/po8klasie/po8klasie-data-sources
Note: As the project is at an early stage, we do not follow semver.
#!/usr/bin/env python3
from po8klasie_data_sources.lib.data_manager import DataManager
from po8klasie_data_sources.data_sources.rspo.data_source import RspoDataSource
from po8klasie_data_sources.data_sources.osm_public_transport_info.data_source import OSMPublicTransportInfoDataSource

data_sources = [
    # you can add here data sources imported from this repo
    # or provide custom ones (warning: unstable API)
    RspoDataSource,
    OSMPublicTransportInfoDataSource
]

config_modules = {
    # environment to config module mapping
    'local': 'po8klasie_data_management_example.config.local'
}

data_manager = DataManager(
    data_sources=data_sources,
    config_modules=config_modules
)

if __name__ == "__main__":
    data_manager.init_cli(__file__)
The idea of a config module is similar to Django's settings.py file.
A config module must include the following variables:
- DATABASE_URL - database connection string
- INTERMEDIATE_FILES_DIR - intermediate files dir path, relative to the data-manager.py file
- PROJECT_CONFIGS - project configs, structured like below
- <uppercased data source id>_DATA_SOURCE_CONFIG - one variable for each data source (always a dict); this dict may include a disable property (bool) to disable the data source
NOTE: You can use env vars here. They'll be loaded based on the selected environment.
Example:
import os

DATABASE_URL = os.environ.get("DATABASE_URL")

INTERMEDIATE_FILES_DIR = "./data/intermediate_files/local"

PROJECT_CONFIGS = [
    {
        'project_id': 'gdynia',
        'project_name': 'Gdynia'
    },
]

RSPO_DATA_SOURCE_CONFIG = {
    'borough_names_per_project': {
        'gdynia': ['gdynia']
    }
}

OSM_PUBLIC_TRANSPORT_INFO_DATA_SOURCE_CONFIG = {
    'stop_distance_from_institution': 250
}
.env.local file:

DATABASE_URL=postgresql://postgres:postgres@localhost:5432/postgres
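Parametrizing per environment then just means providing another config module. A hypothetical production counterpart (every value here is an assumption, including the .env.production naming) could look like:

import os

DATABASE_URL = os.environ.get("DATABASE_URL")  # would come from a .env.production file in this sketch

INTERMEDIATE_FILES_DIR = "./data/intermediate_files/production"  # separate dir per environment

PROJECT_CONFIGS = [
    {'project_id': 'gdynia', 'project_name': 'Gdynia'},
]

RSPO_DATA_SOURCE_CONFIG = {
    'borough_names_per_project': {'gdynia': ['gdynia']}
}

OSM_PUBLIC_TRANSPORT_INFO_DATA_SOURCE_CONFIG = {
    'stop_distance_from_institution': 250  # params like this could differ per environment
}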
Run CLI commands using poetry run ./data-manager.py ...
Order of running commands:
0. (drop_db)
1. create_db_schema
2. init_projects
3. create_intermediate_files
4. create_records
The order of data sources used to create intermediate files/records also matters. For now, it's not possible to recreate records from a single data source. I don't know if we'll want that anytime soon, given the relatively short time it takes to regenerate the whole db and the use of intermediate files.
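For example, a full local run following that order would be (-e local is the default, so the flag could be omitted):

poetry run ./data-manager.py drop_db -e local
poetry run ./data-manager.py create_db_schema -e local
poetry run ./data-manager.py init_projects -e local
poetry run ./data-manager.py create_intermediate_files __all__ -e local
poetry run ./data-manager.py create_records __all__ -e local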
drop_db

Drop the whole database.

Flags:
- -e <environment>, --environment <environment> (local by default)
create_db_schema

Create the database schema.

Flags:
- -e <environment>, --environment <environment> (local by default)
init_projects

Initialize projects. You need to init projects as the first step after creating the db schema.

Flags:
- -e <environment>, --environment <environment> (local by default)
create_intermediate_files

Create intermediate files.
You can provide multiple data sources (their ids) separated by whitespace.
You can also pass __all__ as the data source to create intermediate files for all data sources.
TODO: Allow tagging data sources with custom labels and running create_intermediate_files __<custom label>__ to only create files for data sources with a given label.

Flags:
- -e <environment>, --environment <environment> (local by default)
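For example (the ids rspo and osm_public_transport_info are inferred from the <uppercased data source id>_DATA_SOURCE_CONFIG naming above):

poetry run ./data-manager.py create_intermediate_files rspo osm_public_transport_info
poetry run ./data-manager.py create_intermediate_files __all__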
create_records

Create records from intermediate files.
You can provide multiple data sources (their ids) separated by whitespace.
You can also pass __all__ as the data source to create records for all data sources.
TODO: Allow tagging data sources with custom labels and running create_records __<custom label>__ to only create records for data sources with a given label.

Flags:
- -e <environment>, --environment <environment> (local by default)
- --override-intermediate-files-dir <environment> - create records based on the intermediate files directory of a different environment
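For example (the production environment name here is hypothetical):

poetry run ./data-manager.py create_records __all__ --override-intermediate-files-dir production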
Alias for:
- drop_db
- create_db_schema
- init_projects
- create_records __all__

This command passes the --environment flag to all commands it invokes. The --override-intermediate-files-dir flag is passed to the create_records command.
Beware: API is not stable.
Formatter: poetry run black .
Linter: poetry run flake8
Tests: TODO