Define a function called `run_pipeline()` that runs the ETL pipeline. This function can then be called from an `if __name__ == "__main__":` block or by a Lambda function.
Extract functions should fetch the needed data and return a pandas DataFrame.
Transform functions should be pure functions that take a pandas DataFrame as an argument and return a DataFrame. Ideally, transform functions are composable with pandas `DataFrame.pipe` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pipe.html).
Load functions should take DataFrames and write them to their intended destinations.
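A minimal sketch of this layout is shown below. The source path, destination path, column names, and transform names are all hypothetical placeholders, not part of this library:

```python
import pandas as pd


def extract() -> pd.DataFrame:
    # Illustrative only: pull the needed data from its source into a DataFrame.
    return pd.read_csv("s3://example-bucket/raw/orders.csv")


def add_totals(df: pd.DataFrame) -> pd.DataFrame:
    # Pure transform: takes a DataFrame, returns a new DataFrame.
    return df.assign(total=df["quantity"] * df["unit_price"])


def drop_cancelled(df: pd.DataFrame) -> pd.DataFrame:
    # Another pure transform, chainable via DataFrame.pipe.
    return df[df["status"] != "cancelled"]


def load(df: pd.DataFrame) -> None:
    # Illustrative only: write the result to its intended destination.
    df.to_parquet("s3://example-bucket/processed/orders.parquet")


def run_pipeline() -> None:
    df = extract()
    df = df.pipe(add_totals).pipe(drop_cancelled)
    load(df)


def handler(event, context):
    # Example AWS Lambda entry point that simply delegates to run_pipeline().
    run_pipeline()


if __name__ == "__main__":
    run_pipeline()
```

Keeping transforms pure makes them easy to chain with `DataFrame.pipe` and to unit test in isolation.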
We support two ways of providing secrets to ETL pipelines: through environment variables, or through AWS Secrets Manager.
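For example (the environment variable name, secret name, and JSON layout below are placeholders, not part of this library):

```python
import json
import os

import boto3


def get_database_password() -> str:
    # Option 1: read the secret from an environment variable.
    password = os.environ.get("DB_PASSWORD")
    if password:
        return password

    # Option 2: fall back to AWS Secrets Manager.
    # The secret name and the assumption that it stores JSON are illustrative.
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId="my-etl/db-credentials")
    return json.loads(response["SecretString"])["password"]
```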
It is important to log the execution of our data pipelines. To log directly to CloudWatch, get a logger from the `get_logger` function in `etl.logging`; it supports standard Python logging.
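For example, assuming `get_logger` accepts a logger name (check `etl.logging` for the exact signature):

```python
from etl.logging import get_logger

# Assumption: get_logger takes a name; adjust to the actual signature in etl.logging.
logger = get_logger("orders_pipeline")


def run_pipeline() -> None:
    logger.info("Starting pipeline run")
    # ... extract, transform, load ...
    logger.info("Pipeline run finished")
```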
Scheduling, whether by crontab or other means, should be clearly documented in the ETL pipeline.
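One way to do this (purely illustrative, with a hypothetical module name and schedule) is to record the schedule in the pipeline's module docstring:

```python
"""Daily orders ETL pipeline.

Schedule: runs every day at 02:00 UTC via crontab:
    0 2 * * * /usr/bin/python3 -m my_etl.orders_pipeline
"""
```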
- Update the requirements in the `setup.py` file.
- Update the version in `setup.py` in a separate commit.
- Build the dist tar.gz file and then publish the artifact to PyPI:

  ```
  python setup.py sdist
  python3 -m twine upload dist/*
  ```

- Tag the new release on GitHub:
  - Visit releases.
  - Draft a new release (keep the format the same as in `setup.py`, e.g. v0.0.4).
  - Submit the new release.