GitHub - scikit-learn-contrib/skdag: A more flexible alternative to scikit-learn Pipelines

skdag - A more flexible alternative to scikit-learn Pipelines

scikit-dag (skdag) is an open-source, MIT-licensed library that provides advanced workflow management for any machine learning operations that follow scikit-learn conventions. Installation is simple:

pip install skdag

It works by introducing Directed Acyclic Graphs (DAGs) as a drop-in replacement for the traditional scikit-learn Pipeline. This gives you a simple interface for a range of use cases, including complex pre-processing, model stacking and benchmarking.

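from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from skdag import DAGBuilder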

dag = (
   DAGBuilder(infer_dataframe=True)
   .add_step("impute", SimpleImputer())
   .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
   .add_step(
      "blood",
      PCA(n_components=2, random_state=0),
      deps={"impute": ["s1", "s2", "s3", "s4", "s5", "s6"]}
   )
   .add_step(
      "rf",
      RandomForestRegressor(max_depth=5, random_state=0),
      deps=["blood", "vitals"]
   )
   .add_step("svm", SVR(C=0.7), deps=["blood", "vitals"])
   .add_step(
      "knn",
      KNeighborsRegressor(n_neighbors=5),
      deps=["blood", "vitals"]
   )
   .add_step("meta", LinearRegression(), deps=["rf", "svm", "knn"])
   .make_dag(n_jobs=2)  # n_jobs > 1 lets independent steps run in parallel
)

dag.show(detailed=True)

The above DAG imputes missing values, runs PCA on the columns relating to blood test results and leaves the other columns as they are. The results are then passed to three different regressors, whose predictions are fed into a final meta-estimator. Because DAGs (unlike pipelines) allow predictors in the middle of a workflow, you can use them to implement model stacking. We also chose to run the DAG steps in parallel wherever possible.
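To see which steps of this graph are independent of each other, and could therefore run in parallel, here is a minimal sketch using Python's standard-library graphlib. It is an illustration only, not skdag's own scheduler: the deps dictionary is a hand-written copy of the dependencies declared above, and execution_levels is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Step -> the steps it depends on, copied by hand from the DAG above.
deps = {
    "impute": [],
    "vitals": ["impute"],
    "blood": ["impute"],
    "rf": ["blood", "vitals"],
    "svm": ["blood", "vitals"],
    "knn": ["blood", "vitals"],
    "meta": ["rf", "svm", "knn"],
}


def execution_levels(graph):
    """Group steps into levels; steps within a level do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = execution_levels(deps)
for number, level in enumerate(levels):
    print(number, level)
```

Each level only starts once the previous one has finished, so the two pre-processing branches form one parallel group and the three regressors form another, with the meta-estimator running last.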

After building our DAG, we can treat it as any other estimator:

from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size=0.2, random_state=0
)

dag.fit(X_train, y_train)
dag.predict(X_test)

Just like a pipeline, you can optimise it with a grid search, pickle it, and so on.
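As a reminder of what that pattern looks like, here is a minimal sketch of a grid search followed by a pickle round trip. It deliberately uses a plain scikit-learn Pipeline as a stand-in, so every call shown is standard scikit-learn. Whether a DAG exposes its step parameters under the same step__param naming is an assumption here and should be checked against the skdag documentation.

```python
import pickle

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-in estimator: a plain Pipeline, not a skdag DAG.
pipe = Pipeline([("impute", SimpleImputer()), ("svm", SVR())])

# Parameters of a step are addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"svm__C": [0.1, 0.7, 10.0]}, cv=3)
search.fit(X_train, y_train)

# Round-trip the refitted best estimator through pickle.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
same_predictions = np.allclose(
    restored.predict(X_test), search.best_estimator_.predict(X_test)
)
print(search.best_params_, same_predictions)
```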

Note that this package does not deal with things like delayed dependencies and distributed architectures - consider an established solution for such use cases. skdag is just for building and executing local ensembles from estimators.

Read on to learn more about skdag...
