Datalad integration concept #75

Open
bpoldrack opened this issue Sep 14, 2022 · 0 comments
Labels: concept, documentation, enhancement, question

Comments

bpoldrack commented Sep 14, 2022

This is a write-up of what @FeHoff and I have figured out so far.

The core of it is, of course, to wrap a junifer run call in a datalad run call that captures the results in a dataset. That would produce a re-executable (via datalad rerun) record in the commit message that looks like this:

commit 50199fc2c6cff20882c42a6f1d5ef5ecd5e44092 (HEAD -> master)
Author: Felix Hoffstaedter <[email protected]>
Date:   Wed Sep 14 14:42:20 2022 +0200

    [DATALAD RUNCMD] test ukb -element
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "junifer run ukb_gmd_mean.yaml --element \"1024353,2\"",
     "dsid": "d1fe22b6-1310-4ff0-a51e-e905c63b79aa",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

The YAML file would be saved to the result dataset before that execution; hence, the record refers to a specific version of that YAML file.
The input data would be added as a subdataset (also before execution, obviously).
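
A minimal sketch of that preparation and the wrapped call, using DataLad's Python API (the result-dataset path, the input URL, and the commit messages are placeholders, not decided yet):

import datalad.api as dl

# Result dataset that captures the YAML, the input reference, and the run record
# (assumed to exist already; whether junifer should create it is an open question below).
res = dl.Dataset("path/to/some_result")

# Commit the pipeline spec first, so the run record refers to a fixed version of it.
res.save(path="ukb_gmd_mean.yaml", message="Add junifer pipeline spec")

# Register the input data as a subdataset (URL is a placeholder).
res.clone(source="https://example.org/input-dataset", path="input")

# Wrap the actual junifer call; this produces the [DATALAD RUNCMD] commit shown above.
res.run('junifer run ukb_gmd_mean.yaml --element "1024353,2"',
        message="test ukb -element")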

For any input dataset there could then be a "central" superdataset that would collect such result datasets as subdatasets (could be done via pull requests), hence enabling discovery of what is already available. That would be an optional step, though.

Such a superdataset could also become the actual entry point for junifer users:

clone that superdataset, add a new subdataset for the results, put the YAML in it, and clone the input dataset into it. Then datalad run "junifer run" from within the result dataset and finally datalad save in the superdataset.
This would allow for two things: discovery of what is already available from the superdataset becomes a local operation, and the final state of things can directly be turned into a PR to update the superdataset with the new results.
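
Spelled out with the DataLad Python API, that user-facing sequence could look roughly like this (URLs and names are placeholders):

import shutil
import datalad.api as dl

# 1. Clone the "central" superdataset.
sup = dl.clone(source="https://example.org/junifer-super", path="super")

# 2. Create a new subdataset for the results inside it.
res = sup.create("some_result")

# 3. Put the YAML in it and save, then register the input dataset as a subdataset.
shutil.copy("my-junifer-pipe.yml", res.path)
res.save(path="my-junifer-pipe.yml", message="Add pipeline spec")
res.clone(source="https://example.org/input-dataset", path="input")

# 4. datalad run "junifer run" from within the result dataset.
res.run("junifer run my-junifer-pipe.yml", message="Run junifer pipeline")

# 5. datalad save in the superdataset; this recorded state is what could become
#    a PR updating the superdataset with the new results.
sup.save(path="some_result", message="Add some_result")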

The datalad run execution can be hidden in a dedicated junifer command (or an option to junifer run). Say, junifer run --datalad would then internally call datalad run "junifer run".
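
Internally, that could boil down to something as simple as the following (the helper name and its arguments are made up for illustration; nothing like it exists yet):

import datalad.api as dl

def run_with_datalad(yaml_file, element, dataset_path="."):
    # Hypothetical wrapper behind `junifer run --datalad`:
    # run the plain junifer call under datalad run inside the result dataset.
    ds = dl.Dataset(dataset_path)
    cmd = f'junifer run {yaml_file} --element "{element}"'
    return ds.run(cmd, message=f"junifer run {yaml_file} --element {element}")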

junifer queue would need a RIA store to be given. It would then push the prepared dataset hierarchy into that store and submit a job that clones the entire thing from the store, switches to a job-specific branch, runs junifer run --datalad, and pushes back to the store afterwards. A merge of the job branches would be needed when all those branches are back in the store (see also the FAIRly big workflow).
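
A rough sketch of that queue flow, loosely following the FAIRly big setup (store location, sibling and branch names are assumptions):

import subprocess
import datalad.api as dl

res = dl.Dataset("path/to/some_result")

# Before submission: register a RIA store as a sibling and push the prepared hierarchy.
res.create_sibling_ria("ria+file:///data/project/store", name="store", new_store_ok=True)
res.push(to="store")

# Inside each compute job: clone the dataset from the store, switch to a
# job-specific branch, run junifer under datalad run, and push back to the store.
job_ds = dl.clone(source=f"ria+file:///data/project/store#{res.id}", path="job_ws")
subprocess.run(
    ["git", "-C", job_ds.path, "checkout", "-b", "job-1024353-2"], check=True
)
job_ds.run('junifer run ukb_gmd_mean.yaml --element "1024353,2"',
           message="job: element 1024353,2")
job_ds.push(to="origin")  # "origin" points at the store the job was cloned from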

This would imply minor changes to the DataladDataGrabber (a rough sketch follows the list):

  • __enter__ would need to clone the input dataset into the result dataset (datalad clone -d path/to/result/dataset URL DEST or datalad.api.Dataset("path/to/result/dataset").clone(URL, DEST))
  • __exit__ would then just uninstall the input dataset (we don't want to change/delete the reference, but simply not have it locally anymore)
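
Only the pieces relevant here are sketched below; constructor arguments and attribute names are made up and will differ from the actual DataladDataGrabber:

import datalad.api as dl

class DataladDataGrabber:
    # Sketch of the proposed __enter__/__exit__ behavior only.

    def __init__(self, uri, result_dataset, install_path="input"):
        self.uri = uri                                   # URL of the input dataset
        self.result_dataset = dl.Dataset(result_dataset)
        self.install_path = install_path                 # destination inside the result dataset

    def __enter__(self):
        # Clone the input dataset *into* the result dataset, registering it as a
        # subdataset (equivalent to: datalad clone -d path/to/result/dataset URL DEST).
        self.dataset = self.result_dataset.clone(source=self.uri, path=self.install_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Uninstall the input subdataset again: the reference stays recorded in the
        # result dataset, but the data is no longer present locally.
        dl.uninstall(path=self.install_path, dataset=self.result_dataset)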

So, the entire structure would look something like this:

super
├── other_result
│   ├── db
│   │   ├── element1.db
│   │   └── element2.db
│   ├── input
│   └── other-junifer-pipe.yml
└── some_result
    ├── db
    │   ├── element1.db
    │   └── element2.db
    ├── input
    └── my-junifer-pipe.yml

Here, super, some_result, other_result, and the inputs are datasets. Note that the inputs are supposed to be the same dataset. Reminder: this is not a duplication; a subdataset is per se only a reference. But this also means that every result dataset is self-contained - super is only there for discovery.

There's a bunch of questions to decide on, though:

  • Does junifer run --datalad expect the result dataset (and possibly its super) to be there already, or should it take care of creating it itself?
  • All of this would require workdir, datadir, the storage's uri, etc. to point into that result dataset. Which piece of code (if any) would be best suited to make sure of that? To what extent would junifer want to enforce such things?
  • There's a general issue with reproducibility (by means of datalad rerun), and that is absolute paths in the YAML. I suppose the defined workdir is largely there to determine the working directory of a compute job in junifer queue. It seems to me that it should then rather be an option to junifer queue than a (committed) entry in the YAML. datadir and the storage uri need to be relative to the result dataset's root.
  • Should junifer rely on a superdataset referencing all result subdatasets? As in: it must already exist and hence be cloned before creating the result dataset in it? Or should that be optional? Requiring it would imply junifer cannot run in this datalad fashion until someone has made such a superdataset available, while without datalad such a thing would be possible. I think it should be flexible in that regard, but that means thinking about how to specify both cases.
  • What happens after jobs have finished? Should there be a fetch (from the store the compute jobs pushed to) and merge helper/command? Merging can be a bit messy at scale (a couple of thousand branches); a rough sketch of what such a helper could do follows below.
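
For the last point, a very rough idea of what such a fetch-and-merge helper could do (remote name, branch prefix, and batch size are arbitrary):

import subprocess

def merge_job_branches(ds_path, remote="store", prefix="job-", batch=500):
    # Rough sketch of a helper that merges finished compute-job branches.
    def git(*args):
        return subprocess.run(["git", "-C", ds_path, *args],
                              check=True, capture_output=True, text=True)

    # Fetch everything the compute jobs pushed back to the store.
    git("fetch", remote)

    # Collect the job branches from that remote ...
    branches = git("for-each-ref", "--format=%(refname:short)",
                   f"refs/remotes/{remote}/{prefix}*").stdout.split()

    # ... and merge them in batches (octopus merges), since merging a couple of
    # thousand branches in one go is unwieldy.
    for i in range(0, len(branches), batch):
        git("merge", "-m", "Merge compute job results", *branches[i:i + batch])
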
@synchon added the enhancement label Sep 14, 2022
@fraimondo added the documentation, question, and concept labels Mar 28, 2023