Datalad integration concept #75

Open
bpoldrack opened this issue Sep 14, 2022 · 0 comments
Labels: concept, documentation, enhancement, question

Comments

bpoldrack commented Sep 14, 2022

This is a write-up of what @FeHoff and I have figured out so far.

The core of it is, of course, to wrap a junifer run call in a datalad run call that captures the results in a dataset. That would produce a re-executable (via datalad rerun) record in the commit message that looks like this:

commit 50199fc2c6cff20882c42a6f1d5ef5ecd5e44092 (HEAD -> master)
Author: Felix Hoffstaedter <[email protected]>
Date:   Wed Sep 14 14:42:20 2022 +0200

    [DATALAD RUNCMD] test ukb -element
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "junifer run ukb_gmd_mean.yaml --element \"1024353,2\"",
     "dsid": "d1fe22b6-1310-4ff0-a51e-e905c63b79aa",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [],
     "outputs": [],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

The YAML file would be saved to the result dataset before that execution; hence, the record refers to a specific version of that YAML file.
The input data would be added as a subdataset (also before execution, obviously).
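
A minimal sketch of that preparation and the wrapped call, using DataLad's Python API (the result-dataset path, the input URL, and the commit messages are placeholders, not decided yet):

import datalad.api as dl

# Result dataset that captures the YAML, the input reference, and the run record
# (assumed to exist already; whether junifer should create it is an open question below).
res = dl.Dataset("path/to/some_result")

# Commit the pipeline spec first, so the run record refers to a fixed version of it.
res.save(path="ukb_gmd_mean.yaml", message="Add junifer pipeline spec")

# Register the input data as a subdataset (URL is a placeholder).
res.clone(source="https://example.org/input-dataset", path="input")

# Wrap the actual junifer call; this produces the [DATALAD RUNCMD] commit shown above.
res.run('junifer run ukb_gmd_mean.yaml --element "1024353,2"',
        message="test ukb -element")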

For any input dataset there could then be a "central" superdataset that would collect such result datasets as subdatasets (could be done via pull requests), hence enabling discovery of what is already available. That would be an optional step, though.

Such a superdataset could also become the actual entry point for junifer users:

clone that superdataset, add a new subdataset for the results, put the YAML in it, and clone the input dataset into it. Then datalad run "junifer run" from within the result dataset and finally datalad save in the superdataset.
This would allow for two things: discovery of what is already available from the superdataset becomes a local operation, and the final state of things can directly be turned into a PR to update the superdataset with the new results.
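
Spelled out with the DataLad Python API, that user-facing sequence could look roughly like this (URLs and names are placeholders):

import shutil
import datalad.api as dl

# 1. Clone the "central" superdataset.
sup = dl.clone(source="https://example.org/junifer-super", path="super")

# 2. Create a new subdataset for the results inside it.
res = sup.create("some_result")

# 3. Put the YAML in it and save, then register the input dataset as a subdataset.
shutil.copy("my-junifer-pipe.yml", res.path)
res.save(path="my-junifer-pipe.yml", message="Add pipeline spec")
res.clone(source="https://example.org/input-dataset", path="input")

# 4. datalad run "junifer run" from within the result dataset.
res.run("junifer run my-junifer-pipe.yml", message="Run junifer pipeline")

# 5. datalad save in the superdataset; this recorded state is what could become
#    a PR updating the superdataset with the new results.
sup.save(path="some_result", message="Add some_result")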

The datalad run execution can be hidden in a dedicated junifer command (or an option to junifer run). Say, junifer run --datalad would then internally call datalad run "junifer run".
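
Internally, that could boil down to something as simple as the following (the helper name and its arguments are made up for illustration; nothing like it exists yet):

import datalad.api as dl

def run_with_datalad(yaml_file, element, dataset_path="."):
    # Hypothetical wrapper behind `junifer run --datalad`:
    # run the plain junifer call under datalad run inside the result dataset.
    ds = dl.Dataset(dataset_path)
    cmd = f'junifer run {yaml_file} --element "{element}"'
    return ds.run(cmd, message=f"junifer run {yaml_file} --element {element}")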

junifer queue would need a RIA store to be given. It would then push the prepared dataset hierarchy into that store and submit a job that clones the entire thing from the store, switches to a job-specific branch, runs junifer run --datalad, and pushes back to the store afterwards. A merge of the job branches would be needed when all those branches are back in the store (see also the FAIRly big workflow).
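
A rough sketch of that queue flow, loosely following the FAIRly big setup (store location, sibling and branch names are assumptions):

import subprocess
import datalad.api as dl

res = dl.Dataset("path/to/some_result")

# Before submission: register a RIA store as a sibling and push the prepared hierarchy.
res.create_sibling_ria("ria+file:///data/project/store", name="store", new_store_ok=True)
res.push(to="store")

# Inside each compute job: clone the dataset from the store, switch to a
# job-specific branch, run junifer under datalad run, and push back to the store.
job_ds = dl.clone(source=f"ria+file:///data/project/store#{res.id}", path="job_ws")
subprocess.run(
    ["git", "-C", job_ds.path, "checkout", "-b", "job-1024353-2"], check=True
)
job_ds.run('junifer run ukb_gmd_mean.yaml --element "1024353,2"',
           message="job: element 1024353,2")
job_ds.push(to="origin")  # "origin" points at the store the job was cloned from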

This would imply minor changes to the DataladDataGrabber (a rough sketch follows the list):

  • __enter__ would need to clone the input dataset into the result dataset (datalad clone -d path/to/result/dataset URL DEST or datalad.api.Dataset("path/to/result/dataset").clone(URL, DEST))
  • __exit__ would then just uninstall the input dataset (we don't want to change/delete the reference, but simply not have it locally anymore)
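
Only the pieces relevant here are sketched below; constructor arguments and attribute names are made up and will differ from the actual DataladDataGrabber:

import datalad.api as dl

class DataladDataGrabber:
    # Sketch of the proposed __enter__/__exit__ behavior only.

    def __init__(self, uri, result_dataset, install_path="input"):
        self.uri = uri                                   # URL of the input dataset
        self.result_dataset = dl.Dataset(result_dataset)
        self.install_path = install_path                 # destination inside the result dataset

    def __enter__(self):
        # Clone the input dataset *into* the result dataset, registering it as a
        # subdataset (equivalent to: datalad clone -d path/to/result/dataset URL DEST).
        self.dataset = self.result_dataset.clone(source=self.uri, path=self.install_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Uninstall the input subdataset again: the reference stays recorded in the
        # result dataset, but the data is no longer present locally.
        dl.uninstall(path=self.install_path, dataset=self.result_dataset)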

So, the entire structure would look something like this:

super
├── other_result
│   ├── db
│   │   ├── element1.db
│   │   └── element2.db
│   ├── input
│   └── other-junifer-pipe.yml
└── some_result
    ├── db
    │   ├── element1.db
    │   └── element2.db
    ├── input
    └── my-junifer-pipe.yml

Here, super, some_result, other_result, and the inputs are datasets. Note that the inputs are supposed to be the same dataset. Reminder: this is not a duplication; a subdataset is per se only a reference. But this also means that every result dataset is self-contained - super is only there for discovery.

There's a bunch of questions to decide on, though:

  • Does junifer run --datalad expect the result dataset (and possibly its super) to be there already, or should it take care of creating it itself?
  • All of this would require workdir, datadir, the storage's uri, etc. to point into that result dataset. Which piece of code (if any) would be best suited to make sure of that? To what extent would junifer want to enforce such things?
  • There's a general issue with reproducibility (by means of datalad rerun), and that is absolute paths in the YAML. I suppose the defined workdir is largely there to determine the working directory of a compute job in junifer queue. It seems to me that it should then rather be an option to junifer queue than a (committed) entry in the YAML. datadir and the storage uri need to be relative to the result dataset's root.
  • Should junifer rely on a superdataset referencing all result subdatasets? As in: it must already exist and hence be cloned before creating the result dataset in it? Or should that be optional? Requiring it would imply junifer cannot run in this datalad fashion until someone has made such a superdataset available, while without datalad such a thing would be possible. I think it should be flexible in that regard, but that means thinking about how to specify both cases.
  • What happens after jobs have finished? Should there be a fetch (from the store the compute jobs pushed to) and merge helper/command? Merging can be a bit messy at scale (a couple of thousand branches); a rough sketch of what such a helper could do follows below.
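
For the last point, a very rough idea of what such a fetch-and-merge helper could do (remote name, branch prefix, and batch size are arbitrary):

import subprocess

def merge_job_branches(ds_path, remote="store", prefix="job-", batch=500):
    # Rough sketch of a helper that merges finished compute-job branches.
    def git(*args):
        return subprocess.run(["git", "-C", ds_path, *args],
                              check=True, capture_output=True, text=True)

    # Fetch everything the compute jobs pushed back to the store.
    git("fetch", remote)

    # Collect the job branches from that remote ...
    branches = git("for-each-ref", "--format=%(refname:short)",
                   f"refs/remotes/{remote}/{prefix}*").stdout.split()

    # ... and merge them in batches (octopus merges), since merging a couple of
    # thousand branches in one go is unwieldy.
    for i in range(0, len(branches), batch):
        git("merge", "-m", "Merge compute job results", *branches[i:i + batch])
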
@synchon added the enhancement label Sep 14, 2022
@fraimondo added the documentation, question, and concept labels Mar 28, 2023