-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX/ENH/RF: Introduce configuration module and catch lingering tmpdirs #466
Conversation
This config module improves accessibility of run-level options across the various modules of the heudiconv workflow. During this "enh-factor", the legacy ``TempDirs`` class was removed, and all temporary directories were changed into ``tempfile.TemporaryDirectory``.
Codecov Report
@@ Coverage Diff @@
## master #466 +/- ##
==========================================
+ Coverage 76.13% 76.48% +0.35%
==========================================
Files 37 38 +1
Lines 2962 3024 +62
==========================================
+ Hits 2255 2313 +58
- Misses 707 711 +4
Continue to review full report at Codecov.
|
love is always much appreciated. in the future, this may also be a good codebase to move over to pydra and have the dcm2niix interface be a pydra task. |
I'll open an issue about this. EDIT: It's #467. |
Definitely, our nipype usage is extremely minimal, I think just the |
Re nipype - I thought we had a dedicated issue to just get that dependency removed in favor of straight execution of dcm2niix but I did only find even older #6 |
If there a chance to
It is a large PR to review as a whole with two aspects interleaved and not entirely clear config intent |
We definitely could do this, but this largely defeats the whole motivation behind the refactoring. Previously, there was no true run-level place where standalone functions could send information to. And as a result, we were not properly removing temporary directories generated at: Lines 52 to 69 in ca402a9
by adding
It's largely inspired from work introduced in fmriprep early this year, with the omission of writing out a file with all arguments used. Though I might be inclined to bring that in and dump them inside That said, I've only tested this via the supplied tests, so I am hesitant to push for a merge until this goes through a more rigorous trial (any testing on your side would be super helpful 😄 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made a pass, primarily over freshly introduced "config" component.
... Though I might be inclined to bring that in and dump them inside /.heudiconv. One future change I've been wanting to see is reducing the number of arguments that need to be passed across all the various functions - the config would allow setting them once and then fetching them via config.<workflow/other>..
I would be very fond of supporting dumping (if not present) and/or first loading from if already present (and overriding with overrides from cmdline). Also, not necessarily implementing while working on this PR but just making it possible to do in the future, keep in mind possible future extensions mentioned in #198 (comment) such as:
- heuristic should allow to specify some settings (such as
--bids
) - heuristic might want to have its own config settings (could be via adhoc
<heuristicname>_<configname>
or some explicitheuristics.<heuristicname>.<configname>
hierarchical structure).
With that in mind (to make it possible to have per "dataset" settings), and also as a general programming style (avoidance of singleton class with bound attributes without ability to list documentation for them etc), I think use of something like attrs
python module and making config into a proper class which could be instantiated would help to add support for "per dataset" settings. ATM heudiconv
can be converting different sessions it finds into different output directories (datasets), so having a singular instance of config would prevent/complicate any overloading from "dataset specific" configurations.
It doesn't need to be attrs
-- may be there is some python module already for configuration management, may be even with "binding" to argparse
so may be then we could avoid duplicate docstrings for cmdline and configuration variables etc.
if k in cls._paths: | ||
setattr(cls, k, Path(v).absolute()) | ||
continue | ||
if k in cls._multipaths: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't then _multipath
be also assigned at the class level? Please add comments along on their purpose and difference between (_paths
and _multipaths
) -- would become irrelevant if "semantics" are defined per each attribute as suggested below
if k in cls._multipaths: | ||
setattr(cls, k, tuple(Path(i).absolute() for i in v)) | ||
if hasattr(cls, k): | ||
setattr(cls, k, v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and if not - we just completely ignore the setting without even a warning? target use-case?
|
||
if init: | ||
try: | ||
cls.init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add empty init
to the base class then and describe its purpose. if subclass MUST define init
(seems to be not the case, may be we should just start using abc
and declare init
as @abc.abstractmethod
. Altogether - I would really like to avoid try/except here since it could mask bugs in init
code, and interface of this base _Config
should provide sufficient information about what subclasses should overload
if k in cls._paths: | ||
v = str(v) | ||
if k in cls._multipaths: | ||
v = tuple(str(i) for i in v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be some sanity check that there is no duplicate entries in _paths
and _multipaths
? -- becomes irrelevant if "semantics" as encapsulated as suggested in other comment
|
||
dicom_dir_template = None | ||
"Template to search one or more directories for DICOMs and tarballs." | ||
files = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall it would be better to avoid splitting "semantics" from the field and _multipaths
(and _paths
).
Why not to declare dummy classes NonePath
and NoneMultiPath
and use them to annotate fields , and then make _paths
and _multipath
populated in top level __init__
based on the values it finds among class level attributes? That would help to avoid bugs whenever some attribute is renamed/introduced but _*paths
is not adjusted accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, there seems to be no way to "access" these "variable docstrings":
$> cat /tmp/1.py
class bu:
at = None
"""at docstring"""
print(bu.at.__doc__)
$> python3 /tmp/1.py
None
so we wouldn't be able to provide user-friendly dump of them etc. Why not to provide some encapsulation?
NB may be even just use https://www.attrs.org/en/stable/ for this purpose (and switch away from using static single class and @classmethod
)s? If central instance is needed, it could be instantiated at the top level module. FWIW, we do smth like that for datalad: https://github.com/datalad/datalad/blob/master/datalad/__init__.py#L47 where we instantiate "global config" but still allowing per-dataset configurations which might override some settings. This provides flexibility where necessary.
return out | ||
|
||
|
||
class workflow(_Config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be make it into workflow_config
? Even though it is config.workflow
, whenever type etc would be printed, module name would not be included and thus could lead to a confusion with some real "workflow" (not just config)?
_init_seed(cls.random_seed) | ||
|
||
|
||
def from_dict(settings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this one is "workflow" specific... should then be ideally bound to the class one way or another (I know -- in current implementation it would conflict with definition of attributes/options straight at the class level - see my comment about attrs
)
try: | ||
_latest_version = res.json()['version'] | ||
except Exception: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why direct calls via requests
and not use of etelemetry
module client? by all means we want to avoid reimplementing what it does/created for.
|
||
def __init__(self): | ||
"""Avert instantiation.""" | ||
raise RuntimeError('Configuration type is not instantiable.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not to start using abc
and thus declare this as an abc.abstractmethod
? according to https://docs.python.org/3/library/abc.html#abc.abstractmethod could be applied to @classmethod
as well
if not flat: | ||
return settings | ||
|
||
return {'.'.join((section, k)): v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: seems to be not covered by the tests.
@mgxd would you have time/motivation to bring this over the finish line? |
Looking back after dropping this, I think this can probably be closed and the relevant bits extracted into a fresh PR. I will admit I'm not up to date on the expected way to use However, I do think I'll carve out some time to ensure we handle the removal of temporary directories |
Spawned while fiddling with #462.
This config module improves accessibility of arguments across the various modules of the heudiconv workflow. Since
heudiconv
started as a single, massive script, the implementation favored run-level access to all / most variables.During this "enh-factor", the legacy
TempDirs
class was removed, and most temporary directories were changed intotempfile.TemporaryDirectory
instances, with the exception being the tmpdir housing the extracted tarball files. This is tracked directly by the config class, and should catch any leaked directories from lingering past heudiconv's execution.P.S. - We are in desperate need of standardizing the coding style here - if/when this is and any other almost-ready PRs are merged, I'm inclined to run an autolinter (preferably
black
) and give the codebase some needed love.