Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX/ENH/RF: Introduce configuration module and catch lingering tmpdirs #466

Closed
wants to merge 6 commits into from

Conversation

mgxd
Copy link
Member

@mgxd mgxd commented Aug 6, 2020

Spawned while fiddling with #462.

This config module improves accessibility of arguments across the various modules of the heudiconv workflow. Since heudiconv started as a single, massive script, the implementation favored run-level access to all / most variables.

During this "enh-factor", the legacy TempDirs class was removed, and most temporary directories were changed into
tempfile.TemporaryDirectory instances, with the exception being the tmpdir housing the extracted tarball files. This is tracked directly by the config class, and should catch any leaked directories from lingering past heudiconv's execution.

P.S. - We are in desperate need of standardizing the coding style here - if/when this is and any other almost-ready PRs are merged, I'm inclined to run an autolinter (preferably black) and give the codebase some needed love.

This config module improves accessibility of run-level options across the various modules of the heudiconv workflow.
During this "enh-factor", the legacy ``TempDirs`` class was removed, and all temporary directories were changed into
``tempfile.TemporaryDirectory``.
@codecov
Copy link

codecov bot commented Aug 6, 2020

Codecov Report

Merging #466 into master will increase coverage by 0.35%.
The diff coverage is 90.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #466      +/-   ##
==========================================
+ Coverage   76.13%   76.48%   +0.35%     
==========================================
  Files          37       38       +1     
  Lines        2962     3024      +62     
==========================================
+ Hits         2255     2313      +58     
- Misses        707      711       +4     
Impacted Files Coverage Δ
heudiconv/tests/test_dicoms.py 100.00% <ø> (ø)
heudiconv/tests/test_utils.py 100.00% <ø> (ø)
heudiconv/utils.py 90.86% <ø> (+0.40%) ⬆️
heudiconv/dicoms.py 81.14% <82.05%> (ø)
heudiconv/main.py 73.45% <87.50%> (-2.74%) ⬇️
heudiconv/convert.py 84.71% <92.30%> (-0.20%) ⬇️
heudiconv/config.py 92.45% <92.45%> (ø)
heudiconv/cli/run.py 92.72% <100.00%> (+1.23%) ⬆️
heudiconv/parser.py 92.30% <100.00%> (+0.17%) ⬆️
heudiconv/tests/test_tarballs.py 100.00% <100.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d855f64...eb93cb9. Read the comment docs.

@satra
Copy link
Member

satra commented Aug 6, 2020

give the codebase some needed love.

love is always much appreciated.

in the future, this may also be a good codebase to move over to pydra and have the dcm2niix interface be a pydra task.

@tsalo
Copy link
Member

tsalo commented Aug 6, 2020

P.S. - We are in desperate need of standardizing the coding style here - if/when this is and any other almost-ready PRs are merged, I'm inclined to run an autolinter (preferably black) and give the codebase some needed love.

I'll open an issue about this.

EDIT: It's #467.

@mgxd
Copy link
Member Author

mgxd commented Aug 6, 2020

in the future, this may also be a good codebase to move over to pydra and have the dcm2niix interface be a pydra task.

Definitely, our nipype usage is extremely minimal, I think just the dcm2niix wrapper and provenance tracking.

@yarikoptic
Copy link
Member

Re nipype - I thought we had a dedicated issue to just get that dependency removed in favor of straight execution of dcm2niix but I did only find even older #6

@yarikoptic
Copy link
Member

If there a chance to

  • get tempdir handling in a separate PR?
  • get some description of what config module is about (ie explained in PR description)?

It is a large PR to review as a whole with two aspects interleaved and not entirely clear config intent

@mgxd
Copy link
Member Author

mgxd commented Aug 7, 2020

get tempdir handling in a separate PR?

We definitely could do this, but this largely defeats the whole motivation behind the refactoring. Previously, there was no true run-level place where standalone functions could send information to. And as a result, we were not properly removing temporary directories generated at:

def get_extracted_dicoms(fl):
"""Given a list of files, possibly extract some from tarballs
For 'classical' heudiconv, if multiple tarballs are provided, they correspond
to different sessions, so here we would group into sessions and return
pairs `sessionid`, `files` with `sessionid` being None if no "sessions"
detected for that file or there was just a single tarball in the list
"""
# TODO: bring check back?
# if any(not tarfile.is_tarfile(i) for i in fl):
# raise ValueError("some but not all input files are tar files")
# tarfiles already know what they contain, and often the filenames
# are unique, or at least in a unqiue subdir per session
# strategy: extract everything in a temp dir and assemble a list
# of all files in all tarballs
# cannot use TempDirs since will trigger cleanup with __del__
tmpdir = mkdtemp(prefix='heudiconvDCM')

by adding heudiconv.config._tmpdirs, we can run some basic cleanup (similar to what TempDirs.__del__ did). LMKWYT

get some description of what config module is about (ie explained in PR description)?

It's largely inspired from work introduced in fmriprep early this year, with the omission of writing out a file with all arguments used. Though I might be inclined to bring that in and dump them inside <outdir>/.heudiconv. One future change I've been wanting to see is reducing the number of arguments that need to be passed across all the various functions - the config would allow setting them once and then fetching them via config.<workflow/other>.<var>.

That said, I've only tested this via the supplied tests, so I am hesitant to push for a merge until this goes through a more rigorous trial (any testing on your side would be super helpful 😄 )

heudiconv/utils.py Outdated Show resolved Hide resolved
heudiconv/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made a pass, primarily over freshly introduced "config" component.

... Though I might be inclined to bring that in and dump them inside /.heudiconv. One future change I've been wanting to see is reducing the number of arguments that need to be passed across all the various functions - the config would allow setting them once and then fetching them via config.<workflow/other>..

I would be very fond of supporting dumping (if not present) and/or first loading from if already present (and overriding with overrides from cmdline). Also, not necessarily implementing while working on this PR but just making it possible to do in the future, keep in mind possible future extensions mentioned in #198 (comment) such as:

  • heuristic should allow to specify some settings (such as --bids)
  • heuristic might want to have its own config settings (could be via adhoc <heuristicname>_<configname> or some explicit heuristics.<heuristicname>.<configname> hierarchical structure).

With that in mind (to make it possible to have per "dataset" settings), and also as a general programming style (avoidance of singleton class with bound attributes without ability to list documentation for them etc), I think use of something like attrs python module and making config into a proper class which could be instantiated would help to add support for "per dataset" settings. ATM heudiconv can be converting different sessions it finds into different output directories (datasets), so having a singular instance of config would prevent/complicate any overloading from "dataset specific" configurations.

It doesn't need to be attrs -- may be there is some python module already for configuration management, may be even with "binding" to argparse so may be then we could avoid duplicate docstrings for cmdline and configuration variables etc.

if k in cls._paths:
setattr(cls, k, Path(v).absolute())
continue
if k in cls._multipaths:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't then _multipath be also assigned at the class level? Please add comments along on their purpose and difference between (_paths and _multipaths) -- would become irrelevant if "semantics" are defined per each attribute as suggested below

if k in cls._multipaths:
setattr(cls, k, tuple(Path(i).absolute() for i in v))
if hasattr(cls, k):
setattr(cls, k, v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and if not - we just completely ignore the setting without even a warning? target use-case?


if init:
try:
cls.init()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add empty init to the base class then and describe its purpose. if subclass MUST define init (seems to be not the case, may be we should just start using abc and declare init as @abc.abstractmethod. Altogether - I would really like to avoid try/except here since it could mask bugs in init code, and interface of this base _Config should provide sufficient information about what subclasses should overload

if k in cls._paths:
v = str(v)
if k in cls._multipaths:
v = tuple(str(i) for i in v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there be some sanity check that there is no duplicate entries in _paths and _multipaths? -- becomes irrelevant if "semantics" as encapsulated as suggested in other comment


dicom_dir_template = None
"Template to search one or more directories for DICOMs and tarballs."
files = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall it would be better to avoid splitting "semantics" from the field and _multipaths (and _paths).
Why not to declare dummy classes NonePath and NoneMultiPath and use them to annotate fields , and then make _paths and _multipath populated in top level __init__ based on the values it finds among class level attributes? That would help to avoid bugs whenever some attribute is renamed/introduced but _*paths is not adjusted accordingly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, there seems to be no way to "access" these "variable docstrings":

$> cat /tmp/1.py
class bu:
   at = None
   """at docstring"""

print(bu.at.__doc__)

$> python3 /tmp/1.py
None

so we wouldn't be able to provide user-friendly dump of them etc. Why not to provide some encapsulation?

NB may be even just use https://www.attrs.org/en/stable/ for this purpose (and switch away from using static single class and @classmethod)s? If central instance is needed, it could be instantiated at the top level module. FWIW, we do smth like that for datalad: https://github.com/datalad/datalad/blob/master/datalad/__init__.py#L47 where we instantiate "global config" but still allowing per-dataset configurations which might override some settings. This provides flexibility where necessary.

return out


class workflow(_Config):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be make it into workflow_config? Even though it is config.workflow, whenever type etc would be printed, module name would not be included and thus could lead to a confusion with some real "workflow" (not just config)?

_init_seed(cls.random_seed)


def from_dict(settings):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this one is "workflow" specific... should then be ideally bound to the class one way or another (I know -- in current implementation it would conflict with definition of attributes/options straight at the class level - see my comment about attrs)

try:
_latest_version = res.json()['version']
except Exception:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why direct calls via requests and not use of etelemetry module client? by all means we want to avoid reimplementing what it does/created for.


def __init__(self):
"""Avert instantiation."""
raise RuntimeError('Configuration type is not instantiable.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not to start using abc and thus declare this as an abc.abstractmethod? according to https://docs.python.org/3/library/abc.html#abc.abstractmethod could be applied to @classmethod as well

if not flat:
return settings

return {'.'.join((section, k)): v
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: seems to be not covered by the tests.

@yarikoptic
Copy link
Member

@mgxd would you have time/motivation to bring this over the finish line?

@mgxd
Copy link
Member Author

mgxd commented May 6, 2021

Looking back after dropping this, I think this can probably be closed and the relevant bits extracted into a fresh PR. I will admit I'm not up to date on the expected way to use heudiconv, so perhaps the singleton config option can be scrapped in favor of a more robust per-dataset approach. I probably won't get to this anytime soon though

However, I do think I'll carve out some time to ensure we handle the removal of temporary directories

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants