Seriesgrouping #684

Open: wants to merge 6 commits into master
Conversation

@neurorepro (Contributor) commented Jun 7, 2023

This is to address #624

@neurorepro (Contributor, Author)

You may want to consider #682 first (changes are included in this PR)

codecov bot commented Jun 7, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (e3c53b8) 81.94% compared to head (e9c9289) 81.98%.
Report is 43 commits behind head on master.

Files Patch % Lines
heudiconv/utils.py 73.68% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #684      +/-   ##
==========================================
+ Coverage   81.94%   81.98%   +0.04%     
==========================================
  Files          41       41              
  Lines        4132     4153      +21     
==========================================
+ Hits         3386     3405      +19     
- Misses        746      748       +2     


@neurorepro (Contributor, Author)

@yarikoptic As discussed in #624 I implemented a hash to shorten the name, but the full Series UID could be used too
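For illustration, shortening the Series UID to the fixed-length hex digest discussed in this thread could look like the sketch below. The helper name and the use of MD5 are assumptions for illustration, not necessarily what the PR implements:

```python
import hashlib

def seriesuid_hash(series_uid: str, length: int = 8) -> str:
    """Shorten a DICOM SeriesInstanceUID to a fixed-length hex digest.

    MD5 is an assumption here; any stable hash works, since the digest
    only disambiguates series ids and has no security role.
    """
    return hashlib.md5(series_uid.encode("ascii")).hexdigest()[:length]

# The digest would then be appended to the existing series id, e.g.
# "5-anat_T1w" -> f"5-anat_T1w-{seriesuid_hash(uid)}"
```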

@yarikoptic (Member)

In general I think the approach makes sense. Re why not before: my guess (@satra and @mih might correct me if they have any recollection of the motivations) is that the series id and name (instead of the UID) were chosen for human accessibility, since they make it easier to understand which particular series is in question while annotating .edit.txt and just debugging the operation. So far I do not see an immediate problem with adding the series UID since, AFAIK, it should indeed be the same for all files within a single acquisition. Might be worth checking, though, on more combined ones like T1PD etc.

TL;DR: we need to approach this with caution, and thus more checking/work is needed. Some reservations/concerns and thinking out loud:

  • the fact that we ran into it only now suggests that the situation is not typical, which questions changing the default behavior:
    • It seems it would apply ONLY in the case of -g all or -g accession, i.e. where we force multiple scanning sessions into a single conversion study. That would partially explain why we have not run into it, I guess. Such groupings were added only a few years back IIRC and are not the default mode of operation.
    • as you note in "Fieldmaps acquired during each part of a session splits into two not differentiated" #624 (comment), dcm2niix would still do "the right thing" and convert those incorrectly "grouped" files into different files, since it relies on the series UID. So it is just that neither a heuristic nor heudiconv would know how to tease them apart.
  • it would introduce a core change of behavior:
    • possibly making it impossible for heudiconv to operate on previously produced annotations in -c none mode for a subsequent conversion with -c dcm2niix, since all group names would change. Needs a check.
    • maybe heudiconv needs to become smart enough to detect this mode of work while reloading an already existing mapping file, and then not bother adding the UID? That might help in a transition period but is generally not needed and too cumbersome... or maybe we would just add a check while loading an existing mapping that all the series ids include that UID hash, and error out with an informative message if they don't. I think the latter is better.
    • but it also might break some heuristics which might have relied on those series ids (did any that we ship? needs to be checked).
    • it could also be made optional, either via a command line option (preferable) or automagically enabled only in -g all or -g accession (I don't like that). But then it adds differences in behavior based on CLI options, which is not good.

@neurorepro (Contributor, Author)

@yarikoptic thank you very much for the feedback

Regarding the scope of the use case: I think it is actually common to artificially split a single BIDS session into two when the scanning protocol is too long (e.g. AM: anat, resting state, ...; PM: diffusion MRI, ...) or the subject needs to exit. It is actually mentioned explicitly in the BIDS definition of session (definition number 5):

if a subject has to leave the scanner room and then be re-positioned on the scanner bed, the set of MRI acquisitions will still be considered as a session and match sessions acquired in other subjects. Similarly, in situations where different data types are obtained over several visits (for example fMRI on one day followed by DWI the day after) those can be grouped in one session.

The fact that not many people reported it, I think, is because users did not know what went wrong and would not suspect this may be due to heudiconv (who would think of blaming such a magnificent tool?)

Regarding the change of behavior, I totally agree that this should be done carefully. I think what seems to be your preferred option would be best:

may be we would just add a check while loading an existing mapping that all the series ids do include that UID hash and error out if they don't with informative message. I think the latter is better.

I will update the PR with that implemented.

If the error is too much for users (those with a justified use case for keeping previous data without the UID), then maybe we could later add a temporary option --with-deprecated-series-id?

@neurorepro (Contributor, Author) commented Jun 16, 2023

@yarikoptic I modified the code to check that the conversion table does not use deprecated series IDs. Otherwise it recomputes the conversion table with the non-deprecated series IDs.

I think in the end this may be less frustrating than an error since it identifies the deprecated IDs and recomputes the conversion table accordingly.

I also set the hash length to 8 as you suggested.

@neurorepro (Contributor, Author)

Hello @yarikoptic, does the code look ok now? Cheers!

@yarikoptic (Member) left a comment

@yarikoptic I modified the code to check that the conversion table does not use deprecated series IDs. Otherwise it recomputes the conversion table with the non-deprecated series IDs.

I worry that it is a bit too intrusive; in my comment above I thought it might be better to be more polite and not simply abolish the work people might have done already by saying "now we need another way". But OK, I guess I can swallow this; we would just announce some "major"-ish release.

But one major concern about the PR is the absence of "setting in stone" such new behaviors in tests: not a single test was changed or added, which points to the actual problem of us not bothering to test that behavior before. Could you please add/change a test so it does get duplicated seqinfo (I guess we could just copy some subfolder with dicoms into a new one, thus causing duplicates) and see it fail?

I also left some comments on how to possibly shorten the function that determines whether an id is outdated, using re -- it can become just a one-liner. But we would need to have a test either way.

"""Return True if any series_id in the info_file is in deprecated format."""
info = read_config(info_file)
series_ids = [series_id for sublist in info.values() for series_id in sublist]
if any(is_deprecated_seriesid(series_id) for series_id in series_ids):
@yarikoptic (Member):

Could be shorter

Suggested change
if any(is_deprecated_seriesid(series_id) for series_id in series_ids):
if any(map(is_deprecated_seriesid, series_ids)):

but it is ok either way.

Comment on lines +253 to +254
"Found deprecated series identifier in info file in the format"
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>"
@yarikoptic (Member):

Suggested change
"Found deprecated series identifier in info file in the format"
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>"
"Found deprecated series identifier in info file in the format "
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>. "

Comment on lines +241 to +244
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex):
return True
# If all previous tests passed, then the series_id is not deprecated
return False
@yarikoptic (Member):

Suggested change
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex):
return True
# If all previous tests passed, then the series_id is not deprecated
return False
return len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex)

@yarikoptic (Member):

or maybe better to use re (might need an import)

Suggested change
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex):
return True
# If all previous tests passed, then the series_id is not deprecated
return False
return not re.match(f"[{string.hexdigits}]{{{len_hash_hex}}}$", hash_hex)

this function needs a unittest which would be trivial to do
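A minimal standalone test along those lines might look as follows; the function body mirrors the diff above, while the example ids and the hash length of 8 are illustrative assumptions:

```python
import string

LEN_HASH_HEX = 8  # hash length settled on earlier in the thread

def is_deprecated_seriesid(series_id: str, len_hash_hex: int = LEN_HASH_HEX) -> bool:
    """True if series_id lacks the trailing '<UID hash>' component."""
    if series_id.count("-") <= 1:
        return True
    # First part must be the series number
    if not series_id.split("-")[0].isdigit():
        return True
    # Last part must be a hex hash of the expected length
    hash_hex = series_id.split("-")[-1]
    hex_digits = set(string.hexdigits)
    return len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex)

def test_is_deprecated_seriesid() -> None:
    assert is_deprecated_seriesid("5-anat_T1w")               # old format, no hash
    assert not is_deprecated_seriesid("5-anat_T1w-3f2a9b1c")  # new format
    assert not is_deprecated_seriesid("5-anat-T1w-3f2a9b1c")  # protocol name may contain '-'
    assert is_deprecated_seriesid("5-anat_T1w-3f2a")          # hash too short
    assert is_deprecated_seriesid("5-anat_T1w-zzzzzzzz")      # not hexadecimal

test_is_deprecated_seriesid()
```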

Comment on lines +232 to +244
if series_id.count('-') <= 1:
return True
# Check the first part of the series_id is a number
series_number = series_id.split('-')[0]
if not series_number.isdigit():
return True
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex
hash_hex = series_id.split('-')[-1]
hex_digits = set(string.hexdigits)
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex):
return True
# If all previous tests passed, then the series_id is not deprecated
return False
@yarikoptic (Member):

overall can be

Suggested change
if series_id.count('-') <= 1:
return True
# Check the first part of the series_id is a number
series_number = series_id.split('-')[0]
if not series_number.isdigit():
return True
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex
hash_hex = series_id.split('-')[-1]
hex_digits = set(string.hexdigits)
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex):
return True
# If all previous tests passed, then the series_id is not deprecated
return False
return not re.match(f"[{string.digits}]+-.+-[{string.hexdigits}]{{{len_hash_hex}}}$", series_id)

to just match regex... again -- just needs a unittest
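For what it's worth, the regex variant can be exercised standalone; this sketch assumes the series_id argument is passed to re.match and the result is negated into a bool:

```python
import re
import string

LEN_HASH_HEX = 8  # hash length settled on in this thread

def is_deprecated_seriesid(series_id: str, len_hash_hex: int = LEN_HASH_HEX) -> bool:
    """Deprecated unless the id matches '<series num>-<protocol name>-<UID hash>'."""
    # re.match anchors at the start; '$' anchors the end, so the trailing
    # component must be exactly len_hash_hex hex characters.
    return not re.match(
        f"[{string.digits}]+-.+-[{string.hexdigits}]{{{len_hash_hex}}}$",
        series_id,
    )

print(is_deprecated_seriesid("5-anat_T1w"))           # True: old format
print(is_deprecated_seriesid("5-anat_T1w-3f2a9b1c"))  # False: has a UID hash
```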

series_number = series_id.split('-')[0]
if not series_number.isdigit():
return True
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex
@yarikoptic (Member):

Suggested change
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex
# Note that protocol_name could contain -

is that correct, and is that why the [-1]?

@yarikoptic (Member)

@neurorepro ping on above comments.

@neurorepro (Contributor, Author)

Yes, totally agreed on the tests. I actually tested it on two datasets (one already having old data, and one newly converted) and it went fine. But indeed tests need to be included for continuous testing.

I'll get back to you!

@yarikoptic yarikoptic added the minor Increment the minor version when merged label Jan 16, 2024
@yarikoptic (Member)

a gentle ping
