
Add notebook with conversion examples #67

Open
wants to merge 4 commits into
base: master

erindiel

This notebook summarizes available conversion tools that rely on Bio-Formats for reading and writing various file formats. Specifically, it includes sample commands for converting using bfconvert and bioformats2raw, as well as a description of scenarios where one tool might be preferred over the other.

Sample data comes primarily from IDC. There are examples of both reading and writing DICOM. Are there other preferred datasets or methods for getting this data than what is used here?

This notebook can be run in Google Colab or it can be run locally; however, commands like wget will not work on Windows, so some sections will not be testable locally by Windows users. Is this acceptable?

cc @melissalinkert @dclunie @fedorov

@fedorov
Member

fedorov commented Aug 8, 2024

@erindiel @melissalinkert I started my review and made some minor improvements to simplify access to data from IDC. You can find my edits here: https://colab.research.google.com/drive/1gkJpKr1cL5R4uEQkQtFE0UPGxtiJHXk_?usp=sharing.

Overall, the structure looks great! I have a few minor comments, but I first wanted to bring up what I think is a major issue: the cells that convert from DICOM to an alternative representation are extremely slow.

This one took 48 minutes for a single H&E slide on a default Google Colab CPU instance.

image

The next cell has been running around that same time and is still not finished.

Can you comment on why this is so slow and what can be done about this? Is ome/bioformats#4190 going to remedy this?

@fedorov
Member

fedorov commented Aug 8, 2024

The following one took almost 2 hours!

image

@melissalinkert

Thanks, @fedorov. We're looking into the performance issue, as that seems to be noticeably slower than what we saw when originally testing.

ome/bioformats#4190 is not expected to affect conversion with bioformats2raw/raw2ometiff. That set of changes expands access to "precompressed" tiles, which the bioformats2raw/raw2ometiff conversion workflow cannot currently make use of.

@erindiel
Author

Thanks again @fedorov for noting the conversion time issue. We confirmed that when testing the notebook locally, the conversion took less than 10 minutes, even when lowering the max worker count using --max-workers. We therefore assume that slower I/O on Google Colab is what increases the conversion time so dramatically.
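To compare run times between a local machine and Colab, a conversion command could be wrapped in a small timing helper. The following is a minimal sketch and not part of the notebook: `timed_run` is a hypothetical helper name, and the bioformats2raw call shown in the comment uses placeholder paths and assumes the tool is installed and on the PATH.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run a command and return (return code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


# Harmless stand-in command so the sketch runs anywhere.
# A real comparison might instead time something like (placeholder paths):
#   ["bioformats2raw", "--max-workers", "2", "input.dcm", "output.zarr"]
code, seconds = timed_run([sys.executable, "-c", "pass"])
print(f"exit code {code}, took {seconds:.2f} s")
```

Running the same command with the same worker count in both environments would help separate I/O effects from CPU effects.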

A couple of options to improve the situation:

Member

@fedorov fedorov left a comment

When I tried to run this on Linux, I got the error below. It may be due to limited disk space, but I am not sure. I will try again a bit later.

image

],
"source": [
"# IDC supports image download via s5cmd\n",
"!pip install s5cmd\n",
Member

Instead, do this:

# idc-index is a convenience package to support access to IDC data
!pip install idc-index --upgrade

],
"source": [
"# Download sample data from IDC\n",
"!s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp \"s3://idc-open-data/6d7f4ec7-2c84-4a46-86ac-acde279195bb/*\" rgb-dicom\n",
Member

Instead, do this:

!idc download-from-selection --series-instance-uid 1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0 --download-dir ./rgb-dicom --dir-template ""

This will report the total download size, check that you have enough disk space, and show download progress.
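For readers who prefer to assemble the call from Python, the command above can be built as an argument list. This is only a sketch: the helper names `idc_download_command` and `free_gib` are hypothetical, the flags are exactly those quoted in the comment, and the disk-space check is a separate standard-library call, not the check that idc-index performs itself.

```python
import shlex
import shutil


def idc_download_command(series_uid, download_dir):
    """Build the idc-index CLI call quoted in the review comment."""
    return [
        "idc", "download-from-selection",
        "--series-instance-uid", series_uid,
        "--download-dir", download_dir,
        "--dir-template", "",
    ]


def free_gib(path="."):
    """Free disk space at `path` in GiB (independent of idc-index's own check)."""
    return shutil.disk_usage(path).free / 2**30


cmd = idc_download_command(
    "1.3.6.1.4.1.5962.99.1.3140643155.174517037.1639523215699.2.0",
    "./rgb-dicom",
)
print(shlex.join(cmd))  # shell-quoted form of the command
print(f"free space: {free_gib():.1f} GiB")
```

The list can be passed to `subprocess.run(cmd, check=True)` once idc-index is installed.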

"id": "BSgOKNYErcGz"
},
"source": [
"### Install required packages"
Member

As in the other notebook, I would install all prerequisites in a dedicated cell at the beginning of the notebook.

@DanielaSchacherer
Contributor

Hi Erin, Andrey asked me to have a look at this notebook as well.
I think it's a very useful notebook for anyone who has questions about conversion tools (I looked at the version where @fedorov had already made some edits). I can confirm the running times he experienced in Colab (mine were even a little longer). I would also suggest using a small slide as the example and stating in the text that converting a whole dataset is not something that should be run in Colab.
Apart from that, I would not push the notebook to the repository with its outputs included (except for the two images close to the end of the notebook).
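One way to drop outputs while keeping selected cells is to edit the notebook JSON directly. The sketch below assumes the standard nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); `strip_outputs` and the tiny example notebook are hypothetical and only illustrate the idea.

```python
import copy


def strip_outputs(notebook, keep_cells=()):
    """Return a copy of a notebook dict with code-cell outputs cleared.

    keep_cells: indices of cells whose outputs should be preserved.
    """
    cleaned = copy.deepcopy(notebook)
    for index, cell in enumerate(cleaned.get("cells", [])):
        if cell.get("cell_type") == "code" and index not in keep_cells:
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical three-cell notebook used only to exercise the function.
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["### Install required packages"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print('converting')"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["converting\n"]}]},
        {"cell_type": "code", "metadata": {}, "execution_count": 2,
         "source": ["show_image()"],
         "outputs": [{"output_type": "display_data", "data": {},
                      "metadata": {}}]},
    ],
}

# Keep the output of cell 2 (standing in for an image cell), clear the rest.
cleaned = strip_outputs(example, keep_cells={2})
```

For a real file, the dict would come from `json.load` on the `.ipynb` and be written back with `json.dump`. If no outputs need to be kept, `jupyter nbconvert --clear-output --inplace` does the same job from the command line.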

…xample files (#2)

* use idc-index and smaller files for conversion examples

* clarify compression option

* use bio-formats 8.0.0 for precompressed option
@erindiel
Author

Thanks for the reviews here. The notebook has been updated to:

  1. use idc-index for image download
  2. use smaller example files from IDC
  3. recommend local conversion for larger data
  4. use the latest Bio-Formats 8.0.0, which improves the -precompressed option for bfconvert (the compression type no longer needs to be specified): https://www.openmicroscopy.org/2024/10/24/bio-formats-8-0-0.html
  5. use --compression zlib in bioformats2raw commands to avoid installation issues with blosc
  6. remove the outputs of each block
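The two command styles mentioned in items 4 and 5 could be assembled roughly as follows. This is a sketch rather than the notebook's actual cells: the helper names and file names are placeholders, and only flags named in this thread (`-precompressed`, `--compression zlib`, `--max-workers`) are used.

```python
import shlex


def bfconvert_precompressed(src, dst):
    """bfconvert call using -precompressed (no compression type given)."""
    return ["bfconvert", "-precompressed", src, dst]


def bioformats2raw_zlib(src, dst, max_workers=None):
    """bioformats2raw call using zlib instead of the blosc default."""
    cmd = ["bioformats2raw", "--compression", "zlib"]
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    return cmd + [src, dst]


# Placeholder file names, for illustration only.
print(shlex.join(bfconvert_precompressed("input.dcm", "output.ome.tiff")))
print(shlex.join(bioformats2raw_zlib("input.dcm", "output.zarr", max_workers=2)))
```

Either list can be handed to `subprocess.run(..., check=True)` once the corresponding tool is installed.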
