open_virtual_dataset fails to open tiffs #291
Comments
Ah, sorry about this @elyall! We do have some rudimentary tests for tiff, but as the tiff reader is just a thin wrapper around the kerchunk tiff reader I (naively) expected this to "just work". I agree that we should remedy that with better test coverage. Thanks for the reproducible example; I'll have a look at what might be going on now.
So I've looked into this, and I'm a bit confused by what kerchunk returns, and am ignorant of exactly what information is stored in a tiff file. The kerchunk tiff reader basically returns this information about your example tiff file:
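For context, kerchunk/fsspec reference sets generally take the shape sketched below: zarr metadata serialised as JSON strings, plus chunk keys that map to `[path, offset, length]` byte ranges in the original file. The key names, offsets, and lengths here are hypothetical, not the actual output for this file:

```python
# Illustrative only: the general shape of a kerchunk/fsspec reference dict.
# The key names, offsets, and byte lengths are hypothetical.
example_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # zarr array metadata, stored as a JSON string
        "0/.zarray": '{"shape": [128, 128], "chunks": [128, 128], "dtype": "|u1"}',
        "0/.zattrs": '{"_ARRAY_DIMENSIONS": ["Y", "X"]}',
        # each chunk key maps to [path, byte offset, byte length] in the source file
        "0/0.0": ["test.tif", 8, 16384],
    },
}
```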
VirtualiZarr's tiff reader then attempts to parse this and wrap it up into a nicer xarray Dataset. What confuses me here is: how am I supposed to know the name of the array you created? Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention? Examples I find online of using rioxarray and rasterio are for opening GeoTIFF files - do they have more information in them about variable names?

This is related to the failure you're seeing, because the virtualizarr parsing logic is expecting kerchunk to return something that has information about group names and array names. Kerchunk is also telling me what the array dimensions are called.
They do not. There is just a single array, and sometimes overviews.
I've opened #292 to add your example as a test case and start trying to fix it @elyall.
Interesting... So if it's just one unnamed array, then "open_dataset" doesn't even make sense; the tiff reader should really only support returning a virtual DataArray rather than a full Dataset.

EDIT: My suggestion seems consistent with what
Yes, a DataArray is a better fit. I just need the ManifestArray object.
Cool. I'll be able to continue this next week, but you're welcome to work on it instead if you're interested.
There can be multiple arrays, including overviews. In fact, before GeoTIFF had explicit overviews, multiple arrays (subdatasets in GDAL parlance) were used for overviews in some software (maybe only one, though...). I'm interested to explore this, so hoping I can help a little 🙏
@mdsumner I'm not familiar with GeoTIFF, but I'm guessing it's similar to other TIFF specs I'm aware of: a multi-page TIFF with specific metadata. Readers like
Does multi-page mean (multi)subdataset, analogously to multi-variable in NetCDF? I'm not familiar with the tiff spec, mostly only with GeoTIFF usage details and the abstractions. "Pages" could be like bands, or layers, like actual pages in a PDF where neither the shape nor the pixel type changes. Sadly format idioms leak into these abstractions, so please indulge my questions; I expect you'll have the same needs for clarification 🙏
Thanks both of you.
This was going to be my question too. In Xarray terminology, what we need to know is whether each page of a multi-page TIFF maps to (1) a slice along a dimension of a single array, (2) a separate variable in the same Dataset, or (3) a separate group (as in a DataTree).
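To make those options concrete, here is a hedged sketch with plain numpy data and made-up names (the DataTree variant assumes an xarray version that ships `xarray.DataTree`):

```python
import numpy as np
import xarray as xr

# Two hypothetical "pages" with the same shape and dtype.
pages = [np.zeros((20, 20), dtype="uint8"), np.ones((20, 20), dtype="uint8")]

# (1) Pages become slices along a new dimension of a single array.
stacked = xr.DataArray(np.stack(pages), dims=["page", "y", "x"])

# (2) Each page becomes its own variable in one Dataset.
ds = xr.Dataset(
    {f"page{i}": xr.DataArray(p, dims=["y", "x"]) for i, p in enumerate(pages)}
)

# (3) Each page lives in its own group of a DataTree.
dt = xr.DataTree.from_dict(
    {f"page{i}": xr.Dataset({"image": (("y", "x"), p)}) for i, p in enumerate(pages)}
)
```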
I'm keen to try and get this working, but it would be much appreciated if someone who actually works with (Geo)TIFF files regularly were interested in helping maintain VirtualiZarr's TIFF reader. The rest of the package is designed to be as modular as possible to minimize the effort required, and we already have some developers who have helpfully contributed and taken ownership of other specific readers (e.g. DMR++). cc'ing @scottyhq and @sharkinsspatial as people who may be interested, but any lurkers feel free to pipe up 🪈
I've only used option 1, where all pages are slices in one array and therefore have matching dimensions, but that is not always the case. Here's a great overview of internal TIFF structure. The Pyramid TIFF is an example of arrays of different sizes (though with the same dimensions) that would require separate keys (in this case a DataTree would work, example here). As you can see it's complicated, since beyond the underlying structure there are a lot of different TIFF specs that require their own data loaders.

All IFD and SubIFD ImageBytes should be individual ManifestArrays. The one consistent thing is that they are sequential, so as a start the keys/schema could be passed in as an argument; from there, individual use cases could be handled as desired. I'm happy to help and potentially maintain a limited-scope reader (the TIFF spec itself shouldn't change much), but not all of the many edge cases.
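As a concrete illustration of those IFDs and their byte ranges, here is a hedged sketch using tifffile (the file path is hypothetical); each page's data offsets and byte counts are roughly the information a ManifestArray would need to record:

```python
import tifffile

path = "example.tif"  # hypothetical path; any multi-page or pyramidal TIFF works

with tifffile.TiffFile(path) as tif:
    for i, page in enumerate(tif.pages):
        # Each IFD ("page") describes one image: its shape, dtype, and the
        # byte ranges of its data segments within the file.
        print(
            f"page {i}: shape={page.shape}, dtype={page.dtype}, "
            f"offsets={page.dataoffsets[:3]}, lengths={page.databytecounts[:3]}"
        )
```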
I'm still getting the same error as the OP??

import virtualizarr

virtualizarr.__version__
# '1.1.1.dev12+g9e7d430'

from pathlib import Path

import numpy as np
from PIL import Image

from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType


def write_random_tiff(file_path):
    # Write a random single-band 128x128 uint8 image as a single-page TIFF.
    array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    img = Image.fromarray(array)
    img.save(file_path)


file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)

open_virtual_dataset(str(file_path), filetype=FileType.tiff)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/backend.py", line 190, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/readers/tiff.py", line 48, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py", line 46, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

I'm really keen to explore this, but I'm having issues getting a working environment. Will try again later :)
I got a little ahead of myself, as kerchunk's

It would be great if we had an example TIFF file that has multiple arrays that someone understands well, to cover the Dataset route. I found this sample GeoTIFF but it appears to be a single array. I could generate a multiscale OME-TIFF file, but the keys are sequential and the data fits better in a DataTree than a Dataset.
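If a small multi-array test file would help, one hedged option (arbitrary file name and sizes) is to write a multi-page TIFF with tifffile:

```python
import numpy as np
import tifffile

# Arbitrary test data: two pages with different shapes.
page0 = np.random.randint(0, 255, (64, 64), dtype=np.uint8)
page1 = np.random.randint(0, 255, (32, 32), dtype=np.uint8)

with tifffile.TiffWriter("multipage.tif") as tif:
    tif.write(page0)  # first IFD
    tif.write(page1)  # second IFD
```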
There's a multi tiff here: https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif

gdalinfo /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Driver: GTiff/GeoTIFF
Files: /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Size is 20, 20
Image Structure Metadata:
INTERLEAVE=BAND
Subdatasets:
SUBDATASET_1_NAME=GTIFF_DIR:1:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
SUBDATASET_1_DESC=Page 1 (20P x 20L x 1B)
SUBDATASET_2_NAME=GTIFF_DIR:2:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
SUBDATASET_2_DESC=Page 2 (20P x 20L x 1B)
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 20.0)
Upper Right ( 20.0, 0.0)
Lower Right ( 20.0, 20.0)
Center ( 10.0, 10.0)
Band 1 Block=20x20 Type=Byte, ColorInterp=Gray
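For a cross-check outside GDAL, here is a hedged sketch that fetches the same file and lists its pages with tifffile (assumes network access and that tifffile is installed):

```python
import urllib.request

import tifffile

url = (
    "https://github.com/OSGeo/gdal/raw/refs/heads/master/"
    "autotest/gcore/data/twoimages.tif"
)
path, _ = urllib.request.urlretrieve(url, "twoimages.tif")

with tifffile.TiffFile(path) as tif:
    # Expecting two 20x20 uint8 pages, matching the gdalinfo subdatasets above.
    for i, page in enumerate(tif.pages):
        print(i, page.shape, page.dtype)
```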
Oh
Okay, I tried using an approach like

from tifffile import imread
import zarr

filepath = "test.tif"  # any single-page TIFF, e.g. the one from the reproduction above
store = imread(filepath, aszarr=True)
# TODO exception handling for TIFF files with multiple arrays
za = zarr.open_array(store=store, mode="r")

but ended up down a rabbit hole because a)

EDIT: I posted where I got to in #295 in case it's ever of any use. I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about whether we can eliminate kerchunk from that process.
It does not at the moment.
That's correct, and
@mdsumner That file has two arrays with the same shape and dimensions.
Sounds good. Right now I can help most with developing tests, but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.
I think that's fine: multiple chunks is how they're intended to be interpreted, not as different arrays. (It's a very old style.)
Thanks both. I put my attempt to bypass kerchunk up in #295. I've also submitted #296 to change the documentation to no longer falsely advertise that we can read tiff files when we cannot yet.
It really shouldn't be particularly difficult to finish #292 - it mostly requires rejigging the
I've done that here. The only real change was to have

The question is: do you want to create a
#297 looks good, thank you @elyall !
Yeah, good question. The main advantage of coercing a single-image TIFF into a
We could imagine providing a new overloaded opening function like

def open_tiff(filepath: str) -> Dataset | DataArray | DataTree:
    ...

but I don't really like that idea either, because it's likely the next line of code just breaks depending on which type was returned, and it breaks the correspondence with xarray's own opening functions. (Note that we have basically the same problem with not knowing whether GRIB/netCDF files map to a Dataset or a DataTree.) So I think I prefer that we add
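For illustration only (not VirtualiZarr's settled API), here is a minimal sketch of how a single unnamed image, already virtualized as a ManifestArray-like duck array `marr`, could be coerced into either a Dataset under an assumed default name or a DataArray; the dimension names and the default "0" are hypothetical choices:

```python
import xarray as xr


def wrap_single_image_as_dataset(marr, name: str = "0") -> xr.Dataset:
    # TIFFs carry no variable or dimension names, so both are assumed defaults here.
    var = xr.Variable(dims=["y", "x"], data=marr)
    return xr.Dataset({name: var})


def wrap_single_image_as_dataarray(marr) -> xr.DataArray:
    # The DataArray route sidesteps the variable-naming question entirely.
    return xr.DataArray(marr, dims=["y", "x"])
```

Whether the Dataset coercion happens by default or a separate DataArray-returning entry point exists is exactly the open question above.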
open_virtual_dataset says it supports TIFF files, however it currently fails and is lacking test coverage. The issue starts here, where the kerchunk reference to a zarr array is treated as an HDF reference that has one or more groups.

Code to reproduce:

Throws:

virtualizarr = 1.1.0