How are datasets organized in .tar format? #59

shuo-yan20 · 2024-07-04T10:17:11Z

Thank you for your excellent work!

In metaclip/pipeline.py, I found that the function shard_text_loader parses the .tar format data, including locating the .jpeg and .json files. I want to know how these .tar files are organized, and why the .jpeg image data has already been downloaded before sub_matching.

Thanks very much!


howardhsu · 2024-08-15T16:28:32Z

It's meant to be similar to the webdataset format.
To keep things 100% transparent, our sample dataloader reads the shards with the regular Python tar API. The tar files are organized as <dataset_dir>/{shard_id % 100}/{shard_id}.tar.

Each tar contains files in the following order:

     uuid1.json
     uuid1.jpeg
     uuid2.json
     uuid2.jpeg
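
For illustration, here is a minimal sketch of reading one shard with Python's standard tarfile module, assuming the layout described above. The helper names shard_path and iter_shard are hypothetical and are not taken from metaclip/pipeline.py. It also assumes the directory and file names are not zero-padded and that each .json is immediately followed by its matching .jpeg.

```python
import json
import os
import tarfile


def shard_path(dataset_dir, shard_id):
    # Layout from the description above: <dataset_dir>/{shard_id % 100}/{shard_id}.tar
    # Assumption: no zero-padding in the directory or file name.
    return os.path.join(dataset_dir, str(shard_id % 100), f"{shard_id}.tar")


def iter_shard(tar_path):
    """Yield (uuid, metadata, jpeg_bytes) for each .json/.jpeg pair in a shard."""
    with tarfile.open(tar_path) as tar:
        uuid, metadata = None, None
        for member in tar:
            if not member.isfile():
                continue
            stem, ext = os.path.splitext(member.name)
            data = tar.extractfile(member).read()
            if ext == ".json":
                # Remember the metadata until the matching image arrives.
                uuid, metadata = stem, json.loads(data)
            elif ext == ".jpeg" and stem == uuid:
                yield uuid, metadata, data
                uuid, metadata = None, None
```

Something like `for uuid, meta, img in iter_shard(shard_path("data", 205))` would then walk the pairs in data/5/205.tar in the order they were written. A .jpeg without a preceding .json of the same name is skipped in this sketch.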
