Confusion regarding construction of 400M dataset #24
Thanks for confirming and checking the match between the paper and the released distribution. Let me explain as follows:
Thanks! This explanation makes sense. A few other follow-up questions: (1) Was any simple text preprocessing / image preprocessing / deduplication done on the samples from the CC shards before running curation on them?
(MetaCLIP/metaclip/substr_matching.py, line 19 in cafb4a4)
(3) Was the performance verified across 400M sets curated from multiple 1.6B pools (collecting from different random CC shards)?
Thanks for the reply; those are indeed good questions. Overall, our goal is to be extremely simple and as raw as possible: we didn't run any preprocessing unless it was required for legal reasons or specified by the OpenAI CLIP paper. Hope this answers your questions, and let us know if you have any more.
Yes, this was very helpful. Thank you! One follow-up question: can you share the exact text sources used for constructing the 500K query set? More specifically, regarding "uni-grams from the English version of Wikipedia occurring at least 100 times", can you point me to a link to the dataset that contains the Wikipedia text you used?
We released the metadata construction code at the end of 2023; please check https://github.com/facebookresearch/MetaCLIP/blob/main/metaclip/README_metadata.md
Based on the details provided in the MetaCLIP paper, I understand that the magic number t = 20K is a threshold used to limit the number of texts/pairs for each entry. Entries with fewer than t pairs (tail entries) retain all associated pairs, while entries with more than t pairs (head entries) are sub-sampled to t pairs.

It is also mentioned that there are ~16K entries in the 500K query list which have counts >= 20K, and they account for 5.35B of the 5.67B total matches. This implies that the remaining ~484K entries have counts < 20K and account for 5.67B - 5.35B ≈ 320M matches. I have checked these numbers using the entry_counts_400m.json file (https://github.com/facebookresearch/MetaCLIP/blob/main/metaclip/entry_counts_400m.json) and they line up.

Based on these two pieces of information, I understand that we would take 20K pairs for each of the ~16K entries that have counts >= 20K, i.e. 20K * 16K ≈ 320M samples, and then add in all matches for the remaining ~484K entries, which from above is ≈ 320M. This gives a dataset of size 320M + 320M = 640M.
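For reference, a minimal sketch of the arithmetic above. It assumes entry_counts_400m.json deserializes to a flat collection of per-entry match counts (the exact format is an assumption here; the snippet handles either a list or a dict of counts):

```python
import json

T = 20_000  # the t = 20K threshold from the paper

# Assumed format: one match count per metadata entry,
# stored either as a JSON list or as an {entry: count} dict.
with open("entry_counts_400m.json") as f:
    raw = json.load(f)
counts = list(raw.values()) if isinstance(raw, dict) else list(raw)

head = [c for c in counts if c >= T]  # "head" entries, expected ~16K
tail = [c for c in counts if c < T]   # "tail" entries, expected ~484K

print(f"head: {len(head):,} entries, {sum(head):,} matches")   # expected ≈ 5.35B
print(f"tail: {len(tail):,} entries, {sum(tail):,} matches")   # expected ≈ 320M
print(f"capped total: {len(head) * T + sum(tail):,}")          # ≈ 640M as computed above
```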
Is my understanding and the demonstrated calculation correct? And if yes, how is this sub-sampled down to 400M?
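For concreteness, here is a rough sketch of the sub-sampling step as I read it from the paper: pairs matching any tail entry are kept, while pairs matching only head entries are kept with probability t / count, so each head entry retains roughly t pairs in expectation. The function and variable names (keep_pair, matched_entry_ids, entry_count) are illustrative only and not taken from the released code:

```python
import random

T = 20_000

def keep_pair(matched_entry_ids, entry_count, t=T):
    """Illustrative sketch of the per-entry balancing described above."""
    for entry_id in matched_entry_ids:
        c = entry_count[entry_id]
        if c < t:                    # tail entry: all its pairs are retained
            return True
        if random.random() < t / c:  # head entry: sub-sampled to ~t pairs in expectation
            return True
    return False
```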