Confusion regarding construction of 400M dataset #24
Thanks for confirming and checking the match between the paper and the released distribution. Let me explain as follows:
Thanks! This explanation makes sense. A few other follow-up questions: (1) Was any simple text preprocessing / image preprocessing / deduplication done on the samples from the CC shards before running curation on them?
(MetaCLIP/metaclip/substr_matching.py, line 19 in cafb4a4)
(3) Was the performance verified across 400M sets curated from multiple 1.6B pools (collecting from different random CC shards)?
Thanks for the reply; those are indeed good questions. Overall, our goal is to be extremely simple and as raw as possible: we didn't run any preprocessing unless it was required for legal reasons or specified by the OpenAI CLIP paper. Hope this answers your questions, and let us know if you have any more.
Yes, this was very helpful. Thank you! One follow-up question: can you share the exact text sources used for constructing the 500K query set? More specifically, regarding "uni-grams from the English version of Wikipedia occurring at least 100 times", can you point me to a link to the dataset that contains the Wikipedia text you used?
We released the metadata construction code at the end of 2023; please check https://github.com/facebookresearch/MetaCLIP/blob/main/metaclip/README_metadata.md
Based on the details provided in the MetaCLIP paper, I understand that the magic number t = 20K is a threshold used to limit the number of texts/pairs for each entry. Entries with fewer than t pairs (tail entries) retain all associated pairs, while entries with more than t pairs (head entries) are sub-sampled to t pairs.

It is also mentioned that there are ~16K entries in the 500K query list which have counts >= 20K, and they account for 5.35B of the 5.67B total matches. This implies that the remaining ~484K entries have counts < 20K and account for 5.67B - 5.35B ≈ 320M matches. I have checked these numbers using the entry_counts_400m.json file (https://github.com/facebookresearch/MetaCLIP/blob/main/metaclip/entry_counts_400m.json) and they line up.

Based on these two pieces of information, I understand that we would take 20K pairs for each of the ~16K entries that have counts >= 20K, i.e. 20K * 16K ≈ 320M samples, and then add in all matches for the remaining ~484K entries, which from above is ≈ 320M. This gives a dataset of size 320M + 320M = 640M.
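For reference, a minimal sketch of the arithmetic above. It assumes entry_counts_400m.json deserializes to a flat collection of per-entry match counts (the exact format is an assumption here; the snippet handles either a list or a dict of counts):

```python
import json

T = 20_000  # the t = 20K threshold from the paper

# Assumed format: one match count per metadata entry,
# stored either as a JSON list or as an {entry: count} dict.
with open("entry_counts_400m.json") as f:
    raw = json.load(f)
counts = list(raw.values()) if isinstance(raw, dict) else list(raw)

head = [c for c in counts if c >= T]  # "head" entries, expected ~16K
tail = [c for c in counts if c < T]   # "tail" entries, expected ~484K

print(f"head: {len(head):,} entries, {sum(head):,} matches")   # expected ≈ 5.35B
print(f"tail: {len(tail):,} entries, {sum(tail):,} matches")   # expected ≈ 320M
print(f"capped total: {len(head) * T + sum(tail):,}")          # ≈ 640M as computed above
```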
Is my understanding and the demonstrated calculation correct? And if yes, how is this sub-sampled down to 400M?
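For concreteness, here is a rough sketch of the sub-sampling step as I read it from the paper: pairs matching any tail entry are kept, while pairs matching only head entries are kept with probability t / count, so each head entry retains roughly t pairs in expectation. The function and variable names (keep_pair, matched_entry_ids, entry_count) are illustrative only and not taken from the released code:

```python
import random

T = 20_000

def keep_pair(matched_entry_ids, entry_count, t=T):
    """Illustrative sketch of the per-entry balancing described above."""
    for entry_id in matched_entry_ids:
        c = entry_count[entry_id]
        if c < t:                    # tail entry: all its pairs are retained
            return True
        if random.random() < t / c:  # head entry: sub-sampled to ~t pairs in expectation
            return True
    return False
```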