A certain degree of mismatch between the `Key` and the original DataComp-1B #11

Coobiw · 2024-07-01T08:15:18Z

Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by the key.(Reference:#7) However, when I generate 50M data, I find that ~5M mismatches.(I use the huggingface annotations to get a key2recaption mapping. ~5M KeyErrors were raised.) Would you give me some advice? Sincerely looking forward to your reply!

The text was updated successfully, but these errors were encountered:

Coobiw · 2024-07-01T08:17:24Z

Additionally, I also find a mismatch between image and caption. Like following:

xhl-video · 2024-07-01T23:03:33Z

Thanks for your great work! I've tried to match Recap-DataComp-1B with DataComp-1B by the key.(Reference:#7) However, when I generate 50M data, I find that ~5M mismatches.(I use the huggingface annotations to get a key2recaption mapping. ~5M KeyErrors were raised.) Would you give me some advice? Sincerely looking forward to your reply!

Hi, thanks for your interest! Have you tried to use sha256 to match? I am not sure if the mismatch happened because our DataComp1B contains fewer samples than the original one (due to URL invalid issues when downloading)

Coobiw · 2024-07-04T07:25:02Z

Hi, I use the following code to get sha256-to-recaption mapping.

import os
import pickle
from tqdm import tqdm
import datasets

os.environ["HF_DATASETS_OFFLINE"] = "1"
datasets.config.HF_DATASETS_OFFLINE = True

from datasets import load_dataset

# 加载数据集
data = load_dataset("xxx/Recap-DataComp-1B/hf_anno")
print("Finish Loading...")

train_data = data['train']

# 初始化字典
sha256torecaption = {}
part = 1
save_threshold = 400_000_000  # 每个部分的估计条目数

# 填充字典并划分
for item in tqdm(train_data):
    meta_key, re_caption = item['sha256'], item['re_caption']
    sha256torecaption[meta_key] = re_caption
    
    # 检查是否需要保存和清除当前部分
    if len(sha256torecaption) >= save_threshold:
        with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
            pickle.dump(sha256torecaption, f)
        print(f"Part {part} saved as sha256torecaption_part{part}.pkl")
        del sha256torecaption
        sha256torecaption = {}  # 清除字典
        part += 1

# 保存剩余的数据
if sha256torecaption:
    with open(f'sha256torecaption_part{part}.pkl', 'wb') as f:
        pickle.dump(sha256torecaption, f)
    print(f"Part {part} saved as sha256torecaption_part{part}.pkl")

print("All parts saved.")

After this, I get three part of sha256torecaption. I check the length of them as following:

>>> a = pickle.load(open("sha256torecaption_part1.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part2.pkl","rb"))
>>> len(a)
400000000
>>> a = pickle.load(open("sha256torecaption_part3.pkl","rb"))
>>> len(a)
256418986

The sum of the lengths is 1.05B. However, Recap-DataComp has 1.23B items. So I want to know whether sh256 is unique for each item?

Thanks for your reply!

LinB203 · 2024-08-05T07:38:58Z

Additionally, I also find a mismatch between image and caption. Like following:

Hello, do you find this situation common? Approximately what percentage does it account for? @Coobiw

Coobiw · 2024-08-05T11:22:12Z

@LinB203 I'm not sure about the exact percentage. But in my case, many of them are mix-matched (if use 'key' to mapping). As for the sh256, I find some overlap. So I think that the sh256 may be not unique.

LinB203 · 2024-08-07T02:50:52Z

@LinB203 I'm not sure about the exact percentage. But in my case, many of them are mix-matched (if use 'key' to mapping). As for the sh256, I find some overlap. So I think that the sh256 may be not unique.

I think we need your further explaination to figure out this problem. @xhl-video

xhl-video · 2024-08-07T04:20:43Z

Hi,

Thank you for your patience. It took me some time to find the specific research that caused this problem. In short, the issues are both highly related to the img2dataset downloading tool.

First of all, the key mismatch problem: The key is computed by the specific shard id, oom_sample_per_shard, and oom_shard_count see here when using img2dataset. So if you downloaded the DataComp-1B dataset with different scripts than us, it is natural that the keys are not the same as ours. I won't say the caption is mismatched, because the image should not be the same either.

Second, the sha256 problem: That is a good catch. We did indeed find duplicated sha256 hashes in our dataset, indicating the same image according to img2dataset see here. After further investigation, many repeated sha256 hashes are images downloaded with errors, like in this example: . The issue arises when downloading images from certain URLs; if the URL is invalid, an error image is downloaded, and img2dataset uses the image stream to encode the sha256, resulting in duplication.

In conclusion, if you have a DataComp-1B dataset and try to use our captions, you should use sha256 or URL to match instead of the key. However, both sha256 and URL may have some mismatches. The repeated sha256 issue may also occur in your version of the DataComp-1B dataset. I have cleaned all repeated images and will upload the clean version as soon as possible.

Thanks again, and if you have further questions, I am happy to answer them.

LinB203 · 2024-08-07T04:23:39Z

Great, once you upload the latest version, please let me know. Thanks!

xhl-video · 2024-08-07T16:45:10Z

Great, once you upload the latest version, please let me know. Thanks!

Hi, I have updated the cleaned dataset in the huggingface. See Here. As you can see, the duplicated images (including error images that have different captions) in DataComp-1B are around 0.29 billion. I highly recommend checking the duplication issue in your version of DataComp-1B as well. If you have any questions, please feel free to let me know.

kdwonn · 2024-08-20T20:24:25Z

Hi,

Thank you for your patience. It took me some time to find the specific research that caused this problem. In short, the issues are both highly related to the img2dataset downloading tool.

First of all, the key mismatch problem: The key is computed by the specific shard id, oom_sample_per_shard, and oom_shard_count see here when using img2dataset. So if you downloaded the DataComp-1B dataset with different scripts than us, it is natural that the keys are not the same as ours. I won't say the caption is mismatched, because the image should not be the same either.

Second, the sha256 problem: That is a good catch. We did indeed find duplicated sha256 hashes in our dataset, indicating the same image according to img2dataset see here. After further investigation, many repeated sha256 hashes are images downloaded with errors, like in this example: . The issue arises when downloading images from certain URLs; if the URL is invalid, an error image is downloaded, and img2dataset uses the image stream to encode the sha256, resulting in duplication.

In conclusion, if you have a DataComp-1B dataset and try to use our captions, you should use sha256 or URL to match instead of the key. However, both sha256 and URL may have some mismatches. The repeated sha256 issue may also occur in your version of the DataComp-1B dataset. I have cleaned all repeated images and will upload the clean version as soon as possible.

Thanks again, and if you have further questions, I am happy to answer them.

Hi, can you share the SHA256 keys for these error images? I also encountered few during training but forgot to record the hash keys.

xhl-video · 2024-08-25T06:21:37Z

Hi,
Thank you for your patience. It took me some time to find the specific research that caused this problem. In short, the issues are both highly related to the img2dataset downloading tool.
First of all, the key mismatch problem: The key is computed by the specific shard id, oom_sample_per_shard, and oom_shard_count see here when using img2dataset. So if you downloaded the DataComp-1B dataset with different scripts than us, it is natural that the keys are not the same as ours. I won't say the caption is mismatched, because the image should not be the same either.
Second, the sha256 problem: That is a good catch. We did indeed find duplicated sha256 hashes in our dataset, indicating the same image according to img2dataset see here. After further investigation, many repeated sha256 hashes are images downloaded with errors, like in this example: . The issue arises when downloading images from certain URLs; if the URL is invalid, an error image is downloaded, and img2dataset uses the image stream to encode the sha256, resulting in duplication.
In conclusion, if you have a DataComp-1B dataset and try to use our captions, you should use sha256 or URL to match instead of the key. However, both sha256 and URL may have some mismatches. The repeated sha256 issue may also occur in your version of the DataComp-1B dataset. I have cleaned all repeated images and will upload the clean version as soon as possible.
Thanks again, and if you have further questions, I am happy to answer them.

Hi, can you share the SHA256 keys for these error images? I also encountered few during training but forgot to record the hash keys.

Sure, I have uploaded all duplicated SHA256 in the OneDrive Duplicated Sha256. Please let me know if this works.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A certain degree of mismatch between the `Key` and the original DataComp-1B #11

A certain degree of mismatch between the `Key` and the original DataComp-1B #11

Coobiw commented Jul 1, 2024

Coobiw commented Jul 1, 2024

xhl-video commented Jul 1, 2024

Coobiw commented Jul 4, 2024 •

edited

Loading

LinB203 commented Aug 5, 2024 •

edited

Loading

Coobiw commented Aug 5, 2024 •

edited

Loading

LinB203 commented Aug 7, 2024

xhl-video commented Aug 7, 2024

LinB203 commented Aug 7, 2024

xhl-video commented Aug 7, 2024

kdwonn commented Aug 20, 2024

xhl-video commented Aug 25, 2024

A certain degree of mismatch between the Key and the original DataComp-1B #11

A certain degree of mismatch between the Key and the original DataComp-1B #11

Comments

Coobiw commented Jul 1, 2024

Coobiw commented Jul 1, 2024

xhl-video commented Jul 1, 2024

Coobiw commented Jul 4, 2024 • edited Loading

LinB203 commented Aug 5, 2024 • edited Loading

Coobiw commented Aug 5, 2024 • edited Loading

LinB203 commented Aug 7, 2024

xhl-video commented Aug 7, 2024

LinB203 commented Aug 7, 2024

xhl-video commented Aug 7, 2024

kdwonn commented Aug 20, 2024

xhl-video commented Aug 25, 2024

A certain degree of mismatch between the `Key` and the original DataComp-1B #11

A certain degree of mismatch between the `Key` and the original DataComp-1B #11

Coobiw commented Jul 4, 2024 •

edited

Loading

LinB203 commented Aug 5, 2024 •

edited

Loading

Coobiw commented Aug 5, 2024 •

edited

Loading