ExpectedMoreSplits error when loading C4 dataset #6746

Closed
billwang485 opened this issue Mar 21, 2024 · 8 comments

@billwang485

Describe the bug

I encounter a bug when running the example command line:

    python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/ 

The bug occurs at these lines of code (when loading the C4 dataset):

    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
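
The `raise` line above shows that the message is a set difference between the splits the library expects and the splits it actually recorded. As a rough illustration only (the `missing_splits` helper is hypothetical and is not the `datasets` implementation, and the input lists are an assumed scenario), the check might behave like this:

```python
def missing_splits(expected_splits, recorded_splits):
    """Return the split names that were expected but never recorded."""
    return set(expected_splits) - set(recorded_splits)


# Assumed scenario: the dataset metadata lists both splits,
# but data_files only supplies the 'train' shard.
expected = ["train", "validation"]
recorded = ["train"]

print(missing_splits(expected, recorded))  # {'validation'}
```

An empty result would mean every expected split was found, so no error would be raised in this simplified model.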

Steps to reproduce the bug

  1. Run the example command line shown above; the error is raised while loading the C4 dataset.

Expected behavior

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

Environment info

I'm using CUDA 12.4, so I installed PyTorch with pip instead of the conda command provided in install.md.

I've also tried another environment set up with the same commands as in install.md, but the same bug occurred.

@lhoestq
Member

lhoestq commented Mar 21, 2024

Hi! We updated the allenai/c4 repository so that people can easily specify which language to load (see the c4 dataset page).

To fix this issue you can update datasets and remove the mention of the legacy configuration name "allenai--c4":

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')
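
For anyone wiring this into their own script, here is a minimal sketch of the same two calls wrapped in helpers. The names `C4_EN_SHARDS`, `c4_shard_path`, and `load_c4_shard` are hypothetical, and the shard counts are simply read off the file names quoted in this thread. The `datasets` import is done lazily so the path helper can be checked without the library or network access.

```python
# Shard counts taken from the file names quoted in this thread
# (…-of-01024 for train, …-of-00008 for validation).
C4_EN_SHARDS = {"train": 1024, "validation": 8}


def c4_shard_path(split, index=0):
    """Build the relative path of one English C4 shard."""
    total = C4_EN_SHARDS[split]
    if not 0 <= index < total:
        raise ValueError(f"shard index {index} out of range for split {split!r}")
    return f"en/c4-{split}.{index:05d}-of-{total:05d}.json.gz"


def load_c4_shard(split, index=0):
    """Load one shard without the legacy 'allenai--c4' configuration name."""
    from datasets import load_dataset  # lazy import: needs `datasets` and network access

    return load_dataset(
        "allenai/c4",
        data_files={split: c4_shard_path(split, index)},
        split=split,
    )


print(c4_shard_path("train"))       # en/c4-train.00000-of-01024.json.gz
print(c4_shard_path("validation"))  # en/c4-validation.00000-of-00008.json.gz
```

With these helpers, `traindata = load_c4_shard("train")` and `valdata = load_c4_shard("validation")` would correspond to the two lines above.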

@K-THU

K-THU commented Apr 1, 2024

Did you solve this problem? I have the same bug. Deleting "allenai--c4" does not help.

@xuChenSJTU

Did you solve it? I ran into this problem too.

@ssy-small-white

But after I remove allenai--c4, it still fails.

@davidbhoffmann

For me it works this way. I'm using datasets version 2.17.0
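
Since the fix seems to depend on the installed `datasets` version, a quick check like the sketch below may help when comparing environments. The helper names are hypothetical, and 2.17.0 is only the version reported as working in this comment, not a documented minimum.

```python
from importlib.metadata import PackageNotFoundError, version

# Version reported as working in this thread; an assumption, not an official minimum.
REPORTED_WORKING = "2.17.0"


def version_tuple(text):
    """Turn '2.17.0' into (2, 17, 0); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_at_least(installed, reference=REPORTED_WORKING):
    """True if `installed` is present and not older than `reference`."""
    return installed is not None and version_tuple(installed) >= version_tuple(reference)


current = installed_version("datasets")
print(f"datasets installed: {current}, at least {REPORTED_WORKING}: {is_at_least(current)}")
```

If the check reports an older version, upgrading `datasets` before retrying is the first thing to try.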

@chaosright

First, run pip install --upgrade datasets.
Second, update the following two lines of code in data.py (in lib):

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

@albertvillanova
Member

The error is in the Wanda repository: https://github.com/locuslab/wanda

Concretely, in these code lines:
https://github.com/locuslab/wanda/blob/8e8fc87b4a2f9955baa7e76e64d5fce7fa8724a6/lib/data.py#L43-L44

Please report there and/or make the fix in their code.

@SimWangArizona

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Solved for me! Thanks!
