ExpectedMoreSplits error when loading C4 dataset #6746

billwang485 · 2024-03-21T02:53:04Z

Describe the bug

I encounter a bug when running the example command line:

    python main.py \
    --model decapoda-research/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save out/llama_7b/unstructured/wanda/

The bug occurred at these lines of code (when loading the c4 dataset):

    traindata = load_dataset('allenai/c4', 'allenai--c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', 'allenai--c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}
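
For context, the `raise` line above is a set difference between the splits the dataset metadata expects and the splits that were actually recorded. The following is a minimal standalone sketch of that check, not the library's code: `ExpectedMoreSplitsSketch` and `check_splits_sketch` are hypothetical names, and the scenario (metadata listing both splits while `data_files` supplies only `train`) is an assumption about how the error arises here.

```python
class ExpectedMoreSplitsSketch(Exception):
    """Stand-in for datasets.utils.info_utils.ExpectedMoreSplits (illustration only)."""


def check_splits_sketch(expected_splits, recorded_splits):
    """Raise if a split named in the expected metadata was not produced."""
    missing = set(expected_splits) - set(recorded_splits)
    if missing:
        raise ExpectedMoreSplitsSketch(str(missing))


# Assumed scenario: metadata lists both splits, but only 'train' was loaded.
try:
    check_splits_sketch(["train", "validation"], ["train"])
except ExpectedMoreSplitsSketch as err:
    print(err)  # {'validation'}
```

Under that assumption, the missing split in the message is simply the one that the `data_files` argument did not cover.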

Steps to reproduce the bug

I encounter the bug when running the example command line above.

Expected behavior

The error message states:

    raise ExpectedMoreSplits(str(set(expected_splits) - set(recorded_splits)))
    datasets.utils.info_utils.ExpectedMoreSplits: {'validation'}

Environment info

I'm using CUDA 12.4, so I installed PyTorch with pip instead of using the conda command provided in install.md.

I've also tried another environment set up with the same commands as in install.md, but the same bug occurred.

lhoestq · 2024-03-21T14:06:06Z

Hi! We updated the allenai/c4 repository so that people can easily specify which language to load (see the c4 dataset page).

To fix this issue you can update datasets and remove the mention of the legacy configuration name "allenai--c4":

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

K-THU · 2024-04-01T05:12:24Z

Did you solve this problem? I have the same bug. Deleting "allenai--c4" does not help.

xuChenSJTU · 2024-04-09T07:30:55Z

Did you solve it? I ran into this problem too.

ssy-small-white · 2024-04-21T02:54:43Z

But after I remove allenai--c4, it still fails.

davidbhoffmann · 2024-04-22T16:30:14Z

For me it works this way. I'm using datasets version 2.17.0
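
To compare a local install against the version reported working in this comment, something like the sketch below could help. The helper names are hypothetical, and 2.17.0 is only the version mentioned here, not a documented minimum.

```python
import re
from importlib import metadata

# Version reported working in this thread; an assumption, not an official minimum.
REPORTED_WORKING = "2.17.0"


def version_tuple(version):
    """Parse the leading numeric part of a version string, e.g. '2.17.1.dev0' -> (2, 17, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(installed, reference=REPORTED_WORKING):
    """Return True if `installed` is the same as or newer than `reference`."""
    a, b = version_tuple(installed), version_tuple(reference)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.17' and '2.17.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_datasets_version():
    """Return the installed `datasets` version string, or None if it is not installed."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None
```

For example, `is_at_least(installed_datasets_version() or "0")` would indicate whether an upgrade is worth trying before changing any code.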

chaosright · 2024-07-25T16:33:41Z

First, run pip install --upgrade datasets.
Second, update the following two lines of code in data.py (in lib):

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

albertvillanova · 2024-07-29T07:21:08Z

The error is in the Wanda repository: https://github.com/locuslab/wanda

Llama2 ExpectedMoreSplits Exception locuslab/wanda#57

Concretely, in these code lines:
https://github.com/locuslab/wanda/blob/8e8fc87b4a2f9955baa7e76e64d5fce7fa8724a6/lib/data.py#L43-L44

Please report there and/or make the fix in their code.

SimWangArizona · 2024-09-18T19:57:12Z

    traindata = load_dataset('allenai/c4', data_files={'train': 'en/c4-train.00000-of-01024.json.gz'}, split='train')
    valdata = load_dataset('allenai/c4', data_files={'validation': 'en/c4-validation.00000-of-00008.json.gz'}, split='validation')

Solved for me! Thanks!

rsong0606 mentioned this issue Apr 29, 2024

Support for LLaMA-2 locuslab/wanda#23

Open

albertvillanova closed this as completed Jul 29, 2024
