[🐛BUG] After switching to benchmark_filename, is the data in the train, valid, and test datasets no longer shuffled? #2076

Open
taolinzhang opened this issue Aug 22, 2024 · 1 comment

@taolinzhang

if self.benchmark_filename_list is not None:
    self._drop_unused_col()
    cumsum = list(np.cumsum(self.file_size_list))
    datasets = [
        self.copy(self.inter_feat[start:end])
        for start, end in zip([0] + cumsum[:-1], cumsum)
    ]
    return datasets
# ordering
ordering_args = self.config["eval_args"]["order"]
if ordering_args == "RO":
    self.shuffle()

Here the shuffle only runs when benchmark_filename_list is not used and ordering_args == "RO".
So when benchmark_filename_list is used for a custom split, the data is never shuffled, and this leads to a drop in performance.
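
For illustration, here is a minimal sketch of the slicing logic quoted above, run outside RecBole. It assumes the rows of the benchmark files are concatenated in file order. `split_by_file_sizes`, `rows`, and the sizes are hypothetical names and values, not part of the library.

```python
import numpy as np


def split_by_file_sizes(rows, file_size_list):
    """Slice `rows` into consecutive chunks with the lengths in `file_size_list`."""
    cumsum = list(np.cumsum(file_size_list))
    # Pair each chunk's start offset with its end offset: (0, 6), (6, 8), (8, 10).
    return [rows[start:end] for start, end in zip([0] + cumsum[:-1], cumsum)]


# Hypothetical data: rows 0-5 come from the train file, 6-7 from valid, 8-9 from test.
rows = np.arange(10)
file_size_list = [6, 2, 2]

train, valid, test = split_by_file_sizes(rows, file_size_list)
print(train, valid, test)

# If all rows were shuffled before slicing, the chunk sizes would stay the same,
# but a row from any file could end up in any of the three chunks.
rng = np.random.default_rng(0)
shuffled = rng.permutation(rows)
mixed_train, mixed_valid, mixed_test = split_by_file_sizes(shuffled, file_size_list)
print(mixed_train, mixed_valid, mixed_test)
```

In this sketch the offsets recover the original files only while the concatenated order is left untouched.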

@taolinzhang taolinzhang added the bug Something isn't working label Aug 22, 2024
@zhengbw0324 zhengbw0324 self-assigned this Aug 27, 2024
@zhengbw0324
Collaborator

zhengbw0324 commented Aug 27, 2024

@iridescentttt
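Hello! When benchmark_filename is used, we do not shuffle the data inside the dataset, so that the boundaries between the data splits are not broken. During training, however, the PyTorch dataloader is used, and it shuffles the training data.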

train_data = get_dataloader(config, "train")(
    config, train_dataset, train_sampler, shuffle=config["shuffle"]
)
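
As a rough sketch of the behavior described here, the example below uses plain `torch.utils.data.DataLoader` rather than RecBole's own dataloader classes. The tensors, batch size, and the `epoch_order` helper are made up for the illustration. It shows that `shuffle=True` reorders samples only within the train split, so the split boundary stays intact.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical splits: ids 0-5 are training rows, ids 6-7 are validation rows.
train_dataset = TensorDataset(torch.arange(6))
valid_dataset = TensorDataset(torch.arange(6, 8))

# A seeded generator makes the shuffled order reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_loader = DataLoader(
    train_dataset, batch_size=2, shuffle=True, generator=generator
)


def epoch_order(loader):
    """Return the sample ids in the order the loader yields them for one epoch."""
    return [int(x) for (batch,) in loader for x in batch]


first_epoch = epoch_order(train_loader)
second_epoch = epoch_order(train_loader)

print(first_epoch)   # a permutation of the train ids
print(second_epoch)  # re-drawn for each epoch, still only train ids
```

Each pass over `train_loader` draws a new permutation of the six training ids, and no validation id can appear, because the loader only sees `train_dataset`.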
