handle `split` parameter for `DatasetDict.load_dataset()` #348

ArneBinder · 2023-09-18T14:39:26Z

With this PR, it is possible to call DatasetDict.load_dataset(..., split=SPLIT) or DatasetDict.from_json(..., split=SPLIT) to load only the data of the selected SPLIT. This helps with large datasets, where fetching the data and calling _generate_document() already takes a long time and should not run for splits that are never used.

Usage:

1. Re-install pytorch-ie: pip uninstall pytorch-ie -y && pip install git+https://github.com/ChristophAlt/pytorch-ie.git@load_dataset_with_split
2. Call the eval command with +dataset.input.split=test

Changes:

- DatasetDict.load_dataset() and DatasetDict.from_json(): unwrap the split parameter and use it if the base load_dataset() returns a single (Hugging Face) (Iterable)Dataset instead of a ...Dict
- DatasetDict.load_dataset(): add type checks
- DatasetDict.from_hf(): simplify to only work with (Iterable)DatasetDicts or dicts containing (Iterable)Datasets
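
The split handling described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: FakeDataset, fake_base_load_dataset() and load_as_dict() are hypothetical stand-ins, and no pytorch-ie or Hugging Face API is called. It only shows the idea of re-wrapping a single returned dataset under the requested split name and type-checking the result.

```python
from typing import Dict, Optional, Union


class FakeDataset(list):
    """Hypothetical stand-in for a single-split (Iterable)Dataset."""


def fake_base_load_dataset(
    split: Optional[str] = None,
) -> Union[FakeDataset, Dict[str, FakeDataset]]:
    """Hypothetical base loader: one dataset if `split` is given, else all splits."""
    all_splits = {
        "train": FakeDataset(["doc_a", "doc_b"]),
        "test": FakeDataset(["doc_c"]),
    }
    if split is not None:
        return all_splits[split]
    return all_splits


def load_as_dict(split: Optional[str] = None) -> Dict[str, FakeDataset]:
    """Always return a mapping from split name to dataset."""
    loaded = fake_base_load_dataset(split=split)
    if isinstance(loaded, FakeDataset):
        # A single dataset came back: file it under the requested split name.
        if split is None:
            raise ValueError("got a single dataset but no split name to store it under")
        return {split: loaded}
    if isinstance(loaded, dict):
        # Type check: every value must be a dataset.
        for name, dataset in loaded.items():
            if not isinstance(dataset, FakeDataset):
                raise TypeError(f"split {name!r} has unexpected type {type(dataset)}")
        return loaded
    raise TypeError(f"unexpected return type: {type(loaded)}")


only_test = load_as_dict(split="test")  # holds just the "test" split
everything = load_as_dict()             # holds "train" and "test"
```

With this shape, downstream code can keep iterating over a split-to-dataset mapping whether one split or all splits were requested.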

DatasetDict.load_dataset(): unwrap split parameter and use if base load_dataset() returns no dataset dict; add type checks

02bbb89

ArneBinder added the bug label Sep 18, 2023

simplify DatasetDict.from_hf() to only work with (Iterable)DatasetDicts or dicts containing (Iterable)Datasets; add explicit split parameter to DatasetDict.from_json()

863f0db

ArneBinder merged commit 119a725 into main Sep 19, 2023

ArneBinder deleted the load_dataset_with_split branch September 19, 2023 09:55
