handle split parameter for DatasetDict.load_dataset() #348
With this PR, it is possible to call DatasetDict.load_dataset(..., split=SPLIT) or DatasetDict.from_json(..., split=SPLIT) to load only the data of a selected SPLIT. This is helpful for large datasets, where getting the data and calling _generate_document() already takes a lot of time and should not be executed for splits that are not used at all.

Usage:
pip uninstall pytorch-ie -y && pip install git+https://github.com/ChristophAlt/pytorch-ie.git@load_dataset_with_split
+dataset.input.split=test
Changes:
- DatasetDict.load_dataset() and DatasetDict.from_json(): unwrap the split parameter and use it if the base load_dataset() returns a simple (Huggingface) (Iterable)Dataset instead of a ...Dict
- DatasetDict.load_dataset(), DatasetDict.from_hf(): simplify to only work with (Iterable)DatasetDicts or dicts containing (Iterable)Datasets
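
The unwrapping described in the first change can be sketched roughly as follows. This is an illustration only, not the code from this PR: the Dataset class is a stand-in for a single-split (Iterable)Dataset, and the helper name as_split_dict is hypothetical.

```python
from typing import Dict, Optional, Union


class Dataset(list):
    """Hypothetical stand-in for a single-split (Iterable)Dataset."""


def as_split_dict(
    loaded: Union[Dataset, Dict[str, Dataset]],
    split: Optional[str] = None,
) -> Dict[str, Dataset]:
    """Normalize a loader result to a mapping from split name to dataset.

    Assumption: the base loader returns a bare dataset when a split was
    requested and a dict-like object keyed by split name otherwise.
    """
    if isinstance(loaded, Dataset):
        # A bare dataset carries no split name, so the requested split
        # is needed to key it.
        if split is None:
            raise ValueError("split is required when a single dataset is returned")
        return {split: loaded}
    # Already keyed by split name: copy into a plain dict.
    return dict(loaded)
```

With this shape, downstream code can always iterate over split names, regardless of whether one split or all splits were loaded.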