Refactor datamodule #121

Merged: 19 commits into main from refactor/datamodule, Apr 12, 2022

Conversation

ArneBinder (Owner) commented Mar 11, 2022

This is in preparation for ArneBinder/pytorch-ie-hydra-template-1#1.

Relevant changes:

  • add parameters: train_split, val_split, and test_split
  • prepare_split defaults to the value of train_split
  • taskmodule.prepare is only called if stage == "fit" or stage is None
  • since all constructor parameters of pytorch_lightning.DataModule are deprecated, any remaining keyword arguments passed to __init__ are now forwarded to the DataLoaders
  • simplify the setup method a bit
  • add a simple PIEDatasetDict = Dict[Union[str, Split], List[Document]] to the data.datasets package, in analogy to the huggingface DatasetDict (see the sketch below)
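
For orientation, here is a minimal sketch of how these pieces could fit together. Only the behaviors listed above are taken from the PR; the Document and Split stand-ins, the default split names, and the taskmodule's encode/collate calls are assumptions for illustration, not the actual pytorch-ie implementation:

```python
from typing import Dict, List, Optional, Union

import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Hypothetical stand-ins for the real pytorch-ie types.
Document = dict
Split = str

# The new alias, analogous to the huggingface DatasetDict.
PIEDatasetDict = Dict[Union[str, Split], List[Document]]


class DataModule(pl.LightningDataModule):
    def __init__(
        self,
        dataset: PIEDatasetDict,
        taskmodule,  # a pytorch-ie taskmodule
        train_split: Optional[str] = "train",
        val_split: Optional[str] = "validation",
        test_split: Optional[str] = "test",
        prepare_split: Optional[str] = None,
        **dataloader_kwargs,  # forwarded to the DataLoaders, e.g. batch_size
    ):
        super().__init__()
        self.dataset = dataset
        self.taskmodule = taskmodule
        self.train_split = train_split
        self.val_split = val_split
        self.test_split = test_split
        # prepare_split defaults to the value of train_split
        self.prepare_split = prepare_split or train_split
        self.dataloader_kwargs = dataloader_kwargs

    def setup(self, stage: Optional[str] = None):
        # taskmodule.prepare is only called when fitting
        if stage == "fit" or stage is None:
            self.taskmodule.prepare(self.dataset[self.prepare_split])

    def train_dataloader(self):
        # assumes the taskmodule provides encode() and collate(),
        # as pytorch-ie taskmodules do
        encodings = self.taskmodule.encode(self.dataset[self.train_split])
        return DataLoader(
            encodings,
            collate_fn=self.taskmodule.collate,
            **self.dataloader_kwargs,
        )
```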

ArneBinder changed the title from Refactor datamodule to [WIP] Refactor datamodule on Mar 11, 2022
ArneBinder changed the title from [WIP] Refactor datamodule to Refactor datamodule on Mar 14, 2022
ArneBinder requested a review from ChristophAlt on March 14, 2022 11:58
ArneBinder mentioned this pull request on Mar 28, 2022
ArneBinder force-pushed the refactor/datamodule branch from a2b5f63 to 6fa5d08 on April 12, 2022 12:04
ArneBinder changed the title from Refactor datamodule to [WIP] Refactor datamodule on Apr 12, 2022
ArneBinder changed the title from [WIP] Refactor datamodule to Refactor datamodule on Apr 12, 2022
ArneBinder (Owner, Author) commented Apr 12, 2022

> Honestly, this is beyond confusing. Why not just use either train_size or val_size, which is Union[int, float]? If you want to limit the number of training or validation examples, pytorch-lightning provides all the functionality.

I'm not that experienced with pytorch-lightning yet; how would that work? E.g., if I want 5% or 3 examples of my train data each for training and for validation?

ArneBinder (Owner, Author) commented Apr 12, 2022

And yes, I also don't like the random_train_val_split parameter.

ChristophAlt (Collaborator) commented:

> > Honestly, this is beyond confusing. Why not just use either train_size or val_size, which is Union[int, float]? If you want to limit the number of training or validation examples, pytorch-lightning provides all the functionality.
>
> I'm not that experienced with pytorch-lightning yet; how would that work? E.g., if I want 5% or 3 examples of my train data each for training and for validation?

One way to do this is described here -- you can limit the number of batches by providing either an int (number of batches) or a float (fraction of batches).
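
For illustration, a minimal sketch of those pytorch-lightning Trainer flags (the flag names limit_train_batches and limit_val_batches are real; the concrete values are made up). Note that they count batches, not raw examples:

```python
import pytorch_lightning as pl

# Floats are read as a fraction of batches, ints as an absolute number
# of batches, so "3 examples" would require batch_size=1 here.
trainer = pl.Trainer(
    limit_train_batches=0.05,  # use 5% of the training batches
    limit_val_batches=3,       # use exactly 3 validation batches
)
```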

ArneBinder (Owner, Author) commented:

So would we then just have a val_size parameter to split off some part of the train data?

ChristophAlt (Collaborator) commented:

> And yes, I also don't like the random_train_val_split parameter.

This will be handled by the PIE Dataset / DatasetDict in the future, as it will provide the same interface as the HF implementation (see here).
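
For reference, the HF interface in question is Dataset.train_test_split; a sketch of the 5%-validation case with it (the dataset name is just an example):

```python
from datasets import load_dataset

# train_test_split accepts a float (fraction) or an int (absolute count)
# and returns a DatasetDict with "train" and "test" splits.
dataset = load_dataset("conll2003", split="train")
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_data, val_data = splits["train"], splits["test"]
```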

ArneBinder (Owner, Author) commented Apr 12, 2022

> > And yes, I also don't like the random_train_val_split parameter.
>
> This will be handled by the PIE Dataset / DatasetDict in the future, as it will provide the same interface as the HF implementation (see here).

OK, this really depends on which level the PIE Dataset / DatasetDict operates on: is it about Documents or about training instances (TaskEncodings)? Until now I thought a PIE Dataset holds Documents. EDIT: Note that all the logic related to random_train_val_split deals with training instances (TaskEncodings).

ArneBinder (Owner, Author) commented:

@ChristophAlt can you have a look again?

ChristophAlt merged commit b0ce7af into main on Apr 12, 2022
ChristophAlt deleted the refactor/datamodule branch on April 17, 2022 09:02