Refactor datamodule #121

Merged: 19 commits into main from refactor/datamodule, Apr 12, 2022

Conversation

ArneBinder (Owner) commented Mar 11, 2022

This is in preparation for ArneBinder/pytorch-ie-hydra-template-1#1.

Relevant changes:

  • add parameters: train_split, val_split, and test_split
  • prepare_split defaults to the value of train_split
  • taskmodule.prepare is only called if stage == "fit" or stage is None
  • since all constructor parameters of pytorch_lightning.DataModule are deprecated, any remaining keyword arguments passed to __init__ are now forwarded to the DataLoaders
  • simplify the setup method a bit
  • add a simple PIEDatasetDict = Dict[Union[str, Split], List[Document]] to the data.datasets package, in analogy to the huggingface DatasetDict (see the sketch below)
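
For orientation, here is a minimal sketch of how these pieces could fit together. Only the behaviors listed above are taken from the PR; the Document and Split stand-ins, the default split names, and the taskmodule's encode/collate calls are assumptions for illustration, not the actual pytorch-ie implementation:

```python
from typing import Dict, List, Optional, Union

import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Hypothetical stand-ins for the real pytorch-ie types.
Document = dict
Split = str

# The new alias, analogous to the huggingface DatasetDict.
PIEDatasetDict = Dict[Union[str, Split], List[Document]]


class DataModule(pl.LightningDataModule):
    def __init__(
        self,
        dataset: PIEDatasetDict,
        taskmodule,  # a pytorch-ie taskmodule
        train_split: Optional[str] = "train",
        val_split: Optional[str] = "validation",
        test_split: Optional[str] = "test",
        prepare_split: Optional[str] = None,
        **dataloader_kwargs,  # forwarded to the DataLoaders, e.g. batch_size
    ):
        super().__init__()
        self.dataset = dataset
        self.taskmodule = taskmodule
        self.train_split = train_split
        self.val_split = val_split
        self.test_split = test_split
        # prepare_split defaults to the value of train_split
        self.prepare_split = prepare_split or train_split
        self.dataloader_kwargs = dataloader_kwargs

    def setup(self, stage: Optional[str] = None):
        # taskmodule.prepare is only called when fitting
        if stage == "fit" or stage is None:
            self.taskmodule.prepare(self.dataset[self.prepare_split])

    def train_dataloader(self):
        # assumes the taskmodule provides encode() and collate(),
        # as pytorch-ie taskmodules do
        encodings = self.taskmodule.encode(self.dataset[self.train_split])
        return DataLoader(
            encodings,
            collate_fn=self.taskmodule.collate,
            **self.dataloader_kwargs,
        )
```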

ArneBinder changed the title from Refactor datamodule to [WIP] Refactor datamodule on Mar 11, 2022
ArneBinder changed the title from [WIP] Refactor datamodule to Refactor datamodule on Mar 14, 2022
ArneBinder requested a review from ChristophAlt on March 14, 2022 11:58
ArneBinder mentioned this pull request on Mar 28, 2022
ArneBinder force-pushed the refactor/datamodule branch from a2b5f63 to 6fa5d08 on April 12, 2022 12:04
ArneBinder changed the title from Refactor datamodule to [WIP] Refactor datamodule on Apr 12, 2022
ArneBinder changed the title from [WIP] Refactor datamodule to Refactor datamodule on Apr 12, 2022
ArneBinder (Owner, Author) commented Apr 12, 2022

> Honestly, this is beyond confusing. Why not just use either train_size or val_size, which is Union[int, float]? If you want to limit the number of training or validation examples, pytorch-lightning provides all the functionality.

I'm not that experienced with pytorch-lightning yet; how would that work? E.g., if I want 5% or 3 examples of my train data each for training and for validation?

ArneBinder (Owner, Author) commented Apr 12, 2022

And yes, I also don't like the random_train_val_split parameter.

ChristophAlt (Collaborator) commented:

> > Honestly, this is beyond confusing. Why not just use either train_size or val_size, which is Union[int, float]? If you want to limit the number of training or validation examples, pytorch-lightning provides all the functionality.
>
> I'm not that experienced with pytorch-lightning yet; how would that work? E.g., if I want 5% or 3 examples of my train data each for training and for validation?

One way to do this is described here -- you can limit the number of batches by providing either an int (number of batches) or a float (fraction of batches).
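
For illustration, a minimal sketch of those pytorch-lightning Trainer flags (the flag names limit_train_batches and limit_val_batches are real; the concrete values are made up). Note that they count batches, not raw examples:

```python
import pytorch_lightning as pl

# Floats are read as a fraction of batches, ints as an absolute number
# of batches, so "3 examples" would require batch_size=1 here.
trainer = pl.Trainer(
    limit_train_batches=0.05,  # use 5% of the training batches
    limit_val_batches=3,       # use exactly 3 validation batches
)
```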

ArneBinder (Owner, Author) commented:

So would we then just have a val_size parameter to split off some part of the train data?

ChristophAlt (Collaborator) commented:

> And yes, I also don't like the random_train_val_split parameter.

This will be handled by the PIE Dataset / DatasetDict in the future, as it will provide the same interface as the HF implementation (see here).
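
For reference, the HF interface in question is Dataset.train_test_split; a sketch of the 5%-validation case with it (the dataset name is just an example):

```python
from datasets import load_dataset

# train_test_split accepts a float (fraction) or an int (absolute count)
# and returns a DatasetDict with "train" and "test" splits.
dataset = load_dataset("conll2003", split="train")
splits = dataset.train_test_split(test_size=0.05, seed=42)
train_data, val_data = splits["train"], splits["test"]
```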

ArneBinder (Owner, Author) commented Apr 12, 2022

> > And yes, I also don't like the random_train_val_split parameter.
>
> This will be handled by the PIE Dataset / DatasetDict in the future, as it will provide the same interface as the HF implementation (see here).

OK, this really depends on which level the PIE Dataset / DatasetDict operates on: is it about Documents or about training instances (TaskEncodings)? Until now I thought a PIE Dataset holds Documents. EDIT: Note that all the logic related to random_train_val_split deals with training instances (TaskEncodings).

ArneBinder (Owner, Author) commented:

@ChristophAlt can you have a look again?

ChristophAlt merged commit b0ce7af into main on Apr 12, 2022
ChristophAlt deleted the refactor/datamodule branch on April 17, 2022 09:02