refactor conll2003 #122

Merged
merged 5 commits into from
Apr 11, 2022

ArneBinder
Owner

ArneBinder commented Mar 14, 2022

This PR simplifies the usage of the conll2003 dataset and is in preparation for ArneBinder/pytorch-ie-hydra-template-1#1.

Changes:

  • add PIEDatasetDict to the data.datasets package, defined as Dict[Union[str, Split], List[Document]] in analogy to the huggingface DatasetDict.
  • split load_conll2003:
    • load_hf_conll2003: loads the dataset as a huggingface datasets.DatasetDict
    • from_hf: converts the output of load_hf_conll2003 to a PIEDatasetDict
  • add a helper method single_split_to_dict that unifies the output of huggingface load_dataset into an (Iterable)DatasetDict. It may be moved elsewhere in the future if it proves useful.
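
As a rough illustration of the type alias and the helper described in the list above, here is a minimal sketch. The `Document` dataclass and the `Split = str` alias are hypothetical stand-ins so the snippet runs without any dependencies, and the signature of `single_split_to_dict` (including the `split` default) is an assumption, not necessarily what this PR implements.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping, Union


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document class."""
    text: str


# Stand-in for the Split type named above; split names are plain strings here.
Split = str

# The alias as described in the PR: split name -> list of documents.
PIEDatasetDict = Dict[Union[str, Split], List[Document]]


def single_split_to_dict(data, split="train"):
    """Return a dict of splits, whether `data` is one split or already a mapping.

    Assumed behaviour: a mapping (e.g. a DatasetDict) is passed through,
    anything else is wrapped under the given split name.
    """
    if isinstance(data, Mapping):
        return dict(data)
    return {split: data}
```

With this shape, downstream code can always iterate over `result.items()` regardless of whether a single split or all splits were requested.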

Motivation for the split: In the future (e.g. in the template), we will mostly just need conll2003.from_hf and will use methods from huggingface datasets to load the data in the first place.
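
To make the conversion step more concrete, the following sketch shows what a `from_hf`-style function might do with already loaded data. It is not the code from this PR: `from_hf_sketch` and `bio_to_spans` are hypothetical names, plain dicts and lists stand in for the huggingface dataset objects, and documents are represented as `(text, spans)` tuples instead of real `Document` instances. The label list follows the `ner_tags` feature of the huggingface conll2003 dataset.

```python
# Label names of the huggingface conll2003 `ner_tags` feature, in index order.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def bio_to_spans(tokens, tag_ids):
    """Join tokens with spaces and return (text, [(start, end, label), ...])."""
    text = " ".join(tokens)
    spans = []
    offset = 0
    current = None  # open span as [start, end, label]
    for token, tag_id in zip(tokens, tag_ids):
        tag = NER_LABELS[tag_id]
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the joining space
        # An I- tag with the same label extends the open span.
        if tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1] = end
            continue
        # Anything else closes the open span, if there is one.
        if current is not None:
            spans.append(tuple(current))
            current = None
        if tag != "O":
            current = [start, end, tag[2:]]
    if current is not None:
        spans.append(tuple(current))
    return text, spans


def from_hf_sketch(hf_splits):
    """Convert {split: rows with `tokens` and `ner_tags`} to {split: [(text, spans)]}."""
    return {
        name: [bio_to_spans(row["tokens"], row["ner_tags"]) for row in rows]
        for name, rows in hf_splits.items()
    }
```

Because the function only needs a mapping of splits, the loading step can stay entirely on the huggingface side, which is the separation motivated above.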

ArneBinder requested a review from ChristophAlt March 14, 2022 11:58
ArneBinder mentioned this pull request Mar 28, 2022
ChristophAlt merged commit ed634ac into main Apr 11, 2022
ChristophAlt deleted the refactor/conll2003 branch April 17, 2022 09:02