
Reworked TaskModule and Pipeline #149

Merged: 3 commits into main from dataset_taskmodule_pipeline, Apr 29, 2022

Conversation

@ChristophAlt (Collaborator) commented Apr 27, 2022

  • Pipeline now works on Datasets. An InplaceNotSupportedException is raised if the user attempts to modify a Dataset in place.

  • TaskModule's encode_input and encode_target now receive a single document as input. This may be extended by a batched version in the future.

  • The parameter naming of methods in TaskModule is now more consistent, and returning a TaskEncoding in encode_input considerably simplifies the interaction between different functions.

  • Where possible, the container types have been relaxed to Sequence.

  • Input documents do not have to be passed around anymore (e.g. to decode). Instead, encode_target returns a TaskEncodingSequence that allows us to determine the ordering of documents that have been passed to encode (see the sketch after this list). This is particularly important for TaskModules that either generate multiple TaskEncodings per input Document, or do not generate any. One example is binary relation extraction, where each combination of two spans produces a TaskEncoding, but Documents without spans produce no TaskEncoding at all. In the latter case we still want to copy and return the corresponding document, for instance, if inplace=True.

  • Note: If a Dataset is passed to TaskModule decode, inplace will be ignored implicitly, as there is no way to determine that a Dataset was used to create the TaskEncodings. Edit: This is not entirely true; TaskEncodingSequence could be used to pass on the info, e.g. as inplace_allowed={True, False}.
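
To make the TaskEncodingSequence idea concrete, here is a minimal sketch. Everything except the names TaskEncoding, TaskEncodingSequence, and documents_in_order (all mentioned in this thread) is an assumption, not the PR's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence


@dataclass
class Document:
    # Hypothetical minimal document with annotated spans.
    text: str
    spans: List[Any] = field(default_factory=list)


@dataclass
class TaskEncoding:
    # Hypothetical container for a single unit of model input.
    document: Document
    inputs: Any


class TaskEncodingSequence(Sequence[TaskEncoding]):
    """Acts like a plain sequence of TaskEncodings, but also remembers the
    order of all documents passed to encode, so decode can map predictions
    back to documents, including those that produced zero TaskEncodings
    (e.g. documents without spans in binary relation extraction)."""

    def __init__(
        self,
        task_encodings: List[TaskEncoding],
        documents_in_order: List[Document],
    ):
        self.task_encodings = task_encodings
        self.documents_in_order = documents_in_order

    def __getitem__(self, index):
        return self.task_encodings[index]

    def __len__(self) -> int:
        return len(self.task_encodings)
```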

@ChristophAlt requested a review from ArneBinder on April 27, 2022, 10:50
@ArneBinder (Owner) left a comment

Review part 0.1 (an early question; more to come tomorrow)

@ArneBinder (Owner) left a comment

This looks really nice. However, I'm not so sure about encode_inputs (and encode) returning a TaskEncodingSequence instead of just a plain Sequence[TaskEncoding]. Could you have a look at #150 (which implements your changes without that)? I'm not 100% sold on either of the two versions, but let's discuss both of them.

@ChristophAlt (Collaborator, Author) commented Apr 29, 2022

#150 would also work, but I can't see the advantage. It adds more complexity to methods, and now you have to remember which collection doc_idx is referring to. That's what TaskEncodingSequence is for: you have a single source of truth to decode the output. You mention that it has advantages over this approach when used in a map, but I don't see how. decode is not intended to be used in a map in isolation. Mapping a dataset only makes sense for:

  • prepare: during training, but it doesn't have side effects
  • encode: during training; to create a task encoding dataset so it can be cached. In this case it doesn't matter whether we can identify the corresponding document, because task encodings are independent units of model input and output. The TaskEncodingSequence acts like a normal sequence in this case, and any additional information won't be used when serializing the TaskEncodings. A dataset.map(lambda d: taskmodule.encode(d, ...)) will result in a Dataset with the document type TaskEncoding and serialized accordingly (e.g. without document information, only inputs, targets and metadata; and even then there may be issues if metadata contains objects).
  • pipeline: for end-to-end inference and evaluation on a document level. This is where we need to re-identify the corresponding document, but this should always be done in a single map, e.g. dataset.map(lambda docs: my_pipeline(docs)) (see the sketch below this list). The TaskEncodingSequence just forwards all relevant information from encode to decode, but this will never be exposed to the user.
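
As a concrete illustration of the pipeline point, a hedged sketch using the Hugging Face datasets library. Here my_pipeline is a toy stand-in, not the project's Pipeline class:

```python
from datasets import Dataset  # Hugging Face datasets library


def my_pipeline(batch):
    # Toy stand-in: a real pipeline would run taskmodule.encode, the model,
    # and taskmodule.decode internally; the TaskEncodingSequence that links
    # predictions back to their documents never surfaces to the caller.
    return {"predicted": [text.upper() for text in batch["text"]]}


dataset = Dataset.from_dict({"text": ["first document", "second document"]})

# End-to-end inference on the document level in a single map call,
# mirroring the dataset.map(lambda docs: my_pipeline(docs)) pattern above:
predicted = dataset.map(my_pipeline, batched=True)
print(predicted[0])  # {'text': 'first document', 'predicted': 'FIRST DOCUMENT'}
```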

@ArneBinder (Owner) commented

Thanks a lot for the explanations! It is really helpful to have such overviews written down.

I agree with most of your points. I created the other approach mainly to get familiar with the architecture. But my motivation was also that these 'documents_in_order' are only used when calling the pipeline, so I thought it would better separate concerns. However, let's go with your approach. But maybe we can make that field optional, or only let 'encode_inputs' and 'encode' return a 'TaskEncodingSequence' when 'not is_training', so we do not carry the documents around if not needed? What do you think?

@ChristophAlt (Collaborator, Author) commented

I updated the PR accordingly to only return a TaskEncodingSequence if is_training=False (sketched below). However, this gives rise to another question: whether encode_target always implies is_training, which is incorrect in my opinion and exposes a design flaw in the taskmodule logic. But one thing at a time. 😄
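
A minimal sketch of that conditional return, reusing the hypothetical Document, TaskEncoding, and TaskEncodingSequence classes from the first sketch; the encode signature and the _encode_all helper are assumptions, not the PR's actual code:

```python
from typing import List, Sequence, Tuple, Union


class TaskModuleSketch:
    # Hypothetical stand-in that shows only the conditional return type.

    def _encode_all(
        self, documents: List[Document]
    ) -> Tuple[List[TaskEncoding], List[Document]]:
        # Stand-in for the real per-document encode_input/encode_target
        # logic; a real document may yield zero or many encodings.
        encodings = [TaskEncoding(document=doc, inputs=doc.text) for doc in documents]
        return encodings, list(documents)

    def encode(
        self,
        documents: List[Document],
        is_training: bool = False,
    ) -> Union[Sequence[TaskEncoding], TaskEncodingSequence]:
        task_encodings, documents_in_order = self._encode_all(documents)
        if is_training:
            # Training never needs the document order for decoding,
            # so do not carry it around.
            return task_encodings
        # Inference: return a TaskEncodingSequence so that decode can map
        # predictions back to documents, including those that produced
        # no TaskEncoding at all.
        return TaskEncodingSequence(task_encodings, documents_in_order)
```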

@ArneBinder (Owner) commented

let's merge this :)

@ChristophAlt merged commit 71e9abc into main on Apr 29, 2022
@ChristophAlt deleted the dataset_taskmodule_pipeline branch on May 1, 2022, 10:39