-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reworked TaskModule and Pipeline #149
Conversation
- Pipeline now works on Datasets. A InplaceNotSupportedException exception is raised if the user attempts to modify a Dataset inplace. - TaskModule's encode_input and encode_target now receive a single document as input. This may be extended by a batched version in the future. - The parameter naming of methods in TaskModule is now more consistent. - Where possible the container types have been relaxed to Sequence - Input documents do not have to be passed around anymore (e.g. to decode). Instead encode_target returns a TaskEncodingSequence that allows us to determine the ordering of documents that have been passed to encode. This is particularly important for TaskModules that either generate multiple TaskEncodings per input Document, or do not generate any. One example is binary relation extraction, where each combination of two spans produces a TaskEncoding but Documents without spans produce no TaskEncoding at all. In the later case we still want to copy and return it if for instance inplace=True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review part 0.1 (early question, tomorrow comes more)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really nice. However, I'm not so sure about encode_inputs
(and encode
) returning TaskEncodingSequence
in instead of just the simple Sequence[TaskEncoding]
. Could you have a look at #150 (which implements your changes without that)? I'm not 100% for one of the two versions, but let's discuss both of them.
#150 would also work but I can't see the advantage. It adds more complexity to methods and now you have to remember which collection doc_idx is refering to. That's what TaskEncodingSequence is for, you have a single source of truth to decode the output. You mention that it has advantages over this approach when used in a map, but I don't see how.
|
Thanks a lot for the explanations! It is really helpful to have such overviews written down. I agree with most of your points. I created the other approach mainly to get familiar with the architecture. But my motivation was also, that these 'documents_in_order' are only used when calling the pipeline, so I thought it was a better way to separate concerns. However, let's go with your approach. But maybe we can make that field optional or only let 'encode_inputs' and 'encode' return a 'TaskEncodingSequence' when 'not is_training' so we do not carry them around if not needed? What do you think? |
I updated the PR accordingly to only return a TaskEncodingSequence if |
let's merge this :) |
Pipeline now works on Datasets. A InplaceNotSupportedException exception is raised if the user attempts to modify a Dataset inplace.
TaskModule's encode_input and encode_target now receive a single document as input. This may be extended by a batched version in the future.
The parameter naming of methods in TaskModule is now more consistent and returning a TaskEncoding in encode_input considerably simplifies interaction between different functions.
Where possible the container types have been relaxed to Sequence
Input documents do not have to be passed around anymore (e.g. to decode). Instead encode_target returns a TaskEncodingSequence that allows us to determine the ordering of documents that have been passed to encode. This is particularly important for TaskModules that either generate multiple TaskEncodings per input Document, or do not generate any. One example is binary relation extraction, where each combination of two spans produces a TaskEncoding but Documents without spans produce no TaskEncoding at all. In the later case we still want to copy and return the corresponding document, for instance, if inplace=True.
Note: If a Dataset is passed to TaskModule decode, inplace will be ignored implicitly, as there is now way to determine that a Dataset was used to create the TaskEncodings. Edit: This is not entirely true, TaskEncodingSequence could be used to pass on the info, e.g. as
inplace_allowed={True, False}
.