
Reworked TaskModule and Pipeline #149

Merged: 3 commits into main from dataset_taskmodule_pipeline, Apr 29, 2022

Conversation

@ChristophAlt (Collaborator) commented Apr 27, 2022

  • Pipeline now works on Datasets. An InplaceNotSupportedException is raised if the user attempts to modify a Dataset in place.

  • TaskModule's encode_input and encode_target now receive a single document as input. This may be extended by a batched version in the future.

  • The parameter naming of methods in TaskModule is now more consistent, and returning a TaskEncoding in encode_input considerably simplifies the interaction between different functions.

  • Where possible, the container types have been relaxed to Sequence.

  • Input documents do not have to be passed around anymore (e.g. to decode). Instead, encode_target returns a TaskEncodingSequence that allows us to determine the ordering of documents that have been passed to encode (see the sketch after this list). This is particularly important for TaskModules that either generate multiple TaskEncodings per input Document, or do not generate any. One example is binary relation extraction, where each combination of two spans produces a TaskEncoding, but Documents without spans produce no TaskEncoding at all. In the latter case we still want to copy and return the corresponding document, for instance, if inplace=True.

  • Note: If a Dataset is passed to TaskModule decode, inplace will be ignored implicitly, as there is no way to determine that a Dataset was used to create the TaskEncodings. Edit: This is not entirely true; TaskEncodingSequence could be used to pass on the info, e.g. as inplace_allowed={True, False}.
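
To make the TaskEncodingSequence idea concrete, here is a minimal sketch. Everything except the names TaskEncoding, TaskEncodingSequence, and documents_in_order (all mentioned in this thread) is an assumption, not the PR's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence


@dataclass
class Document:
    # Hypothetical minimal document with annotated spans.
    text: str
    spans: List[Any] = field(default_factory=list)


@dataclass
class TaskEncoding:
    # Hypothetical container for a single unit of model input.
    document: Document
    inputs: Any


class TaskEncodingSequence(Sequence[TaskEncoding]):
    """Acts like a plain sequence of TaskEncodings, but also remembers the
    order of all documents passed to encode, so decode can map predictions
    back to documents, including those that produced zero TaskEncodings
    (e.g. documents without spans in binary relation extraction)."""

    def __init__(
        self,
        task_encodings: List[TaskEncoding],
        documents_in_order: List[Document],
    ):
        self.task_encodings = task_encodings
        self.documents_in_order = documents_in_order

    def __getitem__(self, index):
        return self.task_encodings[index]

    def __len__(self) -> int:
        return len(self.task_encodings)
```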

@ChristophAlt requested a review from ArneBinder on April 27, 2022, 10:50
@ArneBinder (Owner) left a comment

Review part 0.1 (an early question; more to come tomorrow)

@ArneBinder (Owner) left a comment

This looks really nice. However, I'm not so sure about encode_inputs (and encode) returning a TaskEncodingSequence instead of just a plain Sequence[TaskEncoding]. Could you have a look at #150 (which implements your changes without that)? I'm not 100% sold on either of the two versions, but let's discuss both of them.

@ChristophAlt (Collaborator, Author) commented Apr 29, 2022

#150 would also work, but I can't see the advantage. It adds more complexity to methods, and now you have to remember which collection doc_idx is referring to. That's what TaskEncodingSequence is for: you have a single source of truth to decode the output. You mention that it has advantages over this approach when used in a map, but I don't see how. decode is not intended to be used in a map in isolation. Mapping a dataset only makes sense for:

  • prepare: during training, but it doesn't have side effects
  • encode: during training; to create a task encoding dataset so it can be cached. In this case it doesn't matter whether we can identify the corresponding document, because task encodings are independent units of model input and output. The TaskEncodingSequence acts like a normal sequence in this case, and any additional information won't be used when serializing the TaskEncodings. A dataset.map(lambda d: taskmodule.encode(d, ...)) will result in a Dataset with the document type TaskEncoding and serialized accordingly (e.g. without document information, only inputs, targets and metadata; and even then there may be issues if metadata contains objects).
  • pipeline: for end-to-end inference and evaluation on a document level. This is where we need to re-identify the corresponding document, but this should always be done in a single map, e.g. dataset.map(lambda docs: my_pipeline(docs)) (see the sketch below this list). The TaskEncodingSequence just forwards all relevant information from encode to decode, but this will never be exposed to the user.
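
As a concrete illustration of the pipeline point, a hedged sketch using the Hugging Face datasets library. Here my_pipeline is a toy stand-in, not the project's Pipeline class:

```python
from datasets import Dataset  # Hugging Face datasets library


def my_pipeline(batch):
    # Toy stand-in: a real pipeline would run taskmodule.encode, the model,
    # and taskmodule.decode internally; the TaskEncodingSequence that links
    # predictions back to their documents never surfaces to the caller.
    return {"predicted": [text.upper() for text in batch["text"]]}


dataset = Dataset.from_dict({"text": ["first document", "second document"]})

# End-to-end inference on the document level in a single map call,
# mirroring the dataset.map(lambda docs: my_pipeline(docs)) pattern above:
predicted = dataset.map(my_pipeline, batched=True)
print(predicted[0])  # {'text': 'first document', 'predicted': 'FIRST DOCUMENT'}
```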

@ArneBinder (Owner) commented

Thanks a lot for the explanations! It is really helpful to have such overviews written down.

I agree with most of your points. I created the other approach mainly to get familiar with the architecture. But my motivation was also that these 'documents_in_order' are only used when calling the pipeline, so I thought it would better separate concerns. However, let's go with your approach. But maybe we can make that field optional, or only let 'encode_inputs' and 'encode' return a 'TaskEncodingSequence' when 'not is_training', so we do not carry the documents around if not needed? What do you think?

@ChristophAlt (Collaborator, Author) commented

I updated the PR accordingly to only return a TaskEncodingSequence if is_training=False (sketched below). However, this gives rise to another question: whether encode_target always implies is_training, which is incorrect in my opinion and exposes a design flaw in the taskmodule logic. But one thing at a time. 😄
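
A minimal sketch of that conditional return, reusing the hypothetical Document, TaskEncoding, and TaskEncodingSequence classes from the first sketch; the encode signature and the _encode_all helper are assumptions, not the PR's actual code:

```python
from typing import List, Sequence, Tuple, Union


class TaskModuleSketch:
    # Hypothetical stand-in that shows only the conditional return type.

    def _encode_all(
        self, documents: List[Document]
    ) -> Tuple[List[TaskEncoding], List[Document]]:
        # Stand-in for the real per-document encode_input/encode_target
        # logic; a real document may yield zero or many encodings.
        encodings = [TaskEncoding(document=doc, inputs=doc.text) for doc in documents]
        return encodings, list(documents)

    def encode(
        self,
        documents: List[Document],
        is_training: bool = False,
    ) -> Union[Sequence[TaskEncoding], TaskEncodingSequence]:
        task_encodings, documents_in_order = self._encode_all(documents)
        if is_training:
            # Training never needs the document order for decoding,
            # so do not carry it around.
            return task_encodings
        # Inference: return a TaskEncodingSequence so that decode can map
        # predictions back to documents, including those that produced
        # no TaskEncoding at all.
        return TaskEncodingSequence(task_encodings, documents_in_order)
```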

@ArneBinder (Owner) commented

let's merge this :)

@ChristophAlt merged commit 71e9abc into main on Apr 29, 2022
@ChristophAlt deleted the dataset_taskmodule_pipeline branch on May 1, 2022, 10:39