Behaviour of `.collect()` and `.toList()` #3105

cjw85 · 2022-08-08T17:38:00Z

Bug report

The semantics of these two operators are poorly documented and counter intuitive. From the documentation its not clear what the different between .collect() and .toList(), as the examples given are identical. .toList() does no in fact return a list.

Expected behavior and actual behavior

.collect()s behaviour is to flatten a single level of structure in the input items. This is undocumented and seems arbitrary.

The reasonably expected behaviour would be simply to return a value channel with all items from the input queue channel as a single emission.

.toList() returns a channel where it might reasonably be expected to return a... um, well, list.

Steps to reproduce the problem

nextflow.enable.dsl=2

workflow {
    ch1 = Channel.of(
        [1,2],
        [3,4,[5,6]],
        7,
        8
    )
    ch1.view()
    ch1.collect().view()
}

Program output

N E X T F L O W  ~  version 22.04.4
Launching `temp1.nf` [distraught_shockley] DSL2 - revision: 021b10ff49
[1, 2]
[3, 4, [5, 6]]
7
8
[1, 2, 3, 4, [5, 6], 7, 8]

Expected output would be:

[[1,2], [3, 4, [5,6]], 7, 8]

which is what .toList() returns.

Environment

Nextflow version: 22.04.4
Java version: java version "17.0.2" 2022-01-18 LTS
Operating system: macOS
Bash version: N/A

Additional context

(Add any other context about the problem here)

The text was updated successfully, but these errors were encountered:

bentsherman · 2022-08-08T20:34:10Z

I discovered the same thing recently. What I found was that collect and toList are almost identical, except that collect has two hidden options flat and sort, flat is true by default, and these options are in the operator docs but commented out. 🙁 I have uncommented them in #2816 as part of updating the docs to DSL2.

However I think the two operators are otherwise the same. They both emit a single list item. Every operator returns one or more channels due to their asynchronous nature.

By the way, thank you for raising these issues about language semantics. I tend to agree with you about the various inconsistencies in Nextflow. I think the language just developed organically and is in need of some cleanup. DSL2 was a huge step in that direction, but operators in particular need to be reviewed. In this case, I think we should keep either collect or toList and deprecate the other, and maybe deprecate toSortedList as well. Having all of these different but mostly similar operators just causes a lot of confusion.

cjw85 · 2022-08-08T20:43:59Z

I noted similar behaviour previously with other operators: #2437 (comment)

This was originally and issue then got moved to a discussion for reasons unknown.

bentsherman · 2022-08-08T20:58:40Z

I guess discussions have better longevity because they don't get reaped by the stalebot. Better for threads that don't have specific action items from the start. I also find these three operators confusing. It seems like we could have a single operator with an option like type: 'inner' | 'outer'. Or at least combine cross and combine into a single operator.

cjw85 · 2022-08-08T21:41:28Z

combine is a weird one because it has two (I would say) unrelated behaviours as illustrated in the docs: without a join key it's a Cartesian product (all pairwise combinations), with a join key it's an inner product.

To me cross is confusing because both the name and the format of the output are more reminiscent of a Cartesian product, and yet it is operating as an inner join.

It's also not clear from the docs what is the difference between join and the second form of combine. The example for the former uses unique keys whereas the latter has duplicate keys, but the descriptions are in essence identical.

Anyways, we're off topic. I'll end by saying that the situation is so bad that my group had contemplated writing our own library of operators with more rational names and semantics!

bentsherman · 2022-08-08T22:12:34Z

Well combine and cross both do outer products, but the matching key allows you to do a separate outer product for each key. If you look at the examples you'll see that within each matching key an outer product is happening. join is just doing an inner product but it's eerily similar to groupTuple. I think combine and cross are basically doing the same thing.

Anyway, I don't mind veering off because it's part of the same problem. And I'm not surprised! I'm thinking about documenting these inconsistencies in one place, so that we can find a holistic solution that considers the full library of operators.

cjw85 · 2022-08-08T22:51:21Z

The example given for transpose is not anything I would recognise as a transpose from linear algebra. I'm hard pressed to even describe what it's doing.

I can see from the example what the description is getting at but it's just kinda weird the way the the a and b are spread across the rows of the transposed matrix; and relatedly how the transposed matrix for each singleton isn't returned as two 2-tuples. Again like the product operators there's some unintuitive flattening/expansion of nested tuples.

I think I've tried using it in the past and found it very clumsy to use in practice.

bentsherman · 2022-08-10T19:44:47Z

Okay I've added my thoughts on all of these operators to the mega-discussion (#3107). We'll see if someone shows up with a use case for transpose but I think we can probably just get rid of it.

pditommaso · 2022-08-10T19:49:56Z

and break 400 pipelines https://github.com/search?q=transpose+extension%3Anf

pditommaso · 2022-08-10T20:21:42Z

To give a bit more insight about the transpose, it was motivated by the need to have an operation in some extent opposite to groupBy. See here.

the name was taken from the corresponding groovy method for list. See here

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

I agree it this difference should be highlighted (actually I was convinced it was)

bentsherman · 2022-08-10T20:58:10Z

Thanks for clearing that up with transpose.

As for collect and toList, I don't see this difference noted in the docs, but I was able to reproduce the behavior you said. Still, it's the same issue as distinct and unique, they have slightly different behavior but you can't tell from the names which is which, it's kinda arbitrary. Think it would be better to express this difference with an option like ifEmpty instead of two different but nearly identical operators.

Or, deprecate toList and suggest collect().ifEmpty([]) instead.

cjw85 · 2022-08-10T21:27:11Z

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

This strikes at what is possibly a source of confusion, many of the operator are compound operators which could be split into more atomic parts. The one-level flattening of collect() seems like bug given your description. The fact it has a non-default flat=false option smells like a retrofit fix without wanting to change the established default.

After a further 20 minutes thought I can at least describe the transpose operator:

for each key
transpose the matrix of values
take each row of the matrix as a new emission
spread the key across the new emissions

The problem is, theres a far simpler alternative:

for each key
transpose the matrix

Which for the 2D case in the documentation example would give:

[a, [p, u], [q, v]]
[b, [s, x], [t, y]]

and a simple 1D case we go from a 1xn list to a nx1 list.

When viewed this way, transpose as implemented is a compound operator, it transposes the values, and spreads the result across a key. A simple inverse of groupTuple might be called plainly spread, defined as the first dimension of the value matrix being spread across the key.

Incidentally in the search you performed theres a number of examples where a transpose is immediately followed by a groupTuple, which (I think) results in my example above.

stale · 2023-01-16T01:57:59Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bentsherman · 2023-01-17T15:19:33Z

Closing since we clarified the docs for collect and toList in #3276 .

bentsherman added the lang/operators label Aug 8, 2022

bentsherman added the docs label Aug 8, 2022

mribeirodantas mentioned this issue Oct 7, 2022

Clarify the difference between collect and toList #3276

Merged

stale bot added the stale label Jan 16, 2023

bentsherman closed this as completed Jan 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Behaviour of `.collect()` and `.toList()` #3105

Behaviour of `.collect()` and `.toList()` #3105

cjw85 commented Aug 8, 2022 •

edited by bentsherman

Loading

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 10, 2022

pditommaso commented Aug 10, 2022

pditommaso commented Aug 10, 2022

bentsherman commented Aug 10, 2022 •

edited

Loading

cjw85 commented Aug 10, 2022

stale bot commented Jan 16, 2023

bentsherman commented Jan 17, 2023

Behaviour of .collect() and .toList() #3105

Behaviour of .collect() and .toList() #3105

Comments

cjw85 commented Aug 8, 2022 • edited by bentsherman Loading

Bug report

Expected behavior and actual behavior

Steps to reproduce the problem

Program output

Environment

Additional context

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 8, 2022

cjw85 commented Aug 8, 2022

bentsherman commented Aug 10, 2022

pditommaso commented Aug 10, 2022

pditommaso commented Aug 10, 2022

bentsherman commented Aug 10, 2022 • edited Loading

cjw85 commented Aug 10, 2022

stale bot commented Jan 16, 2023

bentsherman commented Jan 17, 2023

Behaviour of `.collect()` and `.toList()` #3105

Behaviour of `.collect()` and `.toList()` #3105

cjw85 commented Aug 8, 2022 •

edited by bentsherman

Loading

bentsherman commented Aug 10, 2022 •

edited

Loading