Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behaviour of .collect() and .toList() #3105

Closed
cjw85 opened this issue Aug 8, 2022 · 13 comments
Closed

Behaviour of .collect() and .toList() #3105

cjw85 opened this issue Aug 8, 2022 · 13 comments

Comments

@cjw85
Copy link
Contributor

cjw85 commented Aug 8, 2022

Bug report

The semantics of these two operators are poorly documented and counter intuitive. From the documentation its not clear what the different between .collect() and .toList(), as the examples given are identical. .toList() does no in fact return a list.

Expected behavior and actual behavior

.collect()s behaviour is to flatten a single level of structure in the input items. This is undocumented and seems arbitrary.

The reasonably expected behaviour would be simply to return a value channel with all items from the input queue channel as a single emission.

.toList() returns a channel where it might reasonably be expected to return a... um, well, list.

Steps to reproduce the problem

nextflow.enable.dsl=2

workflow {
    ch1 = Channel.of(
        [1,2],
        [3,4,[5,6]],
        7,
        8
    )
    ch1.view()
    ch1.collect().view()
}

Program output

N E X T F L O W  ~  version 22.04.4
Launching `temp1.nf` [distraught_shockley] DSL2 - revision: 021b10ff49
[1, 2]
[3, 4, [5, 6]]
7
8
[1, 2, 3, 4, [5, 6], 7, 8]

Expected output would be:

[[1,2], [3, 4, [5,6]], 7, 8]

which is what .toList() returns.

Environment

  • Nextflow version: 22.04.4
  • Java version: java version "17.0.2" 2022-01-18 LTS
  • Operating system: macOS
  • Bash version: N/A

Additional context

(Add any other context about the problem here)

@bentsherman
Copy link
Member

I discovered the same thing recently. What I found was that collect and toList are almost identical, except that collect has two hidden options flat and sort, flat is true by default, and these options are in the operator docs but commented out. 🙁 I have uncommented them in #2816 as part of updating the docs to DSL2.

However I think the two operators are otherwise the same. They both emit a single list item. Every operator returns one or more channels due to their asynchronous nature.

By the way, thank you for raising these issues about language semantics. I tend to agree with you about the various inconsistencies in Nextflow. I think the language just developed organically and is in need of some cleanup. DSL2 was a huge step in that direction, but operators in particular need to be reviewed. In this case, I think we should keep either collect or toList and deprecate the other, and maybe deprecate toSortedList as well. Having all of these different but mostly similar operators just causes a lot of confusion.

@cjw85
Copy link
Contributor Author

cjw85 commented Aug 8, 2022

I noted similar behaviour previously with other operators: #2437 (comment)

This was originally and issue then got moved to a discussion for reasons unknown.

@bentsherman
Copy link
Member

I guess discussions have better longevity because they don't get reaped by the stalebot. Better for threads that don't have specific action items from the start. I also find these three operators confusing. It seems like we could have a single operator with an option like type: 'inner' | 'outer'. Or at least combine cross and combine into a single operator.

@cjw85
Copy link
Contributor Author

cjw85 commented Aug 8, 2022

combine is a weird one because it has two (I would say) unrelated behaviours as illustrated in the docs: without a join key it's a Cartesian product (all pairwise combinations), with a join key it's an inner product.

To me cross is confusing because both the name and the format of the output are more reminiscent of a Cartesian product, and yet it is operating as an inner join.

It's also not clear from the docs what is the difference between join and the second form of combine. The example for the former uses unique keys whereas the latter has duplicate keys, but the descriptions are in essence identical.

Anyways, we're off topic. I'll end by saying that the situation is so bad that my group had contemplated writing our own library of operators with more rational names and semantics!

@bentsherman
Copy link
Member

Well combine and cross both do outer products, but the matching key allows you to do a separate outer product for each key. If you look at the examples you'll see that within each matching key an outer product is happening. join is just doing an inner product but it's eerily similar to groupTuple. I think combine and cross are basically doing the same thing.

Anyway, I don't mind veering off because it's part of the same problem. And I'm not surprised! I'm thinking about documenting these inconsistencies in one place, so that we can find a holistic solution that considers the full library of operators.

@cjw85
Copy link
Contributor Author

cjw85 commented Aug 8, 2022

The example given for transpose is not anything I would recognise as a transpose from linear algebra. I'm hard pressed to even describe what it's doing.

I can see from the example what the description is getting at but it's just kinda weird the way the the a and b are spread across the rows of the transposed matrix; and relatedly how the transposed matrix for each singleton isn't returned as two 2-tuples. Again like the product operators there's some unintuitive flattening/expansion of nested tuples.

I think I've tried using it in the past and found it very clumsy to use in practice.

@bentsherman
Copy link
Member

Okay I've added my thoughts on all of these operators to the mega-discussion (#3107). We'll see if someone shows up with a use case for transpose but I think we can probably just get rid of it.

@pditommaso
Copy link
Member

and break 400 pipelines https://github.com/search?q=transpose+extension%3Anf

@pditommaso
Copy link
Member

To give a bit more insight about the transpose, it was motivated by the need to have an operation in some extent opposite to groupBy. See here.

the name was taken from the corresponding groovy method for list. See here

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

I agree it this difference should be highlighted (actually I was convinced it was)

@bentsherman
Copy link
Member

bentsherman commented Aug 10, 2022

Thanks for clearing that up with transpose.

As for collect and toList, I don't see this difference noted in the docs, but I was able to reproduce the behavior you said. Still, it's the same issue as distinct and unique, they have slightly different behavior but you can't tell from the names which is which, it's kinda arbitrary. Think it would be better to express this difference with an option like ifEmpty instead of two different but nearly identical operators.

Or, deprecate toList and suggest collect().ifEmpty([]) instead.

@cjw85
Copy link
Contributor Author

cjw85 commented Aug 10, 2022

Regarding toList and collect the latter was motivated by the need to have an operator that similarly to toList would produce all items in a single emission. However collect produce nothing, when there no input. Instead toList produces an empty list.

This strikes at what is possibly a source of confusion, many of the operator are compound operators which could be split into more atomic parts. The one-level flattening of collect() seems like bug given your description. The fact it has a non-default flat=false option smells like a retrofit fix without wanting to change the established default.

After a further 20 minutes thought I can at least describe the transpose operator:

  • for each key
  • transpose the matrix of values
  • take each row of the matrix as a new emission
  • spread the key across the new emissions

The problem is, theres a far simpler alternative:

  • for each key
  • transpose the matrix

Which for the 2D case in the documentation example would give:

[a, [p, u], [q, v]]
[b, [s, x], [t, y]]

and a simple 1D case we go from a 1xn list to a nx1 list.

When viewed this way, transpose as implemented is a compound operator, it transposes the values, and spreads the result across a key. A simple inverse of groupTuple might be called plainly spread, defined as the first dimension of the value matrix being spread across the key.

Incidentally in the search you performed theres a number of examples where a transpose is immediately followed by a groupTuple, which (I think) results in my example above.

@stale
Copy link

stale bot commented Jan 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 16, 2023
@bentsherman
Copy link
Member

Closing since we clarified the docs for collect and toList in #3276 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants