Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupTuple on associative arrays #3104

Closed
diego-rt opened this issue Aug 8, 2022 · 13 comments
Closed

GroupTuple on associative arrays #3104

diego-rt opened this issue Aug 8, 2022 · 13 comments

Comments

@diego-rt
Copy link

diego-rt commented Aug 8, 2022

Hi,

I'm using an associative array full of metadata as a key for my group tuple. However, samples that differ in one field are being incorrectly merged. Don't know if this is the expected behaviour but I think it is probably not a desirable one.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine(genomes_ch)
    .view()
    .map { sample_info, genome_name, index -> 
        sample_info."genomeName" = genome_name
        [ sample_info, index ] 
        }
    .view()
    .groupTuple()
    .view()
    

Program output

Observed:

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [happy_einstein] DSL1 - revision: f149707ab9
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

[[sample_name:sample_A, genomeName:contigs], [file_s, file_c]]
[[sample_name:sample_B, genomeName:contigs], [file_s, file_c]]
[[sample_name:sample_C, genomeName:contigs], [file_s, file_c]]

Desired:

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [happy_einstein] DSL1 - revision: f149707ab9
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]
@diego-rt
Copy link
Author

diego-rt commented Aug 8, 2022

Been playing with variations of this snippet trying to fix the behaviour and it actually gets more bizarre: I get a different output every time you run this snippet. Seems to depend on the order of the execution.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .view()
    .map { sample_info, genome_name, index -> 
        def newMap = sample_info
        def newVar = genome_name
        newMap.genomeName = newVar
        return [ newMap, index ] 
        }
    .view()
sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [zen_shockley] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:contigs], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [elegant_stone] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [agitated_gates] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:contigs], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [serene_heyrovsky] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_c]

@diego-rt
Copy link
Author

diego-rt commented Aug 8, 2022

If you could point me to an alternative to reliably add a key value pair to a dictionary, I would really appreciate it!

@bentsherman
Copy link
Member

This is strange and warrants investigation. As a quick workaround, I would just work with a flattened list and convert to a map at the end:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

Channel
    .from( [ "sample_A" ], [ "sample_B" ], [ "sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine(genomes_ch)
    .view()
    .view()
    .groupTuple(by: [0, 1])
    .map { sample_info, genome_name, index -> ['sample_name': sample_info, 'genome': genome_name, 'index': index] }
    .view()

@bentsherman
Copy link
Member

This example might be too sample because you basically just wanted to combine two channels, the groupTuple in your desired output does nothing.

@diego-rt
Copy link
Author

diego-rt commented Aug 8, 2022

Hi Ben, thanks for your quick reply! Yes this is a very simple example just to illustrate what was happening in my main pipeline with some groupTuples and groupkeys getting incorrectly grouped.

I can send you more info tomorrow morning, but the underlying issue is that after the combine, if you try to add a new key-value pair to a map, you get completely random results every time. But only if you add a new entry to the map. This later goes on to affect your groupTuples and stuff like that as a downstream consequence.

@diego-rt
Copy link
Author

diego-rt commented Aug 9, 2022

Hi Ben, the problem is I have like 15 fields in that map and thus turning it into a list will be a nightmare. I haven't managed to find any way to append new data into a map that doesn't result in this issue, in both 21.10.6. and 22.04.5. I'm very surprised no one noticed so far!

In essence, the problem is that any attempt to mutate a map by adding a new key-value pair results in data from a different channel instance being inserted instead. I initially thought it was an issue with groupTuple but I realised last afternoon that this is not the case, and that rather the problem is that wrong data is being inserted into the map and this messes up the groupTuple afterwards.

Here is a minimal reproducible snippet without groupTuple where the 'index' map entry gets messed up:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .view()
    .map { sample_info, genome_name, index -> 
            sample_info.index = index
            return [sample_info, genome_name, index] }
    .view()

Output:

diego@station0 src % ./nextflow nf-playground.nf
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [astonishing_torvalds] DSL1 - revision: 37f7ae2dfe
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, index:file_s], scaffolds, file_s]
[[sample_name:sample_B, index:file_s], scaffolds, file_s]
[[sample_name:sample_A, index:file_c], contigs, file_c]
[[sample_name:sample_B, index:file_c], contigs, file_c]
[[sample_name:sample_C, index:file_c], scaffolds, file_s]
[[sample_name:sample_C, index:file_c], contigs, file_c]
diego@station0 src % ./nextflow nf-playground.nf
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [thirsty_woese] DSL1 - revision: 37f7ae2dfe
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, index:file_c], scaffolds, file_s]
[[sample_name:sample_A, index:file_c], contigs, file_c]
[[sample_name:sample_B, index:file_c], scaffolds, file_s]
[[sample_name:sample_B, index:file_c], contigs, file_c]
[[sample_name:sample_C, index:file_c], scaffolds, file_s]
[[sample_name:sample_C, index:file_c], contigs, file_c]

@bentsherman
Copy link
Member

You know what, this might be the same as #2660 , where they found that modifying a map also updates all shallow copies of it in other channels. Their workaround was to use the clone() method:

    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.clone()
            tmp.index = index
            return [tmp, genome_name, index] }

@diego-rt
Copy link
Author

diego-rt commented Aug 9, 2022

Hey yes, that seems to fix it!

But honestly this is very concerning... It could easily corrupt a dataset and it would be very hard to realise if you don't run some sort of comparison between emissions like in my groupTuple situation... I hope its fixed soon!

@bentsherman
Copy link
Member

Indeed. I'm going to close this issue since it's a duplicate. Feel free to continue discussing in the earlier issue while we look for a solution.

@bentsherman bentsherman closed this as not planned Won't fix, can't repro, duplicate, stale Aug 9, 2022
@diego-rt
Copy link
Author

Hi Ben,

Just want to note that .clone does not work on maps that carry a groupKey... Is there any way to remove a groupKey from an object btw?

@diego-rt
Copy link
Author

Leaving this here in case anybody stumbles into the same problem. If the object you want to clone has a groupKey, you need to extract it by accessing the '.target' entry. The size can likewise be accessed with '.size' in case you want to clone that as well.

This works for groupKey.containing maps:

    .map { sample_info, genome_name, index -> [groupKey(sample_info, 1), genome_name, index] }
    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.target.clone()
            tmp.index = index
            return [tmp, genome_name, index] }

@bentsherman
Copy link
Member

Hi @diego-rt , revisiting this thread and I think we can solve the cloning problem with groupKey by making the GroupKey class cloneable. I will submit a PR for it.

@diego-rt
Copy link
Author

Perfect, thank you so much!

I would only suggest that you publicise a bit more the fact that if you dont use clone then you are dealing with shallow copies that will update all instances of the object when edited. I was totally unaware of this and it could cause huge artefacts...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants