GroupTuple on associative arrays #3104

diego-rt · 2022-08-08T10:35:36Z

Hi,

I'm using an associative array full of metadata as a key for my group tuple. However, samples that differ in one field are being incorrectly merged. Don't know if this is the expected behaviour but I think it is probably not a desirable one.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine(genomes_ch)
    .view()
    .map { sample_info, genome_name, index -> 
        sample_info."genomeName" = genome_name
        [ sample_info, index ] 
        }
    .view()
    .groupTuple()
    .view()

Program output

Observed:

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [happy_einstein] DSL1 - revision: f149707ab9
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

[[sample_name:sample_A, genomeName:contigs], [file_s, file_c]]
[[sample_name:sample_B, genomeName:contigs], [file_s, file_c]]
[[sample_name:sample_C, genomeName:contigs], [file_s, file_c]]

Desired:

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [happy_einstein] DSL1 - revision: f149707ab9
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

The text was updated successfully, but these errors were encountered:

diego-rt · 2022-08-08T16:32:30Z

Been playing with variations of this snippet trying to fix the behaviour and it actually gets more bizarre: I get a different output every time you run this snippet. Seems to depend on the order of the execution.

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .view()
    .map { sample_info, genome_name, index -> 
        def newMap = sample_info
        def newVar = genome_name
        newMap.genomeName = newVar
        return [ newMap, index ] 
        }
    .view()

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [zen_shockley] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:contigs], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [elegant_stone] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:scaffolds], file_s]
[[sample_name:sample_C, genomeName:scaffolds], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [agitated_gates] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:contigs], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_c]

sh-3.2$ ./nextflow run nf-playground.nf 
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [serene_heyrovsky] DSL1 - revision: b3cd5c6014
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, genomeName:scaffolds], file_s]
[[sample_name:sample_B, genomeName:contigs], file_s]
[[sample_name:sample_A, genomeName:contigs], file_c]
[[sample_name:sample_B, genomeName:contigs], file_c]
[[sample_name:sample_C, genomeName:contigs], file_s]
[[sample_name:sample_C, genomeName:contigs], file_c]

diego-rt · 2022-08-08T16:55:04Z

If you could point me to an alternative to reliably add a key value pair to a dictionary, I would really appreciate it!

bentsherman · 2022-08-08T21:16:10Z

This is strange and warrants investigation. As a quick workaround, I would just work with a flattened list and convert to a map at the end:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

Channel
    .from( [ "sample_A" ], [ "sample_B" ], [ "sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine(genomes_ch)
    .view()
    .view()
    .groupTuple(by: [0, 1])
    .map { sample_info, genome_name, index -> ['sample_name': sample_info, 'genome': genome_name, 'index': index] }
    .view()

bentsherman · 2022-08-08T21:17:30Z

This example might be too sample because you basically just wanted to combine two channels, the groupTuple in your desired output does nothing.

diego-rt · 2022-08-08T21:23:39Z

Hi Ben, thanks for your quick reply! Yes this is a very simple example just to illustrate what was happening in my main pipeline with some groupTuples and groupkeys getting incorrectly grouped.

I can send you more info tomorrow morning, but the underlying issue is that after the combine, if you try to add a new key-value pair to a map, you get completely random results every time. But only if you add a new entry to the map. This later goes on to affect your groupTuples and stuff like that as a downstream consequence.

diego-rt · 2022-08-09T06:36:38Z

Hi Ben, the problem is I have like 15 fields in that map and thus turning it into a list will be a nightmare. I haven't managed to find any way to append new data into a map that doesn't result in this issue, in both 21.10.6. and 22.04.5. I'm very surprised no one noticed so far!

In essence, the problem is that any attempt to mutate a map by adding a new key-value pair results in data from a different channel instance being inserted instead. I initially thought it was an issue with groupTuple but I realised last afternoon that this is not the case, and that rather the problem is that wrong data is being inserted into the map and this messes up the groupTuple afterwards.

Here is a minimal reproducible snippet without groupTuple where the 'index' map entry gets messed up:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 1

/*
 * Defines the pipeline inputs parameters
 */

Channel
    .from( [ "sample_name":"sample_A" ], [ "sample_name":"sample_B" ], [ "sample_name":"sample_C" ] )
    .set { samples_ch }

Channel
    .from( ["scaffolds", "file_s"], ["contigs", "file_c"])
    .set { genomes_ch }

samples_ch
    .combine( genomes_ch )
    .view()
    .map { sample_info, genome_name, index -> 
            sample_info.index = index
            return [sample_info, genome_name, index] }
    .view()

Output:

diego@station0 src % ./nextflow nf-playground.nf
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [astonishing_torvalds] DSL1 - revision: 37f7ae2dfe
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, index:file_s], scaffolds, file_s]
[[sample_name:sample_B, index:file_s], scaffolds, file_s]
[[sample_name:sample_A, index:file_c], contigs, file_c]
[[sample_name:sample_B, index:file_c], contigs, file_c]
[[sample_name:sample_C, index:file_c], scaffolds, file_s]
[[sample_name:sample_C, index:file_c], contigs, file_c]
diego@station0 src % ./nextflow nf-playground.nf
N E X T F L O W  ~  version 22.04.5
Launching `nf-playground.nf` [thirsty_woese] DSL1 - revision: 37f7ae2dfe
[[sample_name:sample_A], scaffolds, file_s]
[[sample_name:sample_A], contigs, file_c]
[[sample_name:sample_B], scaffolds, file_s]
[[sample_name:sample_B], contigs, file_c]
[[sample_name:sample_C], scaffolds, file_s]
[[sample_name:sample_C], contigs, file_c]
[[sample_name:sample_A, index:file_c], scaffolds, file_s]
[[sample_name:sample_A, index:file_c], contigs, file_c]
[[sample_name:sample_B, index:file_c], scaffolds, file_s]
[[sample_name:sample_B, index:file_c], contigs, file_c]
[[sample_name:sample_C, index:file_c], scaffolds, file_s]
[[sample_name:sample_C, index:file_c], contigs, file_c]

bentsherman · 2022-08-09T12:25:34Z

You know what, this might be the same as #2660 , where they found that modifying a map also updates all shallow copies of it in other channels. Their workaround was to use the clone() method:

    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.clone()
            tmp.index = index
            return [tmp, genome_name, index] }

diego-rt · 2022-08-09T21:40:26Z

Hey yes, that seems to fix it!

But honestly this is very concerning... It could easily corrupt a dataset and it would be very hard to realise if you don't run some sort of comparison between emissions like in my groupTuple situation... I hope its fixed soon!

bentsherman · 2022-08-09T21:45:15Z

Indeed. I'm going to close this issue since it's a duplicate. Feel free to continue discussing in the earlier issue while we look for a solution.

diego-rt · 2022-08-15T17:51:48Z

Hi Ben,

Just want to note that .clone does not work on maps that carry a groupKey... Is there any way to remove a groupKey from an object btw?

diego-rt · 2022-08-15T19:47:03Z

Leaving this here in case anybody stumbles into the same problem. If the object you want to clone has a groupKey, you need to extract it by accessing the '.target' entry. The size can likewise be accessed with '.size' in case you want to clone that as well.

This works for groupKey.containing maps:

    .map { sample_info, genome_name, index -> [groupKey(sample_info, 1), genome_name, index] }
    .map { sample_info, genome_name, index -> 
            def tmp = sample_info.target.clone()
            tmp.index = index
            return [tmp, genome_name, index] }

bentsherman · 2023-03-09T16:51:50Z

Hi @diego-rt , revisiting this thread and I think we can solve the cloning problem with groupKey by making the GroupKey class cloneable. I will submit a PR for it.

diego-rt · 2023-03-13T16:14:21Z

Perfect, thank you so much!

I would only suggest that you publicise a bit more the fact that if you dont use clone then you are dealing with shallow copies that will update all instances of the object when edited. I was totally unaware of this and it could cause huge artefacts...

bentsherman added the lang/operators label Aug 8, 2022

bentsherman closed this as not planned Won't fix, can't repro, duplicate, stale Aug 9, 2022

bentsherman added the triage/duplicate label Aug 9, 2022

bentsherman mentioned this issue Mar 9, 2023

Make GroupKey cloneable #3738

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GroupTuple on associative arrays #3104

GroupTuple on associative arrays #3104

diego-rt commented Aug 8, 2022

diego-rt commented Aug 8, 2022 •

edited

Loading

diego-rt commented Aug 8, 2022

bentsherman commented Aug 8, 2022

bentsherman commented Aug 8, 2022

diego-rt commented Aug 8, 2022

diego-rt commented Aug 9, 2022

bentsherman commented Aug 9, 2022

diego-rt commented Aug 9, 2022

bentsherman commented Aug 9, 2022

diego-rt commented Aug 15, 2022

diego-rt commented Aug 15, 2022

bentsherman commented Mar 9, 2023

diego-rt commented Mar 13, 2023

GroupTuple on associative arrays #3104

GroupTuple on associative arrays #3104

Comments

diego-rt commented Aug 8, 2022

Program output

diego-rt commented Aug 8, 2022 • edited Loading

diego-rt commented Aug 8, 2022

bentsherman commented Aug 8, 2022

bentsherman commented Aug 8, 2022

diego-rt commented Aug 8, 2022

diego-rt commented Aug 9, 2022

bentsherman commented Aug 9, 2022

diego-rt commented Aug 9, 2022

bentsherman commented Aug 9, 2022

diego-rt commented Aug 15, 2022

diego-rt commented Aug 15, 2022

bentsherman commented Mar 9, 2023

diego-rt commented Mar 13, 2023

diego-rt commented Aug 8, 2022 •

edited

Loading