Pre-process transcripts fasta when using --gencode #864

andreas-wilm · 2022-09-01T05:17:41Z

Description of the bug

Hi all,

I've downloaded references from Gencode (i.e. genome, transcript, annotation) and used nf-core/rnaseq with --aligner star_salmon --pseudo_aligner salmon --gencode. Unfortunately the STAR quantification with salmon fails, because of the GENCODE typical pipe separation of transcript names. See below for the error. The --gencode flag is and can only be used for the salmon index step, if I understand correctly, but it doesn't help with salmon quant (see also here) on the STAR aligned BAM file.

Command used and terminal output

$ nextflow run nf-core/rnaseq -revision 3.8.1 -profile docker  --aligner star_salmon --pseudo_aligner salmon  --gencode --save_reference --gtf /data/references/gencode.v41.annotation.gtf.gz  --fasta /data/references/GRCh38.primary_assembly.genome.fa.gz --transcript_fasta /data/references/gencode.v41.transcripts.fa.gz --input $(pwd)/samplesheet.csv  --outdir /data/processed/ -w /data/scratch -bg

Salmon quant fails with


2022-08-31 13:24:02.095] [jointLog] [critical] Transcript ENST00000555544.1 appeared in the BAM header, but was not in the provided FASTA file
...
[2022-08-31 13:24:02.095] [jointLog] [critical] Please provide a reference FASTA file that includes all targets present in the BAM header

This transcript and all other are part of the transcripts.fa, but in the Gencode typical format:

$ zgrep ENST00000554281.1 /data/references/gencode.v41.transcripts.fa.gz
>ENST00000554281.1|ENSG00000185100.11|OTTHUMG00000170818.3|OTTHUMT00000410534.1|ADSS1-205|ADSS1|567|retained_intron|



### Relevant files

_No response_

### System information

- version 22.04.5 build 5708
-  cloud instance
- executor: local
- container engine: docker
- Amazon Linux 2
- nf-core/rnaseq revision 3.8.1

The text was updated successfully, but these errors were encountered:

drpatelh · 2022-09-02T11:11:45Z

Hi @andreas-wilm ! Thanks for reporting. I have personally never used a GENCODE reference. Do you have an idea as to what we should change in the pipeline to deal with this error?

We should only be using the --gencode flag in the indexing step and not in the quantifcation step at the moment:

rnaseq/conf/modules.config

Lines 89 to 90 in 89bf536

    
           withName: 'SALMON_INDEX' { 
        
               ext.args   = params.gencode ? '--gencode' : ''

tamuanand · 2022-09-07T01:09:40Z

Hi @andreas-wilm

Did you try posting this on salmon GitHub - https://github.com/COMBINE-lab/salmon/issues/

Probably @rob-p might have an answer?

tamuanand · 2022-09-07T01:19:36Z

@drpatelh

Probably what @rob-p mentions in the issue linked by @andreas-wilm could be tried?

Do you have an idea as to what we should change in the pipeline to deal with this error?

However, if your BAM file already
contains the stripped transcript names 
(i.e. if the BAM file header has the names
 without everything past the initial |),
 then I believe you can use the 
following command to have 
salmon do the same to the fasta file
on the fly, so that the names match.

salmon-1.5.1_linux_x86_64/bin/salmon \
quant --ont -p 4 -t <(awk '{ if ($0 ~ "^>") { split($0,a,"|"); print a[1] } else { print $0 } }'  \
Genome_files/gencode.vM24.transcripts.fa) -l U -a Documents/Day2_03_DRS_pass.bam \
-o Documents/counts/Day2_03_DRS_pass

andreas-wilm · 2022-09-12T08:18:24Z

Hi @drpatelh and @tamuanand, thanks for looking into this. Removing everything after the pipe from the gencode.v41.transcripts.fa file (with a simple cut -f 1 -d '|') worked. The awk command above does something similar on the fly using a fifo. I guess this might work as a drop-in replacement when --gencode was provided.

Please note: I did this after the run failed, on the staged files and restarted with -resume, i.e. I haven't tested yet, whether doing this from the start fixes the problem, but it should.

Pancreas-Pratik · 2022-09-18T14:05:34Z

Hey I can confirm the above done by @andreas-wilm works when doing the cut -f 1 -d '|' to the gencode.transcripts.fa from the start of the pipeline, rather than after -resume. The pipeline began and completed successfully, (without any errors or warnings). I used the mouse M30 from Gencode here.

As a side note, which was important for me, I was also able to import the salmon outputs using tximeta (https://github.com/mikelove/tximeta) successfully simply using se <- tximeta(coldata). I didn't have to use makeLinkedTxome(), at all. tximeta was able to match/identify the "index_seq_hash": in meta_info.json from the salmon output of nf-core/rnaseq to the ones in it's pre-made hastable. This was a success here.

The issue I am having now which is probably for https://github.com/Bioconductor/GenomicFeatures and/or https://github.com/mikelove/tximeta is after inputting se <- tximeta(coldata), which is pretty much this warning here: https://support.bioconductor.org/p/127869/ I will try to figure it out. I think I just have to specify somehow what columns to read from in the Gencode GTF (through I think, maybe, GenomicFeatures::makeTxDbFromGFF()) that, I think, tximeta uses.

Regarding the issue of the OP, though, below is what I did which worked (Now, I guess, what to change, and where, in nf-core/rnaseq is the question?):

Hi @drpatelh and @tamuanand, thanks for looking into this. Removing everything after the pipe from the gencode.v41.transcripts.fa file (with a simple cut -f 1 -d '|') worked. The awk command above does something similar on the fly using a fifo. I guess this might work as a drop-in replacement when --gencode was provided.

Please note: I did this after the run failed, on the staged files and restarted with -resume, i.e. I haven't tested yet, whether doing this from the start fixes the problem, but it should.

The following worked (confirming @andreas-wilm above #864 (comment)):

Print working directory:

bash-4.2$ pwd
/data/references

List files in current working directory:

bash-4.2$ ls
gencode.vM30.annotation.gtf.gz	gencode.vM30.transcripts.fa.gz	GRCm39.primary_assembly.genome.fa.gz

What do the transcript names look like in gencode.vM30.transcripts.fa.gz:

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | grep ENS | head
>ENSMUST00000193812.2|ENSMUSG00000102693.2|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|
>ENSMUST00000082908.3|ENSMUSG00000064842.3|-|-|Gm26206-201|Gm26206|110|snRNA|
>ENSMUST00000162897.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086625.1|Xkr4-203|Xkr4|4153|processed_transcript|
>ENSMUST00000159265.2|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000086624.1|Xkr4-202|Xkr4|2989|processed_transcript|
>ENSMUST00000070533.5|ENSMUSG00000051951.6|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-201|Xkr4|3634|protein_coding|
>ENSMUST00000192857.2|ENSMUSG00000102851.2|OTTMUSG00000049958.1|OTTMUST00000127143.1|Gm18956-201|Gm18956|480|processed_pseudogene|
>ENSMUST00000195335.2|ENSMUSG00000103377.2|OTTMUSG00000049960.1|OTTMUST00000127145.1|Gm37180-201|Gm37180|2819|TEC|
>ENSMUST00000192336.2|ENSMUSG00000104017.2|OTTMUSG00000049961.1|OTTMUST00000127146.1|Gm37363-201|Gm37363|2233|TEC|
>ENSMUST00000194099.2|ENSMUSG00000103025.2|OTTMUSG00000049930.1|OTTMUST00000127101.1|Gm37686-201|Gm37686|2309|TEC|
>ENSMUST00000161581.2|ENSMUSG00000089699.2|OTTMUSG00000026352.1|OTTMUST00000065165.1|Gm1992-201|Gm1992|250|lncRNA|

What would they look like without everything after the first |

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | cut -d "|" -f1 | grep ENS | head
>ENSMUST00000193812.2
>ENSMUST00000082908.3
>ENSMUST00000162897.2
>ENSMUST00000159265.2
>ENSMUST00000070533.5
>ENSMUST00000192857.2
>ENSMUST00000195335.2
>ENSMUST00000192336.2
>ENSMUST00000194099.2
>ENSMUST00000161581.2

Above looks okay.

Let's do it and save the output to a file:

bash-4.2$ zcat gencode.vM30.transcripts.fa.gz | cut -d "|" -f1 > gencode.vM30.transcripts_fixed.fa

Let's double check to see if what we did, was what we wanted:

bash-4.2$ cat gencode.vM30.transcripts_fixed.fa | grep ENS | head
>ENSMUST00000193812.2
>ENSMUST00000082908.3
>ENSMUST00000162897.2
>ENSMUST00000159265.2
>ENSMUST00000070533.5
>ENSMUST00000192857.2
>ENSMUST00000195335.2
>ENSMUST00000192336.2
>ENSMUST00000194099.2
>ENSMUST00000161581.2

Above looks good.

(This might not be necessary, but...) Let's gzip the "fixed" transcripts.

bash-4.2$ gzip gencode.vM30.transcripts_fixed.fa

And list the directory again:

bash-4.2$ ls
gencode.vM30.annotation.gtf.gz	gencode.vM30.transcripts.fa.gz	gencode.vM30.transcripts_fixed.fa.gz  GRCm39.primary_assembly.genome.fa.gz

Print working directory:

bash-4.2$ pwd
/data/references

And use the full directory paths in nf-core/rnaseq (making sure to use the one that was just "fixed": gencode.vM30.transcripts_fixed.fa.gz)

bash-4.2$ find "$(pwd)"
/data/references
/data/references/GRCm39.primary_assembly.genome.fa.gz
/data/references/gencode.vM30.transcripts_fixed.fa.gz
/data/references/gencode.vM30.annotation.gtf.gz
/data/references/gencode.vM30.transcripts.fa.gz

So the flag directories in the nf-core/rnaseq should look like (or using these within the nf-params.json)

--gtf /data/references/gencode.vM30.annotation.gtf.gz
--fasta /data/references/GRCm39.primary_assembly.genome.fa.gz
--transcript_fasta /data/references/gencode.vM30.transcripts_fixed.fa.gz

This worked! I stopped getting the same error that @andreas-wilm was getting in the OP here.

Fix #853 and #864

drpatelh · 2022-09-30T10:24:39Z

Fixed in #875

andreas-wilm added the bug Something isn't working label Sep 1, 2022

drpatelh added this to the 3.9 milestone Sep 25, 2022

drpatelh changed the title ~~Salmon quantification faile~~ Pre-process transcripts fasta when using --gencode Sep 30, 2022

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Sep 30, 2022

Fix nf-core#864

3ecb22f

drpatelh mentioned this issue Sep 30, 2022

Fix #853 and #864 #875

Merged

drpatelh added a commit that referenced this issue Sep 30, 2022

Merge pull request #875 from drpatelh/updates

3778221

Fix #853 and #864

drpatelh closed this as completed Sep 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-process transcripts fasta when using --gencode #864

Pre-process transcripts fasta when using --gencode #864

andreas-wilm commented Sep 1, 2022

drpatelh commented Sep 2, 2022

tamuanand commented Sep 7, 2022

tamuanand commented Sep 7, 2022

andreas-wilm commented Sep 12, 2022

Pancreas-Pratik commented Sep 18, 2022

drpatelh commented Sep 30, 2022

Pre-process transcripts fasta when using --gencode #864

Pre-process transcripts fasta when using --gencode #864

Comments

andreas-wilm commented Sep 1, 2022

Description of the bug

Command used and terminal output

drpatelh commented Sep 2, 2022

tamuanand commented Sep 7, 2022

tamuanand commented Sep 7, 2022

andreas-wilm commented Sep 12, 2022

Pancreas-Pratik commented Sep 18, 2022

The following worked (confirming @andreas-wilm above #864 (comment)):

drpatelh commented Sep 30, 2022