Time-consuming to run RSeQC geneBody_coverage2.py on bigwig #81
More details:
Thanks, Jun, for looking into this. Maybe it's a good point to think about using something else - QualiMap can do gene body coverage, too:
Thank you for the suggestion. I will look at it.
MultiQC also doesn't run with this output. That can be fixed, but I wonder if we should look into using a different tool for this step (and others...?).
Parts could be replaced at least by QualiMap:
Hm, so I guess we could just go back and reintroduce the subsampling procedure to make the gene body issues go away. The BigWig-based process seems to take ages - I've had a pipeline run going on some published data for ~2 days now, and another project manager here at QBiC has similar issues with the extremely long runtime of this process. I'd also be happy to implement the QualiMap step and test it here locally in a separate branch before pushing anything - any concerns about this?
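For reference, a rough sketch of what a QualiMap-based step could look like, driven from Python - assuming the `qualimap` CLI from Bioconda is on PATH; all file names here are placeholders, not the pipeline's actual inputs:

```python
# Hedged sketch only: invoke QualiMap's rnaseq mode, which includes a
# gene body coverage profile. Assumes `qualimap` is on PATH; the BAM,
# GTF, and output directory names are placeholders.
import subprocess

subprocess.run([
    "qualimap", "rnaseq",
    "-bam", "sample.bam",          # coordinate-sorted BAM
    "-gtf", "annotation.gtf",      # gene annotation
    "-outdir", "sample_qualimap",  # report output directory
    "--java-mem-size=8G",          # cap the JVM heap
], check=True)
```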
Reverting would be a quick fix, but it feels dirty. The BigWig outputs are useful and I feel that we should be able to generate the gene body plots with this pretty easily. I'd quite like to have a proper look into this / profile the job and try to see what's going wrong. It's a really short script so I'm curious as to where the time is going. Alternatively, we can indeed switch wholesale to another package. I'd prefer not to mix up QC packages too much as it can get pretty complicated and the software requirements are tougher.
I did quite a lot of tests on this and feel like it might be a Nextflow problem.
Aha! @jun-wan you are confirming my suspicions! Now we just need to figure out what's happening 😆
Can you be a bit more explicit about what you mean by rchar > 10 TB? Then I could also have a look at my issues... happy to help here - we're quite affected by this, unfortunately :-(
It's the I/O counter: chars read/written, as reported in the trace file.
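For anyone else puzzled by these numbers, here is a minimal, Linux-only sketch of where such counters come from - assuming the trace values mirror /proc/&lt;pid&gt;/io, which is how the kernel exposes per-process I/O totals:

```python
# Minimal sketch (assumption: trace counters mirror /proc/<pid>/io).
# rchar/wchar count bytes passed through read()/write()-like syscalls,
# including page-cache hits, so they can far exceed real disk traffic.

def read_io_counters(pid: str = "self") -> dict:
    counters = {}
    with open(f"/proc/{pid}/io") as fh:
        for line in fh:
            key, value = line.split(":")
            counters[key.strip()] = int(value)
    return counters

if __name__ == "__main__":
    io = read_io_counters()
    print(f"rchar={io['rchar']} bytes, wchar={io['wchar']} bytes")
```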
To narrow things down - same input dataset, same pipeline release (1.0), same computing system. I had a power outage today, so the run didn't finish. Running again with
Update: Nextflow
Gah, why the hell is that process doing so much disk access? Can anyone see a reason for this in the code? What happens if we try running the job outside of Nextflow - does it take a comparable amount of time?
I'll take one of the BAMs and run it on a workstation to test. Can't really test it on the cluster atm... Update: test running!
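In case it helps others reproduce this: every Nextflow task directory contains a .command.sh with the exact command that was executed, so a job can be timed in isolation. A sketch, where the work-directory hash is a hypothetical placeholder:

```python
# Sketch: re-run a single task outside Nextflow and time it, to see
# whether the slowness is in the script itself or in the executor.
# "work/18/f4cc73..." is a hypothetical placeholder; use a real hash.
import pathlib
import subprocess
import time

workdir = pathlib.Path("work/18/f4cc73...")  # expand to the full hash
start = time.monotonic()
subprocess.run(["bash", ".command.sh"], cwd=workdir, check=True)
print(f"wall time: {time.monotonic() - start:.0f} s")
```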
I think I'm having the same problem: I'm running the pipeline on Uppmax/rackham, with Singularity, using Nextflow v0.30.1.
Looking at the trace file, the disk I/O is not so high: rchar is 12.2 MB and wchar is 663.4 KB.
I should also mention that I get a similar error using the ChIP-seq pipeline https://github.com/nf-core/chipseq, when running BWA. Here I ran on Uppmax/bianca, with Singularity, and I tried Nextflow v0.31.0 and v0.30.1.
CollectRnaSeqMetrics from Picard can do this too if you set the right options. It looks like Picard is already part of the software requirements, but as Phil mentioned it won't be as tidy...
CollectRnaSeqMetrics could be a nice solution actually if we can't figure this out soon... 👍
Yes, that's what I routinely use. You can also feed it ribosomal intervals to look for rRNA contamination (mainly in total RNA preps) - another metric that is useful to have, but you need to create a *.list file containing the intervals from the GTF. Parameters like strandedness can also be set.
I should also mention that the annotation has to be in refFlat format as far as I know, which might be a deal breaker.
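To make the suggestion concrete, a hedged sketch of what such a call could look like - the argument names are standard CollectRnaSeqMetrics options, but the file names are placeholders and the refFlat/interval files would have to be prepared separately:

```python
# Hedged sketch, not the pipeline's actual step: run Picard's
# CollectRnaSeqMetrics, which also emits a gene body coverage chart.
# Assumes the `picard` wrapper is on PATH; all file names are placeholders.
import subprocess

subprocess.run([
    "picard", "CollectRnaSeqMetrics",
    "INPUT=sample.bam",
    "OUTPUT=sample.rna_metrics.txt",
    "CHART_OUTPUT=sample.coverage.pdf",        # gene body coverage plot
    "REF_FLAT=annotation.refFlat",             # annotation must be refFlat
    "RIBOSOMAL_INTERVALS=rRNA.interval_list",  # optional rRNA metrics
    "STRAND_SPECIFICITY=NONE",                 # set to match the library
], check=True)
```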
Hi there, I took the time to read the script and made some changes, the main one being to replace the Python package bx.bbi with pyBigWig for reading the BigWig file. This resulted in a 21x performance increase for the script. I've sent the script to its creator to see if he can make the changes permanent in a new release. In the meantime, I'm attaching the script geneBody_coverage3.txt to this post (the extension
I've also noticed that MultiQC is not producing several of the RSeQC plots, like:
On another note, I've seen the
Please give it a try and let me know if it works for you. Cheers,
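For readers who want the gist of the change without diffing the script: the speedup comes from swapping the BigWig reader. An illustrative (not verbatim) comparison of the two APIs:

```python
# Illustrative only - not the actual geneBody_coverage diff. Both
# functions fetch per-base coverage for a region; pyBigWig's C-backed
# reader is what produced the reported speedup.
import numpy as np

def coverage_bx(path, chrom, start, end):
    from bx.bbi.bigwig_file import BigWigFile  # the old reader
    with open(path, "rb") as fh:
        bw = BigWigFile(fh)
        return np.nan_to_num(bw.get_as_array(chrom, start, end))

def coverage_pybigwig(path, chrom, start, end):
    import pyBigWig  # the new reader
    bw = pyBigWig.open(path)
    try:
        return np.nan_to_num(np.array(bw.values(chrom, start, end)))
    finally:
        bw.close()
```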
Sorry, regarding this point: "folder name" is unrelated to why this metric is not being produced. This was part of my ignorance regarding how MultiQC works. However, there are a couple of errors in the module; I will open a ticket in MultiQC for that. @ewels, do you have a test to check if all the plots for the tools in the pipeline are being properly generated (or generated at all)?
@santiagorevale - no pipeline tests like this yet, no. It could be nice to add. We've thought about validating outputs; perhaps just grepping the MultiQC log output would be an easy start on that road...
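As a trivial first pass at that idea, something like the following could flag modules that logged problems - assuming MultiQC's default behaviour of writing multiqc.log into the multiqc_data directory; the matched phrases are heuristics, not an official log format:

```python
# Rough sketch of the "grep the MultiQC log" idea. Assumes the default
# multiqc_data/multiqc.log location; the patterns are guesses, not an
# official MultiQC log format.
import pathlib
import re

log_path = pathlib.Path("multiqc_data/multiqc.log")
pattern = re.compile(r"error|warning|skipping", re.IGNORECASE)

for line in log_path.read_text().splitlines():
    if pattern.search(line):
        print(line)
```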
Liguo Wang (RSeQC's developer) has released version 3.0 of RSeQC, with my changes to the Gene Body Coverage script. If we update the package in the Docker image, we should see the improvement in performance, although I haven't checked whether all the other scripts still behave the same (no reason why they shouldn't, though).
Hello everyone! Does anyone know if this is still an issue in rnaseq v1.1 with Nextflow >= 0.32.0?
Probably still an issue, yes - hoping to get to it very soon.
RSeQC 3.0 does include the modified script by @santiagorevale (fortunately still named geneBody_coverage2.py). However, RSeQC 3.0 is not yet available on Bioconda, and it requires Python >= 3. Does this cause any trouble, or are we lucky enough not to have any Python 2-only packages?
Thanks @alneberg - I've moved this item to a new issue (#148), as this issue has gotten very long and difficult to track. Hopefully integrating the new RSeQC will solve the problems here and we can close it. If not, we can keep this issue to track them, and perhaps look into migrating to Picard instead.
@apeltzer has now updated the
It would be great if anyone gets a chance to try running the pipeline with some big inputs to see if there's an improvement. Thanks all! Phil
These were some mouse samples that I just ran for integration testing of the
So between 3 and 8.5 hours per sample is a lot, IMHO. I'm going to run some human data today, too.
Human data (Nextflow trace fields):

| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 79 | 18/f4cc73 | 88447 | genebody_coverage (I16R003f01_01_S5_L003_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.167 | 2h 14m 41s | 2h 1m 14s | 100,6% | 9395 MB | 11 GB | 9504 GB | 25 MB |
| 76 | 3f/6c2192 | 88444 | genebody_coverage (I16R003f02_01_S6_L002_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.031 | 2h 27m 2s | 2h 26m 58s | 101,6% | 9393 MB | 11 GB | 9321 GB | 24 MB |
| 77 | 47/63af20 | 88443 | genebody_coverage (I16R003f02_01_S6_L004_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:29.988 | 2h 28m 47s | 2h 28m 36s | 101,5% | 9393 MB | 11 GB | 9348 GB | 24 MB |
| 80 | ab/1bee49 | 88446 | genebody_coverage (I16R003f01_01_S5_L002_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.134 | 3h 28m 38s | 3h 28m 31s | 100,7% | 9397 MB | 11 GB | 9506 GB | 25 MB |
| 78 | 32/a4c8ba | 88449 | genebody_coverage (I16R003f02_01_S6_L003_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.227 | 3h 42m 33s | 2h 27m 13s | 100,5% | 9395 MB | 11 GB | 9278 GB | 24 MB |
| 82 | 75/c10a0e | 88450 | genebody_coverage (I16R003f01_01_S5_L001_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.258 | 4h 17m 13s | 2h 2m 31s | 101,6% | 9397 MB | 11 GB | 9494 GB | 25 MB |
| 75 | 76/a10aa6 | 88445 | genebody_coverage (I16R003f02_01_S6_L001_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.078 | 4h 20m 23s | 4h 20m 17s | 100,3% | 9395 MB | 11 GB | 9272 GB | 24 MB |
| 81 | 85/b42b38 | 88451 | genebody_coverage (I16R003f01_01_S5_L004_R1_001AlignedByCoord.out) | COMPLETED | 0 | 2019-03-22 16:41:30.290 | 4h 29m 58s | 2h 2m 55s | 100,9% | 940 MB | 11 GB | 9445 GB | 25 MB |
I think it's fine now - but we should maybe still keep an eye on it...
Running geneBody_coverage2.py on bigwig files instead of the entire BAM as input requires much less memory, but is still time-consuming (32 hrs for 15M reads, 1.1 GB BAM).