Preparing datasets for large scale phylogenetic analyses

A SEDA pipeline created in Compi that implements the "Preparing datasets for large scale phylogenetic analyses" SEDA-based protocol. Created using the SEDA-Compi pipelines framework.

This protocol shows how to retrieve and process a large number of coding sequences of a given gene. The example concerns GULO, a gene that encodes the protein that catalyzes the final oxidation step of the Vitamin C biosynthetic pathway in animals (http://doi.org/10.7554/eLife.06369).

Quick-start: running the pipeline with sample data

It is recommended to run this quick-start example to check that everything works fine:

Download this ZIP and decompress it. The path where it is extracted will be referred to as the "working directory" (/path/to/working_dir).
Move to the working directory and run ./run.sh "$(pwd)". This runs the entire pipeline with eight input files: four from RefSeq and four from GenBank.

Using real data from NCBI

After running the quick-start, remove the output folder that was created in the working directory.

Then, download the input data as explained here. GenBank data must be placed in input/rename-ncbi_1 and RefSeq data in input/rename-ncbi_2.
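The expected input layout can be prepared and checked with a few commands. This is a minimal sketch, assuming a POSIX shell; WORKING_DIR is a hypothetical variable introduced here for illustration, and the directories are only created if they do not already exist.

```shell
# Hypothetical variable: defaults to the current directory if not set.
WORKING_DIR="${WORKING_DIR:-$(pwd)}"

# Create the two input directories named above (no-op if they already exist).
mkdir -p "$WORKING_DIR/input/rename-ncbi_1" "$WORKING_DIR/input/rename-ncbi_2"

# Report how many files each directory currently holds, as a quick sanity check.
for dir in rename-ncbi_1 rename-ncbi_2; do
    count=$(find "$WORKING_DIR/input/$dir" -type f | wc -l | tr -d ' ')
    echo "input/$dir: $count file(s)"
done
```

GenBank files go in the first directory and RefSeq files in the second; a count of 0 for either one means the download step still needs to be completed.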

Before running the pipeline, you can:

Edit the compi.params file to change the batch size of the four "NCBI rename" (rename-ncbi) and "BLAST" (blast) operations. The batch size is the maximum number of files each SEDA command processes at the same time. If it is not provided, the command attempts to process all files at once. The provided values are appropriate in most cases (workstations with 8-16 GB of RAM). For higher values, the amount of RAM that SEDA can use must be increased.
Change the amount of RAM that SEDA can use by exporting the variable SEDA_JAVA_MEMORY. This must be done before running the run.sh script (e.g. export SEDA_JAVA_MEMORY="-Xmx8G").
Reduce disk usage (at the cost of a longer running time) by creating the files params/rename-ncbi_1.cliParams and params/rename-ncbi_2.cliParams, each containing --output-gzip. The outputs of these operations are then compressed with GZIP and require less disk space.
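The memory and disk-usage options can be applied from the working directory before launching the pipeline. The following is a minimal sketch, assuming a POSIX shell; WORKING_DIR is a hypothetical variable used for illustration, and 8G is only the example value mentioned above.

```shell
# Hypothetical variable: defaults to the current directory if not set.
WORKING_DIR="${WORKING_DIR:-$(pwd)}"

# Let SEDA use up to 8 GB of RAM (example value; adjust to your machine).
export SEDA_JAVA_MEMORY="-Xmx8G"

# Ask both "NCBI rename" operations to write GZIP-compressed outputs.
mkdir -p "$WORKING_DIR/params"
for task in rename-ncbi_1 rename-ncbi_2; do
    printf '%s\n' '--output-gzip' > "$WORKING_DIR/params/$task.cliParams"
done
```

The export only affects the current shell session, so run.sh must be started from the same terminal.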

The full pipeline can now be executed with ./run.sh "$(pwd)".

To run specific tasks, an additional parameter can be passed to the run.sh script: ./run.sh "$(pwd)" "--single-task rename-ncbi_1" or ./run.sh "$(pwd)" "--until merge_3".
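Because the extra parameters must reach run.sh as a single quoted string, a small wrapper can help avoid quoting mistakes. This is only a sketch: run_pipeline is a hypothetical helper that is not part of the pipeline, and it assumes run.sh is located in the working directory and is executable.

```shell
# Hypothetical helper (not part of the pipeline).
# $1: working directory; $2 (optional): extra parameters as ONE string.
run_pipeline() {
    if [ ! -x "$1/run.sh" ]; then
        echo "run.sh not found or not executable in $1" >&2
        return 1
    fi
    # ${2:+"$2"} passes the extra parameters as a single argument,
    # or passes nothing at all when they are omitted.
    (cd "$1" && ./run.sh "$1" ${2:+"$2"})
}
```

For example, run_pipeline "$(pwd)" "--until merge_3" stops after the merge_3 task, while run_pipeline "$(pwd)" runs the full pipeline.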

