We introduce aggregate MIcrobiome COVerage (micov), a bioinformatic tool that efficiently computes precise, optionally-aggregated, genomic coverage positions across numerous metagenomes and arbitrary sample types. Micov offers three key advantages over conventional tools: rapid sample type-specific cumulative coverage calculations, identification of mobile or polymorphic genetic elements, and detection of strain heterogeneity through coverage variations.
The primary input mapping structure for micov is per-sample SAM/BAM or BED (3-column). These data are then consolidated into Parquet files to utilize pushdown filters.
We recommend creating a separate conda environment, and installing into that.
$ pip install micov
To install the most up-to-date version of micov
$ conda create -n micov python=3.12
$ conda install -q --yes -n micov -c conda-forge --file ci/conda_requirements.txt
$ conda activate micov
$ git clone https://github.com/biocore/micov.git
$ cd micov
$ pip install -e .
See below for examples of running micov
on SAM files.
First, activate the Conda environment where micov
is installed:
conda activate micov
Next, we will process SAM files to extract covered positions. Note: If you have
coverages.tgz
coverage files from Qitta, please go to step 4. micov
accepts
headerless SAM/BAM files, and writes out BED-like files which describe the
observed start and stop positions on the references in the SAM data.
If your input files contain headers, remove them using samtools
before running micov:
samtools view -S input.sam > output.sam
Similarly, if your input files are in BAM format, convert them to SAM format using samtools
:
samtools view input.bam > output.sam
Next, compress the SAM data into BED coverge files. The samtools
command above
can be piped into micov
to compress the SAM data into BED-like files if
desired, but for simplicity, we will demonstrate use from SAM. In writing, we
asssume the name of the SAM file corresponds to a sample name. The subsequent
code expects the BED files to have either a .cov
or .cov.gz
extension.
mkdir -p "./example/coverages"
for file in ./example/samfiles/*.sam.xz; do
sample_id=$(basename "$file" .sam.xz)
echo "Processing $file..."
# Run micov compress
xzcat $file | micov compress | gzip > "./example/coverages/${sample_id}.cov.gz"
done
After extracting coverage data, consolidate the .cov
files into Parquet
representations. This requires a length mapping file (length.tsv
), which
maps genome IDs to their corresponding genome lengths. An example length file
can be found in ./example/metadata/length.tsv
. If this file is not available,
it can for example be generated using seqkit
:
seqkit fx2tab --length --name --header-line foo.fasta > length.tsv
Now, consolidate the coverage files. On read, micov
will interpret the non-extension
portion of a filename as the sample ID. For example, given foo/bar/baz.cov.gz
, the
sample ID will be baz
.
micov nonqiita-to-parquet \
--pattern "example/coverages/*.cov.gz" \
--output example/parquet/example \
--lengths example/metadata/length.tsv
micov
provides functionality to convert Qiita-formatted coverage data into Parquet format as well.
mkdir -p "./example/parquet"
# note: multiple coverage files can be specified by repeating the --qiita-coverages argument
micov qiita-to-parquet \
--qiita-coverages "./example/consolidate/consolidated.tgz" \
--output "./example/parquet/example" \
--lengths "./example/metadata/length.tsv"
A series of plots can be constructed guided by metadata. Specifically, micov
produces the following:
- Non-cumulative coverage curves for each genome in the feature metadata.
- Cumulative coverage curves for each genome in the feature metadata. These accumulation data are supported by K-S tests written to the output directory.
- Scaled and unscaled position plots for each genome in the feature metadata.
Categorical metadata can be used to group samples; sample-metadata
is
required. The genomes to examine can optionally be constrained using
features-to-keep
. Specific start and stop regions of genomes can also be
specified within the features-to-keep
but limited to a single region per
genome currently.
micov
expects the first column of a sample metadata file to be the sample ID
under the header sample_id
. Similarly, the first column of a feature metadata
file should be the feature ID under the header genome_id
.
The --output
parameter specified a prefix for the output files.
Optionally, Monte Carlo curves can be produced for the cumulative plots by
specifying --monte
. There are two Monte Carlo options: unfocused
and
focused
. The unfocused
option will select samples at random with any
coverage data, while the focused
option will randomly select samples with
nonzero coverage of the current genome. Both options select independent of
sample metadata, and will select the max number of samples observed in a sample
group.
mkdir -p "./example/plots/per_sample_groups"
micov per-sample-group \
--parquet-coverage "./example/parquet/example" \
--sample-metadata "./example/metadata/sample_metadata.txt" \
--sample-metadata-column "dog" \
--features-to-keep "./example/metadata/feature_metadata.txt" \
--output "./example/plots/per_sample_groups/example" \
--plot
Existing .SAM/.BAM can be converted into coverage percentages by specifying length data at compression:
$ xzcat some_data.sam.xz | micov compress --length length.tsv > coverages.tsv
Multiple coverage files for the same sample can be aggregated into a single file:
$ zcat run1/sample1.cov.gz run2/sample1.cov.gz | micov compress | gzip > combined/sample1.cov.gz