Skip to content

Commit

Permalink
Merge branch 'main' of github.com:davetang/learning_vcf_file
Browse files Browse the repository at this point in the history
  • Loading branch information
davetang committed Mar 25, 2024
2 parents 01696e1 + 2b1d002 commit e557a43
Show file tree
Hide file tree
Showing 8 changed files with 404 additions and 50 deletions.
312 changes: 266 additions & 46 deletions README.md

Large diffs are not rendered by default.

80 changes: 80 additions & 0 deletions csq/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
## README

Problem:

> However, current predictors analyse variants as isolated events, which can
lead to incorrect predictions when adjacent variants alter the same codon, or
when a frame-shifting indel is followed by a frame-restoring indel.

BCFtools/csq is a fast program for haplotype-aware consequence calling which
can take into account known phase.

There are several popular existing programs for variant annotation including:

1. Ensembl Variant Effect Predictor (VEP)
2. SnpEff
3. ANNOVAR

but they do not take phasing into account.

## BCFtools/csq

`bcftools csq` requires a phased VCF, a GFF3 file with gene predictions, and a
reference FASTA file.

```
About: Haplotype-aware consequence caller.
Usage: bcftools csq [OPTIONS] in.vcf
Required options:
-f, --fasta-ref FILE Reference file in fasta format
-g, --gff-annot FILE GFF3 annotation file
CSQ options:
-B, --trim-protein-seq INT Abbreviate protein-changing predictions to max INT aminoacids
-c, --custom-tag STRING Use this tag instead of the default BCSQ
-l, --local-csq Localized predictions, consider only one VCF record at a time
-n, --ncsq INT Maximum number of per-haplotype consequences to consider for each site [15]
-p, --phase a|m|r|R|s How to handle unphased heterozygous genotypes: [r]
a: take GTs as is, create haplotypes regardless of phase (0/1 -> 0|1)
m: merge *all* GTs into a single haplotype (0/1 -> 1, 1/2 -> 1)
r: require phased GTs, throw an error on unphased het GTs
R: create non-reference haplotypes if possible (0/1 -> 1|1, 1/2 -> 1|2)
s: skip unphased hets
Options:
-e, --exclude EXPR Exclude sites for which the expression is true
--force Run even if some sanity checks fail
-i, --include EXPR Select sites for which the expression is true
--no-version Do not append version and command line to the header
-o, --output FILE Write output to a file [standard output]
-O, --output-type b|u|z|v|t[0-9] b: compressed BCF, u: uncompressed BCF, z: compressed VCF
v: uncompressed VCF, t: plain tab-delimited text output, 0-9: compression level [v]
-r, --regions REGION Restrict to comma-separated list of regions
-R, --regions-file FILE Restrict to regions listed in a file
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
-s, --samples -|LIST Samples to include or "-" to apply all variants and ignore samples
-S, --samples-file FILE Samples to include
-t, --targets REGION Similar to -r but streams rather than index-jumps
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
--threads INT Use multithreading with <int> worker threads [0]
-v, --verbose INT Verbosity level 0-2 [1]
Example:
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf
# GFF3 annotation files can be downloaded from Ensembl. e.g. for human:
ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/
```

The program begins by parsing gene predictions in the GFF3 file, then streams
through the VCF file using a fast region lookup at each site to find overlaps
with regions of supported genomic types (exons, CDS, UTRs or general
transcripts). For more details read the paper (see [Further
reading](#further-reading).

## Further reading

* [BCFtools/csq: haplotype-aware variant consequences
](https://academic.oup.com/bioinformatics/article/33/13/2037/3000373)
Binary file added eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz
Binary file not shown.
Binary file not shown.
Binary file added eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz
Binary file not shown.
Binary file not shown.
59 changes: 57 additions & 2 deletions readme.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ It is also relatively straightforward to compile on Linux (if your system has al
```bash
dir=$HOME/tools

ver=1.15
ver=1.18
for tool in htslib bcftools samtools; do
check=${tool}
if [[ ${tool} == htslib ]]; then
Expand Down Expand Up @@ -474,7 +474,62 @@ bcftools view eg/aln.bt.vcf.gz | perl -nle 'BEGIN { srand(1984) } if (/^#/){ pri

### Split an annotation field

The [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) plugin can be used to split a structured field. However, `split-vep` was written to work with VCF files created by `bcftools csq` or [VEP](https://github.com/Ensembl/ensembl-vep). It is possible to get it working with [SnpEff](https://github.com/pcingola/SnpEff), another popular variant annotation tool. with some modifications to the VCF header.
The [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) plugin can be used to split a structured field. `split-vep` was written to work with VCF files created by [VEP](https://github.com/Ensembl/ensembl-vep) or `bcftools csq`.

```{bash engine.opts='-l'}
bcftools +split-vep -h || true
```

#### VEP

An [example VCF file](https://github.com/davetang/vcf_example) that was annotated with VEP is available as `eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz`. To list the annotation fields use `-l`.

```{bash engine.opts='-l'}
bcftools +split-vep -l eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
```

Use `-f` to print the wanted fields in your own specified format; variants without consequences are excluded.


```{bash engine.opts='-l'}
bcftools +split-vep -f '%CHROM:%POS,%ID,%Consequence\n' eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
```

Limit output to missense or more severe variants.

```{bash engine.opts='-l'}
bcftools +split-vep -f '%CHROM:%POS,%ID,%Consequence\n' -s worst:missense+ eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
```

#### BCFtools csq

An [example VCF file](https://github.com/davetang/vcf_example) that was annotated with BCFtools csq is available as `eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz`. The tag added by `csq` is `INFO/BCSQ`, so we need to provide this to split-vep. To list the annotation fields use `-l`.

```{bash engine.opts='-l'}
bcftools +split-vep -a BCSQ -l eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz
```

Use `-f` to print the wanted fields in your own specified format; variants without consequences are excluded.

```{bash engine.opts='-l'}
bcftools +split-vep -a BCSQ -f '%CHROM:%POS,%ID,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
```

The `-d` or `--duplicate` is useful to output annotations per transcript/allele on a new line.

```{bash engine.opts='-l'}
bcftools +split-vep -a BCSQ -f '%transcript,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
```

Use `-d` to split.

```{bash engine.opts='-l'}
bcftools +split-vep -a BCSQ -d -f '%transcript,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
```

#### SnpEff

It is possible to use the split-vep plugin with [SnpEff](https://github.com/pcingola/SnpEff), another popular variant annotation tool with some modifications to the VCF header.

SnpEff provides annotations with the `ANN` tag.

Expand Down
3 changes: 1 addition & 2 deletions script/tools_for_readme.sh
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ if [[ ! -d ${install_path} ]]; then
fi

for tool in htslib bcftools; do
ver=1.16
ver=1.18
check=${tool}
if [[ ${tool} == htslib ]]; then
check=bgzip
Expand Down Expand Up @@ -60,4 +60,3 @@ fi

>&2 echo Done
exit 0

0 comments on commit e557a43

Please sign in to comment.