UNDR ROVER

UNDR ROVER - A fast and accurate variant caller for targeted DNA sequencing.

Authors:

Bernard J Pope (2,3)
Tú Nguyen-Dumont (1)
Daniel J Park (1)
Roger Li (2)
Edmund Lau (2)
Peter Georgeson (2)

Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne.
Victorian Life Sciences Computation Initiative (VLSCI).
Department of Computing and Information Systems, The University of Melbourne.

License:

The BSD 3-Clause License

Requirements

Python 2.7
PyVCF
Pyfaidx
Biopython
SciPy
A reference human genome sequence in a single FASTA file
A file describing the amplicon target coordinates
A file describing the PCR primers used for the specified coordinates
Paired-end DNA sequences where the paired reads significantly overlap, such as the result of the Hi-Plex targeted sequencing system.

Description

UNDR ROVER enables the user to call variants from decompressed FASTQ files directly without having to go through a mapping step.

Reads are organised into blocks, which are initialised from information about known primer sequences. For each block, we record the primer's DNA sequence, coordinates and reference insert sequence. Once the reads have been assigned to blocks (by matching the bases in their primer region to a known primer), we call variants for each block one at a time.

The approach UNDR ROVER takes when processing reads to call variants is to initially assume that all reads contain only single nucleotide variants. If we detect two single nucleotide variants in a row during this variant calling, UNDR ROVER immediatly ceases this step and instead does a full gapped alignment of the read against the reference insert sequence to detect possible insertions and deletions.

Only variants detected in both reads of a read-pair are considered relevant. This allows UNDR ROVER to filter out many sequencing artifacts and produce highly accurate results. However, it does depend on the underlying sequences to have substantially overlapping paired reads, such as the result of the Hi-Plex targeted sequencing system. User-defined thresholds determine the minimum number and proportion of read-pairs a variant must be observed to be considered a 'PASS'. Coverage is also reported for each block to allow for regions which require additional screening to be easily identified.

Installation

We recommend using virtualenv for installing UNDR ROVER in a sandboxed environment instead of a global install. The instructions below assume you are working in a directory where you can install the software.

UNDR ROVER depends on SciPy, which can be challenging to install using pip. We recommend using your operating system package manner to install SciPy before installing UNDR ROVER. The --system-site-packages option to virtualenv tells pip to use the site-installed version of SciPy.

virtualenv --system-site-packages undr_rover_dev
source undr_rover_dev/bin/activate
pip install git+https://github.com/bjpop/undr_rover

You can test that the installation worked correctly by asking undr_rover for its version number:

undr_rover --version
undr_rover 2.0.0

The human genome reference

UNDR ROVER requires a human reference genome in a single FASTA format. Your choice of which version to use will depend on other downstream analyses that you might do. HG19 works well if you don't know which one to choose.

You will need to index the reference genome to produce a .fai file. You can do this with samtools:

samtools faidx reference.fasta

Worked example

The source code of UNDR ROVER comes with a small example dataset for testing purposes.

Obtain a copy of the UNDR ROVER source code:

git clone https://github.com/bjpop/undr_rover

The example data set is in the directory undr_rover/example within the source code.

cd undr_rover/example/

As no mapping is undertaken by UNDR ROVER, this file is required to determine the canonical insert sequence. Note that you will also need the .fai index file.

Create a file specifying target regions.

The example directory contains one called example_coords.txt.

The file should be tab-separated, and contain information in 5 columns. The first column will contain chromosome numbers, while the second and third columns specify the starting and ending coordinates of the target regions. The fourth and fifth columns will name the forward and reverse primers of each target region.

All coordinates are 1-based.

An example of such a file is shown below:

chr11   31811401    31811498    PAX6_1_F1   PAX6_1_R1
chr11   31811499    31811597    PAX6_1_F2   PAX6_1_R2
chr11   31812210    31812307    PAX6_2_F1   PAX6_2_R1
chr11   31812308    31812406    PAX6_2_F2   PAX6_2_R2
chr11   31812399    31812496    PAX6_2_F3   PAX6_2_R3
chr11   31814945    31815042    PAX6_3_F1   PAX6_3_R1
chr11   31815043    31815140    PAX6_3_F2   PAX6_3_R2
chr11   31815147    31815244    PAX6_4_F1   PAX6_4_R1
chr11   31815242    31815339    PAX6_4_F2   PAX6_4_R2
chr11   31815340    31815437    PAX6_4_F3   PAX6_4_R3

Specify the name of this file using the --primer_coords command line parameter.

Create a file specifying the primer sequences.

The example directory contains one called example_primers.txt

UNDR ROVER requires the primer sequences for each primer in order to correctly match reads to target regions. The file should be tab-separated and simply contain 2 columns, with the first containing the primer name and the second containing the primer sequence.

An example of such a file is shown below:

PAX6_1_F1   TAACAGCCATTTTTCTTTCTTTCCT       
PAX6_1_R1   GTGAACCTGATATGTCTCAATACT        
PAX6_1_F2   TTTTTTTTTTTACTGTAATCTTGGCC      
PAX6_1_R2   TGTCTGTTTCTCAAAGGGAATATC        
PAX6_2_F1   GAGATTCTGTTTGGGTAAACTTCT        
PAX6_2_R1   CTACACCCCCCCACATATGCA       
PAX6_2_F2   TGGCTGACTGTTCATGTGTGTC      
PAX6_2_R2   TTTGCCTCTCTCCTCACAGCC       
PAX6_2_F3   AGTATGAGGAGGTCTGGCTGG       
PAX6_2_R3   GGCATCTTTAATGATCAGACTTGT

Specify the name of this file using the --primer_sequences command line parameter.

Run UNDR ROVER from the command line

The example directory contains a file called command.sh which contains an illustrative command to test out the program.

The command is:

undr_rover --primer_coords example_coords.txt \
           --primer_sequences example_primers.txt \
           --reference reference.fasta \
           --out example.vcf \
           --genotype \
           example_R1.fastq example_R2.fastq

The output of the above command will be a VCF file called example.vcf, its contents will look like so:

##fileformat=VCFv4.2
##fileDate=20160324
##source=UNDR ROVER
##INFO=<ID=Sample,Number=1,Type=String,Description="Sample Name">
##INFO=<ID=NV,Number=1,Type=Float,Description="Number of read pairs with variant">
##INFO=<ID=NP,Number=1,Type=Float,Description="Number of read pairs at POS">
##INFO=<ID=PCT,Number=1,Type=Float,Description="Percentage of read pairs at POS with variant">
##INFO=<ID=GT,Number=1,Type=String,Description="Most likely genotype of the variant">
##INFO=<ID=GTP,Number=1,Type=Float,Description="Probability of the genotype">
##FILTER=<ID=at,Description="Variant does not appear in at least 2 read pairs">
##FILTER=<ID=pt,Description="Variant does not appear in at least 5.0% of read pairs for the given region">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr7    152357834   .   T   G   .   at;pt Sample=example_R1.fastq;NV=1;NP=155;PCT=0.65%;GT=TT;GTP=0.659
chr7    152357819   .   T   C   .   PASS    Sample=example_R1.fastq;NV=76;NP=155;PCT=49.03%;GT=TC;GTP=0.957

The VCF file contains a header region which describes the contents of the file, plus one line for each reported variant. Note that not all reported variants will be true. UNDR ROVER reports all sequence differences that it sees in the input sample which appear on both reads in a pair. It then applies stringency filters to those reported variants and annotates them in the FILTER column with PASS if they pass all stringency filters, or with annotations describing which filters they fail (at and pt). In the example above there are two reported variants, although only one of them passes the default filters (PASS). The first reported variant only appears in 1 read pair (NV=1) out of 155 read pairs which aligned to the same position. Therefore it fails two stringency tests: at Variant does not appear in at least 2 read pairs, and pt Variant does not appear in at least 5.0% of read pairs for the given region. These stringency filters are adjustable on the command line. Furthermore we can see that the reference base was T and the alternative (putative variant) base was G, however the genotype called for this position is homozygous for the reference base (GT=TT). This says that UNDR ROVER believes that the evidence in the reads for this position suggests that the reported variant G is an error. UNDR ROVER is quite confident in its reported genotype for this position (GTP=0.659). The second reported variant passes all the stringency tests and agrees with the highly confident computed genotype (GT=TC, GTP=0.957). Out of 155 read pairs for this position, the variant was observed in about half (NV=76).

UNDR ROVER can accept multiple samples on the input command line. You must specify two files for each sample, corresponding to the first and second read in each pair. The read pair files for each sample must be specified consecutively. UNDR ROVER will call the variants in each sample individually, it does not perform joint genotyping.

If you have FASTQ.gz files, they will need to be decompressed.

The variants called will be written to the specified output file. This file will report called variants in accordance with the VCF v4.2 specification. Coverage files containing the number of read-pairs which mapped to each region will be output in coverage_files/sample1.coverage and coverage_files/sample2.coverage. A log file describing the actions taken by the program will be stored in log_file.

Command line arguments

Use the -h command line argument to get a listing of all UNDR ROVER's options:

undr_rover -h

The above command produces the following output. A more detailed explanation of the options is provided below.

Many of the command line arguments have default values.

usage: undr_rover [-h] [--version] --primer_coords PRIMER_COORDS
                  --primer_sequences FILE
                  [--primer_prefix_size PRIMER_PREFIX_SIZE]
                  [--kmer_length KMER_LENGTH]
                  [--kmer_threshold KMER_THRESHOLD]
                  [--primer_bases PRIMER_BASES] [--proportionthresh N]
                  [--absthresh N] [--qualthresh N] [--overlap OVERLAP]
                  [--max_variants N] --reference FILE [--id_info FILE] --out
                  FILE [--log FILE] [--coverdir COVERDIR] [--fast]
                  [--snvthresh N] [--genotype] [--ploidy {1,2}]
                  [--error ERROR]
                  fastqs [fastqs ...]

Find variants from fastqs via a mapping-free approach.

positional arguments:
  fastqs                Fastq files containing reads.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --primer_coords PRIMER_COORDS
                        Primer coordinates in TSV format.
  --primer_sequences FILE
                        Primer base sequences as determined by a primer
                        generating program.
  --primer_prefix_size PRIMER_PREFIX_SIZE
                        Size of primer key for blocks in dictionary.
  --kmer_length KMER_LENGTH
                        Length of k-mer to check after the primer sequence.
                        Defaults to 30.
  --kmer_threshold KMER_THRESHOLD
                        Number of single nucleotide variants deemed acceptable
                        in kmer.
  --primer_bases PRIMER_BASES
                        Number of bases from primer region to use in gapped
                        alignment.Helps with variant calling near the edges of
                        a block. Defaults to 5.
  --proportionthresh N  Keep variants which appear in this proportion of the
                        read pairs. For a given target region, and bin
                        otherwise. Defaults to 0.05.
  --absthresh N         Only keep variants which appear in at least this many
                        read pairs. Defaults to 2.
  --qualthresh N        Minimum base quality score (phred).
  --overlap OVERLAP     Minimum proportion of block which must be overlapped
                        by a read. Defaults to 0.9.
  --max_variants N      Ignore reads with greater than this many variants
                        observed. Defaults to 20.
  --reference FILE      Reference sequences in Fasta format.
  --id_info FILE        File containing rs ID information in VCF format.
  --out FILE            Name of output file containing called variants.
  --log FILE            Logs progress in specified file, defaults to stdout.
  --coverdir COVERDIR   Directory to write coverage files, defaults to current
                        working directory.
  --fast                Use gapped alignment less often, leading to faster run
                        time.
  --snvthresh N         Distance between two single nucleotide variants before
                        going to a gapped alignment.
  --genotype            Compute genotypes for SNVs. Defaults to False.
  --ploidy {1,2}        Ploidy for genotyping 1 = haploid, 2 = diploid.
                        Defaults to 2.
  --error ERROR         Expected base read error rate. Defaults to 0.002.

Explanation of command line options

`-h`

Print a help message and exit.

`--version`

Print the version number of rover and exit.

`--primer_coords COORDS`

Required

A list of primers with their expected coordinates. TSV format with the following data:

chromosome  start   end     forward_primer  reverse_primer

`--primer_sequences SEQ`

Required

A list of primers with their expected base sequences. TSV format with the following data:

primer_name sequence

`--primer_prefix_size N`

Optional. Defaults to 20.

Determines the size of the prefix used for the dictionary which stores primers. Will most likely be set to a value equal to the minimum length of a primer.

`--kmer_length N`

Optional. Defaults to 30.

Determines the length of k-mer to use in k-mer test. In this test, if a certain number of SNVs is found in k-mers from both reads then the read pair is discarded.

`--kmer_threshold N`

Optional. Defaults to 2.

The number of SNVs considered acceptable in the k-mer region of a read. If there are more than N SNVs in both k-mers of a read pair, then the read will be discarded.

`--primer_bases N`

Optional. Defaults to 3.

When there are repeated bases or repeated sequences of bases near the start of a block, detecting variants becomes slightly harder as some variants won't be apparent. By default, UNDR ROVER will use 3 additional bases from the end of the primer region during variant calling. This will help to detect those variants.

Increasing the value of this parameter will mean more bases from the end of the primer region are used, leading to slightly improved accuracy at the cost of slower run time.

`--proportionthresh N`

Optional. Defaults to 0.05.

Only keep variants which appear in this proportion of the read-pairs for a given target region, and bin otherwise. A variant must appear in both reads of a pair to be counted. The proportion is calculated as follows:

N = number of pairs containing this variant in both reads
T = number of read-pairs overlapping the target region

proportion = N/T

Note: variants must pass BOTH the proportionthresh and absthresh thresholds to be kept. If they fail either test then they are discarded.

That is to say:

if N/T  >= proportionthresh and N >= absthresh:
    keep the variant
else:
    discard the variant

`--absthresh N`

Optional. Defaults to 2 read-pairs.

Only keep variants which appear in at least this many read-pairs for a given target region.

See comments above about proportionthresh.

`--qualthresh N`

Optional.

Minimum phred quality score for bases appearing in SNVs and insertions. If this argument is set Rover will only consider SNVs and insertions where all DNA bases in those variants have a quality score greater than or equal to the argument. For example, setting this argument to 35 will cause Rover to discard any SNVs or insertions containing any bases with a score below 35. If the argument is not set then Rover will not consider quality scores in its decision to keep or discard a variant.

`--overlap OVERLAP`

Optional. Defaults to 0.9.

Minimum fraction overlap of read to block region. 0.5 means at least half of a block must be overlapped by a read for the read to be considered for that region. 1.0 would mean the entire block must be overlapped by the read.

`--max_variants N`

Optional. Defaults to 25.

The maximum number of variants which can be observed for a single read before it is considered meaningless and discarded for the purposes of variant calling.

`--reference FILE`

Required.

Reference sequences in FASTA format. UNDR ROVER extracts insert sequences for each block from the reference and compares this with the read sequences to call variants.

`--id_info FILE`

Optional.

UNDR ROVER will find the rs numbers in DBSNP and show them in the output VCF file if it can find the relevant number in DBSNP, which is a vcf format file containing rs numbers for known variants in .vcf.gz format with an accompanying .tbi file (created by tabix).

`--fast`

Optional. Defaults to False.

If set, gapped alignment will only be used when SNVs are detected within snvthresh bases of each other. Leads to a significant decrease in the run time of UNDR ROVER.

`--snvthresh N`

Optional. Defaults to 1.

If --fast is set, gapped alignment will be performed on reads which contain SNVs within N bases of each other. The default setting of 1 allows for the fastest run-time and should provide a sufficient level of accuracy in the majority of cases.

`--coverdir COVERDIR`

Optional. Defaults to current working directory.

Directory to write the coverage files. If the directory does not exist Rover will not try to create it. You must create the directory yourself.

The format of the coverage files is a TSV with:

chr     block_start     block_end       num_pairs

`fastqs [fastqs ...]`

One or more pairs of FASTQ files containing reads for which variant calling will be attempted. FASTQ.gz files will need to decompressed prior to use with UNDR ROVER.

`--genotype`

Optional. Defaults to False.

If this command line argument is specified UNDR ROVER will compute the most likely genotype for every SNV reported in the output VCF file. The most likely genotype is computed from the pileup of bases which align to the same position as the SNV. By default the underlying ploidy of the sample is assumed to be diploid, but this can be overridden with the --ploidy command line argument. Currently UNDR ROVER only supports genotyping for diploid and haploid cells. When calling the genotype UNDR ROVER considers all possible genotypes and picks the one which minimises a statistical test called the G Test. It also computes a probability of that genotype which indicates how confident it is in the call. Genotyping can add time to the computation, so it is not done by default. Currently UNDR ROVER does not do genotyping for INDELs.

`--ploidy {1,2}`

Optional. Defaults to 2. Only relevant if --genotype is also given as command line argument.

If --genotype is given as a command line argument then the underlying ploidy of the sample can be set to either 1 (haploid) or 2 (diploid). UNDR ROVER assumes the same ploidy for all positions when calling genotyping. If you want to perform genotyping on samples with different ploidy (for example on human male sex chromosomes) then you will need to run UNDR ROVER on those reads separately.

`--error`

Optional. Defaults to 0.002.

UNDR ROVER assumes a constant per-base error rate for its input reads. This error rate is used in the the optional genotyping of SNVs.