UNDR ROVER - A fast and accurate variant caller for targeted DNA sequencing.
Authors:
- Bernard J Pope (2,3)
- TĂș Nguyen-Dumont (1)
- Daniel J Park (1)
- Roger Li (2)
- Edmund Lau (2)
- Peter Georgeson (2)
- Genetic Epidemiology Laboratory, Department of Pathology, The University of Melbourne.
- Victorian Life Sciences Computation Initiative (VLSCI).
- Department of Computing and Information Systems, The University of Melbourne.
License:
Requirements
- Python 2.7
- PyVCF
- Pyfaidx
- Biopython
- SciPy
- A reference human genome sequence in a single FASTA file
- A file describing the amplicon target coordinates
- A file describing the PCR primers used for the specified coordinates
- Paired-end DNA sequences where the paired reads significantly overlap, such as the result of the Hi-Plex targeted sequencing system.
Description
UNDR ROVER enables the user to call variants from decompressed FASTQ files directly without having to go through a mapping step.
Reads are organised into blocks, which are initialised from information about known primer sequences. For each block, we record the primer's DNA sequence, coordinates and reference insert sequence. Once the reads have been assigned to blocks (by matching the bases in their primer region to a known primer), we call variants for each block one at a time.
The approach UNDR ROVER takes when processing reads to call variants is to initially assume that all reads contain only single nucleotide variants. If we detect two single nucleotide variants in a row during this variant calling, UNDR ROVER immediatly ceases this step and instead does a full gapped alignment of the read against the reference insert sequence to detect possible insertions and deletions.
Only variants detected in both reads of a read-pair are considered relevant. This allows UNDR ROVER to filter out many sequencing artifacts and produce highly accurate results. However, it does depend on the underlying sequences to have substantially overlapping paired reads, such as the result of the Hi-Plex targeted sequencing system. User-defined thresholds determine the minimum number and proportion of read-pairs a variant must be observed to be considered a 'PASS'. Coverage is also reported for each block to allow for regions which require additional screening to be easily identified.
Installation
We recommend using virtualenv for installing UNDR ROVER in a sandboxed environment instead of a global install. The instructions below assume you are working in a directory where you can install the software.
UNDR ROVER depends on SciPy, which can be challenging to install using pip. We recommend using your operating system package manner to install SciPy before installing UNDR ROVER. The --system-site-packages
option to virtualenv tells pip to use the site-installed version of SciPy.
virtualenv --system-site-packages undr_rover_dev
source undr_rover_dev/bin/activate
pip install git+https://github.com/bjpop/undr_rover
You can test that the installation worked correctly by asking undr_rover for its version number:
undr_rover --version
undr_rover 2.0.0
The human genome reference
UNDR ROVER requires a human reference genome in a single FASTA format. Your choice of which version to use will depend on other downstream analyses that you might do. HG19 works well if you don't know which one to choose.
You will need to index the reference genome to produce a .fai
file. You can do this with samtools:
samtools faidx reference.fasta
Worked example
The source code of UNDR ROVER comes with a small example dataset for testing purposes.
Obtain a copy of the UNDR ROVER source code:
git clone https://github.com/bjpop/undr_rover
The example data set is in the directory undr_rover/example
within the source code.
cd undr_rover/example/
As no mapping is undertaken by UNDR ROVER, this file is required to determine the canonical insert sequence. Note that you will also need the .fai index file.
Create a file specifying target regions.
The example directory contains one called example_coords.txt
.
The file should be tab-separated, and contain information in 5 columns. The first column will contain chromosome numbers, while the second and third columns specify the starting and ending coordinates of the target regions. The fourth and fifth columns will name the forward and reverse primers of each target region.
All coordinates are 1-based.
An example of such a file is shown below:
chr11 31811401 31811498 PAX6_1_F1 PAX6_1_R1
chr11 31811499 31811597 PAX6_1_F2 PAX6_1_R2
chr11 31812210 31812307 PAX6_2_F1 PAX6_2_R1
chr11 31812308 31812406 PAX6_2_F2 PAX6_2_R2
chr11 31812399 31812496 PAX6_2_F3 PAX6_2_R3
chr11 31814945 31815042 PAX6_3_F1 PAX6_3_R1
chr11 31815043 31815140 PAX6_3_F2 PAX6_3_R2
chr11 31815147 31815244 PAX6_4_F1 PAX6_4_R1
chr11 31815242 31815339 PAX6_4_F2 PAX6_4_R2
chr11 31815340 31815437 PAX6_4_F3 PAX6_4_R3
Specify the name of this file using the --primer_coords
command line parameter.
Create a file specifying the primer sequences.
The example directory contains one called example_primers.txt
UNDR ROVER requires the primer sequences for each primer in order to correctly match reads to target regions. The file should be tab-separated and simply contain 2 columns, with the first containing the primer name and the second containing the primer sequence.
An example of such a file is shown below:
PAX6_1_F1 TAACAGCCATTTTTCTTTCTTTCCT
PAX6_1_R1 GTGAACCTGATATGTCTCAATACT
PAX6_1_F2 TTTTTTTTTTTACTGTAATCTTGGCC
PAX6_1_R2 TGTCTGTTTCTCAAAGGGAATATC
PAX6_2_F1 GAGATTCTGTTTGGGTAAACTTCT
PAX6_2_R1 CTACACCCCCCCACATATGCA
PAX6_2_F2 TGGCTGACTGTTCATGTGTGTC
PAX6_2_R2 TTTGCCTCTCTCCTCACAGCC
PAX6_2_F3 AGTATGAGGAGGTCTGGCTGG
PAX6_2_R3 GGCATCTTTAATGATCAGACTTGT
Specify the name of this file using the --primer_sequences
command line parameter.
Run UNDR ROVER from the command line
The example directory contains a file called command.sh
which contains an illustrative command to test out the program.
The command is:
undr_rover --primer_coords example_coords.txt \
--primer_sequences example_primers.txt \
--reference reference.fasta \
--out example.vcf \
--genotype \
example_R1.fastq example_R2.fastq
The output of the above command will be a VCF file called example.vcf
, its contents will look like so:
##fileformat=VCFv4.2
##fileDate=20160324
##source=UNDR ROVER
##INFO=<ID=Sample,Number=1,Type=String,Description="Sample Name">
##INFO=<ID=NV,Number=1,Type=Float,Description="Number of read pairs with variant">
##INFO=<ID=NP,Number=1,Type=Float,Description="Number of read pairs at POS">
##INFO=<ID=PCT,Number=1,Type=Float,Description="Percentage of read pairs at POS with variant">
##INFO=<ID=GT,Number=1,Type=String,Description="Most likely genotype of the variant">
##INFO=<ID=GTP,Number=1,Type=Float,Description="Probability of the genotype">
##FILTER=<ID=at,Description="Variant does not appear in at least 2 read pairs">
##FILTER=<ID=pt,Description="Variant does not appear in at least 5.0% of read pairs for the given region">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr7 152357834 . T G . at;pt Sample=example_R1.fastq;NV=1;NP=155;PCT=0.65%;GT=TT;GTP=0.659
chr7 152357819 . T C . PASS Sample=example_R1.fastq;NV=76;NP=155;PCT=49.03%;GT=TC;GTP=0.957
The VCF file contains a header region which describes the contents of the file, plus one line for each reported variant. Note that not all reported variants will be true. UNDR ROVER reports all sequence differences that it sees in the input sample which appear on both reads in a pair. It then applies stringency filters to those reported variants and annotates them in the FILTER column with PASS if they pass all stringency filters, or with annotations describing which filters they fail (at
and pt
). In the example above there are two reported variants, although only one of them passes the default filters (PASS). The first reported variant only appears in 1 read pair (NV=1) out of 155 read pairs which aligned to the same position. Therefore it fails two stringency tests: at
Variant does not appear in at least 2 read pairs, and pt
Variant does not appear in at least 5.0% of read pairs for the given region. These stringency filters are adjustable on the command line. Furthermore we can see that the reference base was T and the alternative (putative variant) base was G, however the genotype called for this position is homozygous for the reference base (GT=TT). This says that UNDR ROVER believes that the evidence in the reads for this position suggests that the reported variant G is an error. UNDR ROVER is quite confident in its reported genotype for this position (GTP=0.659). The second reported variant passes all the stringency tests and agrees with the highly confident computed genotype (GT=TC, GTP=0.957). Out of 155 read pairs for this position, the variant was observed in about half (NV=76).
UNDR ROVER can accept multiple samples on the input command line. You must specify two files for each sample, corresponding to the first and second read in each pair. The read pair files for each sample must be specified consecutively. UNDR ROVER will call the variants in each sample individually, it does not perform joint genotyping.
If you have FASTQ.gz files, they will need to be decompressed.
The variants called will be written to the specified output file. This file will report called variants in accordance with the VCF v4.2 specification. Coverage files containing the number of read-pairs which mapped to each region will be output in coverage_files/sample1.coverage and coverage_files/sample2.coverage. A log file describing the actions taken by the program will be stored in log_file.
Command line arguments
Use the -h
command line argument to get a listing of all UNDR ROVER's options:
undr_rover -h
The above command produces the following output. A more detailed explanation of the options is provided below.
Many of the command line arguments have default values.
usage: undr_rover [-h] [--version] --primer_coords PRIMER_COORDS
--primer_sequences FILE
[--primer_prefix_size PRIMER_PREFIX_SIZE]
[--kmer_length KMER_LENGTH]
[--kmer_threshold KMER_THRESHOLD]
[--primer_bases PRIMER_BASES] [--proportionthresh N]
[--absthresh N] [--qualthresh N] [--overlap OVERLAP]
[--max_variants N] --reference FILE [--id_info FILE] --out
FILE [--log FILE] [--coverdir COVERDIR] [--fast]
[--snvthresh N] [--genotype] [--ploidy {1,2}]
[--error ERROR]
fastqs [fastqs ...]
Find variants from fastqs via a mapping-free approach.
positional arguments:
fastqs Fastq files containing reads.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--primer_coords PRIMER_COORDS
Primer coordinates in TSV format.
--primer_sequences FILE
Primer base sequences as determined by a primer
generating program.
--primer_prefix_size PRIMER_PREFIX_SIZE
Size of primer key for blocks in dictionary.
--kmer_length KMER_LENGTH
Length of k-mer to check after the primer sequence.
Defaults to 30.
--kmer_threshold KMER_THRESHOLD
Number of single nucleotide variants deemed acceptable
in kmer.
--primer_bases PRIMER_BASES
Number of bases from primer region to use in gapped
alignment.Helps with variant calling near the edges of
a block. Defaults to 5.
--proportionthresh N Keep variants which appear in this proportion of the
read pairs. For a given target region, and bin
otherwise. Defaults to 0.05.
--absthresh N Only keep variants which appear in at least this many
read pairs. Defaults to 2.
--qualthresh N Minimum base quality score (phred).
--overlap OVERLAP Minimum proportion of block which must be overlapped
by a read. Defaults to 0.9.
--max_variants N Ignore reads with greater than this many variants
observed. Defaults to 20.
--reference FILE Reference sequences in Fasta format.
--id_info FILE File containing rs ID information in VCF format.
--out FILE Name of output file containing called variants.
--log FILE Logs progress in specified file, defaults to stdout.
--coverdir COVERDIR Directory to write coverage files, defaults to current
working directory.
--fast Use gapped alignment less often, leading to faster run
time.
--snvthresh N Distance between two single nucleotide variants before
going to a gapped alignment.
--genotype Compute genotypes for SNVs. Defaults to False.
--ploidy {1,2} Ploidy for genotyping 1 = haploid, 2 = diploid.
Defaults to 2.
--error ERROR Expected base read error rate. Defaults to 0.002.
Explanation of command line options
-h
Print a help message and exit.
--version
Print the version number of rover and exit.
--primer_coords COORDS
Required
A list of primers with their expected coordinates. TSV format with the following data:
chromosome start end forward_primer reverse_primer
--primer_sequences SEQ
Required
A list of primers with their expected base sequences. TSV format with the following data:
primer_name sequence
--primer_prefix_size N
Optional. Defaults to 20.
Determines the size of the prefix used for the dictionary which stores primers. Will most likely be set to a value equal to the minimum length of a primer.
--kmer_length N
Optional. Defaults to 30.
Determines the length of k-mer to use in k-mer test. In this test, if a certain number of SNVs is found in k-mers from both reads then the read pair is discarded.
--kmer_threshold N
Optional. Defaults to 2.
The number of SNVs considered acceptable in the k-mer region of a read. If there are more than N SNVs in both k-mers of a read pair, then the read will be discarded.
--primer_bases N
Optional. Defaults to 3.
When there are repeated bases or repeated sequences of bases near the start of a block, detecting variants becomes slightly harder as some variants won't be apparent. By default, UNDR ROVER will use 3 additional bases from the end of the primer region during variant calling. This will help to detect those variants.
Increasing the value of this parameter will mean more bases from the end of the primer region are used, leading to slightly improved accuracy at the cost of slower run time.
--proportionthresh N
Optional. Defaults to 0.05.
Only keep variants which appear in this proportion of the read-pairs for a given target region, and bin otherwise. A variant must appear in both reads of a pair to be counted. The proportion is calculated as follows:
- N = number of pairs containing this variant in both reads
- T = number of read-pairs overlapping the target region
proportion = N/T
Note: variants must pass BOTH the proportionthresh and absthresh thresholds to be kept. If they fail either test then they are discarded.
That is to say:
if N/T >= proportionthresh and N >= absthresh:
keep the variant
else:
discard the variant
--absthresh N
Optional. Defaults to 2 read-pairs.
Only keep variants which appear in at least this many read-pairs for a given target region.
See comments above about proportionthresh
.
--qualthresh N
Optional.
Minimum phred quality score for bases appearing in SNVs and insertions. If this argument is set Rover will only consider SNVs and insertions where all DNA bases in those variants have a quality score greater than or equal to the argument. For example, setting this argument to 35 will cause Rover to discard any SNVs or insertions containing any bases with a score below 35. If the argument is not set then Rover will not consider quality scores in its decision to keep or discard a variant.
--overlap OVERLAP
Optional. Defaults to 0.9.
Minimum fraction overlap of read to block region. 0.5 means at least half of a block must be overlapped by a read for the read to be considered for that region. 1.0 would mean the entire block must be overlapped by the read.
--max_variants N
Optional. Defaults to 25.
The maximum number of variants which can be observed for a single read before it is considered meaningless and discarded for the purposes of variant calling.
--reference FILE
Required.
Reference sequences in FASTA format. UNDR ROVER extracts insert sequences for each block from the reference and compares this with the read sequences to call variants.
--id_info FILE
Optional.
UNDR ROVER will find the rs numbers in DBSNP and show them in the output VCF file if it can find the relevant number in DBSNP, which is a vcf format file containing rs numbers for known variants in .vcf.gz format with an accompanying .tbi file (created by tabix).
--fast
Optional. Defaults to False.
If set, gapped alignment will only be used when SNVs are detected within snvthresh bases of each other. Leads to a significant decrease in the run time of UNDR ROVER.
--snvthresh N
Optional. Defaults to 1.
If --fast
is set, gapped alignment will be performed on reads
which contain SNVs within N bases of each other. The default setting
of 1 allows for the fastest run-time and should provide a sufficient
level of accuracy in the majority of cases.
--coverdir COVERDIR
Optional. Defaults to current working directory.
Directory to write the coverage files. If the directory does not exist Rover will not try to create it. You must create the directory yourself.
The format of the coverage files is a TSV with:
chr block_start block_end num_pairs
fastqs [fastqs ...]
One or more pairs of FASTQ files containing reads for which variant calling will be attempted. FASTQ.gz files will need to decompressed prior to use with UNDR ROVER.
--genotype
Optional. Defaults to False.
If this command line argument is specified UNDR ROVER will compute the most likely genotype for every SNV reported in the output VCF file. The most likely genotype is computed from the pileup of bases which align to the same position as the SNV. By default the underlying ploidy of the sample is assumed to be diploid, but this can be overridden with the --ploidy
command line argument. Currently UNDR ROVER only supports genotyping for diploid and haploid cells. When calling the genotype UNDR ROVER considers all possible genotypes and picks the one which minimises a statistical test called the G Test. It also computes a probability of that genotype which indicates how confident it is in the call. Genotyping can add time to the computation, so it is not done by default. Currently UNDR ROVER does not do genotyping for INDELs.
--ploidy {1,2}
Optional. Defaults to 2. Only relevant if --genotype
is also given as command line argument.
If --genotype
is given as a command line argument then the underlying ploidy of the sample can be set to either 1 (haploid) or 2 (diploid). UNDR ROVER assumes the same ploidy for all positions when calling genotyping. If you want to perform genotyping on samples with different ploidy (for example on human male sex chromosomes) then you will need to run UNDR ROVER on those reads separately.
--error
Optional. Defaults to 0.002.
UNDR ROVER assumes a constant per-base error rate for its input reads. This error rate is used in the the optional genotyping of SNVs.