A Simple Procedure for Combining FTDNA Variant Compare Files
One of the common themes among project administrators and analysts is a desire to easily compare multiple VCFs in a spreadsheet format. Many have created custom scripts to do this. The concept is simple: read the tab-separated files into a collection and group the calls from each VCF by location. Not everyone has the time to maintain such utilities or the ability to create them.
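For readers curious what such a custom script involves, here is a minimal sketch of the read-and-group idea in awk. The function name group_by_position is hypothetical, and the sketch assumes single-sample VCFs whose sample data sits in the tenth column with GT as the first subfield. It is an illustration of the concept, not a replacement for the tooling described next.

```shell
# Hypothetical helper: one output row per position, one genotype column per input VCF.
# Assumes single-sample VCFs; the sample column is field 10 and GT is its first subfield.
group_by_position() {
  awk -F'\t' '
    FNR == 1 { nfiles++ }                 # a new input file begins
    /^#/ { next }                         # skip meta and header lines
    {
      key = $1 "\t" $2                    # CHROM and POS identify the location
      if (!(key in seen)) { seen[key] = 1; order[++n] = key }
      split($10, sample, ":")             # keep only the GT subfield
      gt[key, nfiles] = sample[1]
    }
    END {
      for (i = 1; i <= n; i++) {
        row = order[i]
        for (f = 1; f <= nfiles; f++)     # "." marks a position absent from a file
          row = row "\t" (((order[i], f) in gt) ? gt[order[i], f] : ".")
        print row
      }
    }
  ' "$@"
}
```

Rows come out in first-seen order rather than sorted, and nothing here reconciles differing ALT alleles between files, which is part of why a maintained tool is attractive.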
Combining the VCFs
The GATK suite of utilities, mentioned several times before on this blog, includes a tool for this. CombineVariants lets a researcher align the calls in multiple VCF files as the first step toward building a matrix. The kits are arranged after the ninth column in the familiar fashion. Positions matching the reference yield a 0 genotype. Alternate alleles yield a 0/1 or 1 genotype when actually covered. Positions missing from one of the samples are marked with a ‘.’ or ‘./.’ symbol.
Unfortunately, FTDNA is rather incomplete in how the VCF header is populated. The REJECTED filter is quite common, but its criteria are never defined. Defining the meaning of each filter value in the header is common practice, since it lets analysts make educated decisions about whether to include these calls. By default GATK considers this to be an error in FTDNA’s application of the file standard. Fortunately, there is an --unsafe argument that can be used to ignore the error.
java -jar ~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar \
  -T CombineVariants \
  -R ~/Genomics/Reference/hg19/hg19.fasta \
  -V Sample1_BigY_RawData_20160117/variants.vcf \
  -V Sample2_BigY_RawData_20160712/variants.vcf \
  -o output.vcf \
  -genotypeMergeOptions UNIQUIFY \
  --unsafe
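Before suppressing the check, it can be useful to see which filter values a file uses without defining them. The following is a rough sketch in awk; the function name undeclared_filters is hypothetical, and it assumes the standard VCF layout where FILTER is the seventh column and definitions appear as ##FILTER=<ID=...> header lines.

```shell
# Hypothetical helper: list FILTER values used in a VCF but not declared in its header.
# PASS and "." are skipped because they are reserved values.
undeclared_filters() {
  awk -F'\t' '
    /^##FILTER=<ID=/ {
      id = $0
      sub(/^##FILTER=<ID=/, "", id)       # strip the prefix
      sub(/[,>].*$/, "", id)              # strip everything after the ID
      declared[id] = 1
      next
    }
    /^#/ { next }                         # skip the remaining header lines
    {
      n = split($7, parts, ";")           # FILTER may hold several values
      for (i = 1; i <= n; i++) {
        f = parts[i]
        if (f != "." && f != "PASS" && !(f in declared) && !(f in reported)) {
          reported[f] = 1                 # report each value only once
          print f
        }
      }
    }
  ' "$1"
}
```

Running it as undeclared_filters Sample1_BigY_RawData_20160117/variants.vcf would be expected to print REJECTED if the header is incomplete in the way described above.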
Simplifying the Table
Once all of the VCFs have been assembled in this manner it’s a simple matter of using VariantsToTable. This manipulation tool allows you to specify which components of the VCF to write to a table. Consult the documentation for all the options, but a typical run would look similar to the following.
java -jar ~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar \
  -T VariantsToTable \
  -R ~/Genomics/Reference/hg19/hg19.fasta \
  -V output.vcf \
  -F CHROM -F POS -F REF -F ALT \
  -GF GT \
  -o out.txt
The command produces a table with the chromosome, start location, reference allele, alternate alleles, and the allele present in each of the samples.
CHROM  POS      REF  ALT  aa827c54-4bfa-4a65-9d42-f74002917ac5.variant.GT  cc85f1c9-ed2e-4ea1-8b35-698e9517d7a7.variant2.GT
chrY   2650265  A    .    A                                                A
chrY   2650371  G    .    G                                                G
chrY   2650701  G    .    G                                                G
chrY   2650853  G    .    G                                                G
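Most rows in a table like this are identical across samples, so a common next step is to keep only the positions where the kits disagree. A minimal sketch follows, assuming the output file is tab-delimited with CHROM, POS, REF and ALT in the first four columns and one sample per column after that. The function name differing_rows is hypothetical.

```shell
# Hypothetical helper: keep the header plus rows where the samples disagree.
# Assumes columns 1-4 are CHROM, POS, REF, ALT and every later column is one sample.
differing_rows() {
  awk -F'\t' '
    NR == 1 { print; next }               # always keep the header row
    {
      for (i = 6; i <= NF; i++)           # compare each sample with the first one
        if ($i != $5) { print; next }
    }
  ' "$1"
}
```

Note that this simple comparison treats a no-call in one sample as a difference, which may or may not be what you want.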
One drawback is that the samples are labeled with their unique test identifiers rather than the more familiar kit numbers. This can be overcome with Excel’s VLOOKUP() or a similar lookup against a real database.
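For those who would rather stay on the command line, the same lookup can be sketched in awk. This assumes you maintain your own two-column, tab-separated file mapping each test identifier to a kit number, and that the table is tab-delimited. The function name rename_samples and the mapping file are hypothetical.

```shell
# Hypothetical helper: replace sample identifiers in the table header with kit numbers.
# $1 = tab-separated map (identifier, kit number); $2 = table written by VariantsToTable.
rename_samples() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { kit[$1] = $2; next }      # first file: load the lookup table
    FNR == 1 {                            # header row of the table
      for (i = 1; i <= NF; i++) {
        id = $i
        sub(/\..*$/, "", id)              # drop the ".variant.GT" style suffix
        if (id in kit) $i = kit[id]       # unknown identifiers are left untouched
      }
    }
    { print }
  ' "$1" "$2"
}
```

A call such as rename_samples kits.tsv out.txt > out_named.txt, where kits.tsv is your own mapping file, would write a copy of the table with the friendlier column names.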
A good number of other tools are included that can improve your quality of life. If you notice that INDELs are inconsistently formatted between tests, LeftAlignAndTrimVariants will provide the minimal representation. Named variants can be annotated using VariantAnnotator. This tool typically uses NCBI’s dbSNP as the known variant input, but nothing stops one from taking ybrowse.org’s GFF export and creating their own reference VCF. There are further tools for filtering and selecting variants as well.