Updating a Y-DNA Variant Discovery Workflow
Last fall I shared the Y-DNA Variant Discovery Workflow – Pt. 1, which automated most of the steps of a consistent variant-detection process for NGS data. Since then there have been new versions of the underlying software and changes in the Broad Institute's best practices. This post shares the updated scripts and discusses why some of the changes were necessary.
The scripts will take you to GRCh38 this time around if you follow the steps exactly. Those who wish to remain on GRCh37 for easier comparison with existing data sets should use the references from the older version of this post.
Getting the Tools
GATK 3.6 – Download page
Picard 2.6.0 – Download jar
Samtools – Download page
BWA – Github project
BBMap – Project page
Reference files are available from 1000 Genomes. The files are located in technical/reference/GRCh38_reference_genome in their FTP repository. The known dbSNP and Mills_and_1000G indels files are in the other_mapping_resources subdirectory.
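A sketch of fetching these files from the command line. The host name below is an assumption (the public 1000 Genomes FTP mirror as I know it; verify it before relying on it); the directory layout and file names come from this post.

```shell
# Assumed mirror host; the path after it matches the locations given above.
base=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome
dest=~/Genomics/Reference/GRCh38
mkdir -p "$dest"
echo "$base/GRCh38_full_analysis_set_plus_decoy_hla.fa"
# Uncomment to download:
# wget -P "$dest" "$base/GRCh38_full_analysis_set_plus_decoy_hla.fa"
# wget -P "$dest" "$base/other_mapping_resources/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf"
# wget -P "$dest" "$base/other_mapping_resources/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf"
```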
Setting Up Your Environment
The scripts are designed to work in a BASH shell on a *nix-like environment. A Java 8 runtime environment is required by many of the tools. The setup assumes a Genomics directory in the user's home directory, containing the unpacked tools and a Reference directory holding the 1000 Genomes reference files.
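The layout described above can be sketched as follows; the version numbers match the downloads listed earlier, so adjust them to what you actually installed.

```shell
# Create the directories the scripts expect under the user's home directory.
mkdir -p ~/Genomics/work ~/Genomics/Reference/GRCh38
# Expected tool locations after unpacking each download into ~/Genomics:
#   ~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar
#   ~/Genomics/picard-tools-2.6.0/picard.jar
#   ~/Genomics/bwa/bwa
#   ~/Genomics/bbmap/repair.sh
# Confirm a Java 8 runtime is available before running the tools:
#   java -version
```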
Combining the Steps
gatk=~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar
picard=~/Genomics/picard-tools-2.6.0/picard.jar
reference=~/Genomics/Reference/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa
known=~/Genomics/Reference/GRCh38/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf
snpdb=~/Genomics/Reference/GRCh38/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf
prefix_hla_hit=~/Genomics/Reference/GRCh38/resource-GRCh38/hs38DH-extra.fa
reference_fasta_file=~/Genomics/Reference/GRCh38/resource-GRCh38/hs38DH.fa.alt

# Parse the kit number out of the local naming convention
# [kit#_surname_origin_testtype.bam]
read KIT1 KIT2 KIT3 <<<$(IFS="_"; echo $1)

# Shuffle the source BAM so that the alignment process is not colored by the
# original ordering
samtools bamshuf -Ou $2 - | samtools bam2fq - > interleaved_reads.fq

# Big Y BAMs released for most of 2016 have had the read-pairs named incorrectly.
# BBMap includes a tool which will correct the problem. We have to specify an
# unpaired file as well, since the post filtering removes a small percentage of
# pairs that were aligned to a different chromosome.
~/Genomics/bbmap/repair.sh -Xmx8g in=interleaved_reads.fq out=fixed.fq outsingle=single.fq

# Align the FASTQ using b38. The pipe to bwa-postalt.js is not needed if your
# reference does not contain alt-contigs.
~/Genomics/bwa/bwa mem -t 4 -M -p $reference fixed.fq | \
  ~/Applications/k8-0.2.1/k8-darwin ~/Genomics/bwa/bwakit/bwa-postalt.js \
  -p $prefix_hla_hit $reference_fasta_file > aligned_reads.sam

# Sort the resulting SAM into a coordinate-ordered BAM
java -jar $picard SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

# Mark the duplicate reads
java -jar $picard MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt

# Create read groups using dummy values. The kit number from the naming
# convention becomes the sample name, which will be visible in the g.vcf.
java -jar $picard AddOrReplaceReadGroups INPUT=dedup_reads.bam OUTPUT=$1.b38.bam \
  RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=$KIT1

# Index the resulting b38-aligned BAM
samtools index $1.b38.bam

# Sequencers have systematic technical errors which impact the quality scores.
# This process applies machine learning to empirically model the errors.
java -Xmx4g \
  -jar $gatk \
  -T BaseRecalibrator -nt 1 -l INFO -cov ReadGroupCovariate -cov QualityScoreCovariate \
  -cov CycleCovariate -cov ContextCovariate -R $reference -o recal_data.table -I $1.b38.bam \
  -knownSites $known

# Now that the error model has been learned, adjust the quality scores. Most
# will be reduced but some will increase. None of the scores will be changed
# so much that the calling step will fail.
java -Xmx4g \
  -jar $gatk -T PrintReads -l INFO -R $reference -o $1.recalibrated.b38.bam \
  -I $1.b38.bam -BQSR recal_data.table --disable_bam_indexing
samtools index $1.recalibrated.b38.bam

# Call the recalibrated b38 BAM to gVCF for stage 2.
# Note: Remove the -L arguments if variants other than chrY are desired
java -Xmx8g -jar $gatk -R $reference -T HaplotypeCaller \
  -L chrY \
  -L chrY_KI270740v1_random \
  --dbsnp $snpdb \
  -o $1.b38.g.vcf \
  -I $1.recalibrated.b38.bam \
  --emitRefConfidence GVCF \
  -stand_call_conf 30 \
  -stand_emit_conf 0 -mmq 0

# Compute some statistics about how many bases can be called
java -jar $gatk \
  -T CallableLoci \
  -L chrY \
  -L chrY_KI270740v1_random \
  -R $reference \
  -I $1.recalibrated.b38.bam \
  -summary table.txt \
  -o callable_status.bed

# Save some disk space by compressing the gVCF and creating an index
bgzip $1.b38.g.vcf
tabix -p vcf $1.b38.g.vcf.gz
rm $1.b38.g.vcf.idx

# POST PROCESS - REMOVE THE TEMP FILES
rm *.fq
rm interleaved_reads.fq.gz
rm aligned_reads.sam
rm sorted_reads.bam
rm dedup_reads.bam
rm $1.b38.bam
rm $1.b38.bam.bai
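The script above expects two arguments: the kit-convention file name as $1 and the source BAM as $2. A quick demonstration of the kit-number parse at the top of the script, using a made-up file name and a hypothetical script name:

```shell
# Hypothetical invocation (stage1.sh and the kit details are illustrative):
#   bash stage1.sh 155555_Doe_England_BigY.bam 155555_Doe_England_BigY.bam
# The parse splits $1 on underscores; the first field is the kit number:
set -- 155555_Doe_England_BigY.bam
read KIT1 KIT2 KIT3 <<<$(IFS="_"; echo $1)
echo "$KIT1"    # kit number, used as the RGSM sample name; prints 155555
```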
Batching the Files
Once your gVCFs have been processed, it is advantageous to combine the files into batches of 100 to 200. This is accomplished with the CombineGVCFs tool.
gatk=~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar
picard=~/Genomics/picard-tools-2.6.0/picard.jar
reference=~/Genomics/Reference/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa
known=~/Genomics/Reference/GRCh38/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf
snpdb=~/Genomics/Reference/GRCh38/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf
target=Complete.b38.g.vcf

java -Xmx4g \
  -jar $gatk \
  -R $reference \
  -T CombineGVCFs \
  -o ~/Genomics/work/$target \
  -V ./A.b38.g.vcf.gz \
  -V ./Q.b38.g.vcf.gz \
  -V ./R.b38.g.vcf.gz

bgzip ~/Genomics/work/$target
tabix -p vcf ~/Genomics/work/$target.gz
rm ~/Genomics/work/$target.idx
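With 100 to 200 files per batch, listing each -V by hand gets tedious. A sketch of building the argument list from a glob instead; the file names here are illustrative and are created only so the loop is runnable as-is.

```shell
# Build the -V list from every gVCF in a (temporary, illustrative) directory.
tmp=$(mktemp -d) && cd "$tmp"
touch A.b38.g.vcf.gz Q.b38.g.vcf.gz R.b38.g.vcf.gz
args=""
for f in ./*.b38.g.vcf.gz; do
  args="$args -V $f"
done
# Show the CombineGVCFs command line that would be run:
echo "java -Xmx4g -jar \$gatk -R \$reference -T CombineGVCFs -o ~/Genomics/work/Complete.b38.g.vcf$args"
```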
Discovering the Variants
Once the batches have been created, it's time to genotype the set with GenotypeGVCFs. This process allows every test to be sampled whenever a variant is discovered in any one of the tests, which is useful for generating a call matrix to feed into a parsimony tree algorithm later.
gatk=~/Genomics/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar
reference=~/Genomics/Reference/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa

java -Xmx4g \
  -jar $gatk \
  -T GenotypeGVCFs \
  -R $reference \
  -V $1 \
  -o gatksnps.b38.vcf -nt 8

java -Xmx4g \
  -jar $gatk \
  -T VariantFiltration \
  -R $reference \
  --filterExpression "DP < 5 " --filterName "LowCoverage" \
  --filterExpression "QD < 1.5 " --filterName "LowQD" \
  --filterExpression "QUAL < 30.0 && QUAL < 50.0 " --filterName "LowQual" \
  --filterExpression "QUAL <= 30.0" --filterName "VeryLowQual" \
  --filterExpression "QUAL < 10.0" --filterName "REJECTED" \
  --clusterWindowSize 10 \
  -o gatksnps.b38.filtered.vcf \
  -V gatksnps.b38.vcf
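VariantFiltration only annotates the FILTER column; nothing is removed. A sketch of pulling out the rows that passed, as one might before building a call matrix. The toy input is generated inline so the command is runnable as-is; with real data the input would be the gatksnps.b38.filtered.vcf produced above.

```shell
# Toy two-variant VCF: column 7 is FILTER; header lines begin with '#'.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchrY\t2781500\t.\tG\tA\t60\tPASS\t.\nchrY\t2781600\t.\tC\tT\t8\tREJECTED\t.\n' > toy.vcf
# Keep the header plus rows whose FILTER field is PASS.
awk -F'\t' '/^#/ || $7 == "PASS"' toy.vcf > toy.pass.vcf
cat toy.pass.vcf
```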
In Closing
The workflow above is the core processing used to produce the CSVs for Haplogroup-R's matrix. While not fully tuned, the results are comparable to similar analyses available elsewhere. Alignment time depends on the type of NGS test being processed, and calling time depends on the total number of samples.
Should you see glaring errors or minor omissions feel free to contact me.