Broad Institute released a new major version of their Swiss Army knife of bio-informatics software known as GATK tools on January 9, 2018. This means it is time to overhaul the Haplogroup-R.org variant discovery workflow to use the latest suite. The key change as far as workflow is the introduction of GenomicsDB, which vastly improves the performance of joint calling.
Software Tools Needed
The requirements for the workflow include a POSIX operating system, a Java 8 run-time environment, and access to typical development tools in some cases. The software packages included in the workflow’s processing include:
GATK 184.108.40.206 (Download here.)
BWA 0.7.17 (Available here.)
BBmap (Download here.)
The software also requires a bundle of reference files to align raw sequencing data. Which reference you choose GRCh37 or GRCh38 is a matter of preference still, but most of the direct to consumer providers have migrated to GRCh38. Previous versions of this workflow have used the 1000 Genome’s bundle, since it provided the ability to use their aligned samples in the analysis set without fuss. However, those starting today may wish to use the bundle provided by the Broad Institute. The following command will clone their hg38 reference to current directory:
wget -m ftp://firstname.lastname@example.org/bundle/hg38
The scripts assume 16GB of RAM is available on the hardware. BWA MEM only requires 5.5 GB for a single thread, so one could run those processes on lesser hardware and trade a much longer processing time. The Java memory options come from GATK’s sample scripts. It is possible less than 16GB is needed for a Big Y BAM, but no effort was spent on tuning these parameters. WGS samples typically need 50 to 120 GB of disc space per intermediate step in the scripts. Having 1TB of free space for a temporary directory is recommended.
What is new?
One of the major changes in GATK’s Best Practice documentation is in the pre-processing of FASTQ. Their recommendation in tutorial #6483 is to produce a clean BAM & BAI to use as input into a variant discovery chain. The process involves generating an unaligned BAM file and marking the sequencing adapters. The adapters are trimmed and the remaining reads are streamed into BWA MEM to locate on the genome. Finally, the alignment meta data is merged back with the original unaligned BAM. The resulting file will have all the original read data and properly adjusted alignment meta data. Read the linked tutorial for much more detail.
In older versions of this work flow a large number of gVCFs were batched together for genotyping. This process could take over 24 hours with the full 1,000 gathered samples. While most project administrators will not be dealing with batches that large, it’s reassuring to see the headroom gained with this optimization.
Script 1a: Convert FASTQ to unaligned BAM
# USAGE: sh fastq_to_sam.sh <fastq1> <fastq2> <sample_name> <read_group> <platform_unit> gatk=~/Genomics/gatk-220.127.116.11/gatk $gatk --java-options "-Xmx8G" FastqToSam \ -FASTQ=$1 \ -FASTQ2=$2 \ -OUTPUT=$3.unaligned.bam \ -READ_GROUP_NAME=$4 \ -SAMPLE_NAME=$3 \ -LIBRARY_NAME=$3 \ -PLATFORM_UNIT=$5 \ -PLATFORM=illumina
Script 1b: Revert BAM to unaligned BAM
# USAGE: sh revert_bam.sh <sample name> # Assumes GATK is on the path. Based on https://gatkforums.broadinstitute.org/gatk/discussion/6484#latest%23top gatk RevertSam \ -I=$1.bam \ -O=$1.unaligned.bam \ -SANITIZE=true \ -MAX_DISCARD_FRACTION=0.005 \ -ATTRIBUTE_TO_CLEAR=XT \ -ATTRIBUTE_TO_CLEAR=XN \ -ATTRIBUTE_TO_CLEAR=AS \ -ATTRIBUTE_TO_CLEAR=OC \ -ATTRIBUTE_TO_CLEAR=OP \ -SORT_ORDER=queryname \ -RESTORE_ORIGINAL_QUALITIES=true \ -REMOVE_DUPLICATE_INFORMATION=true \ -REMOVE_ALIGNMENT_INFORMATION=true
Script 2: Prepare the Clean BAM for an Illumina Sample
# USAGE: sh create_clean_bam.sh <sample name> # Based on https://software.broadinstitute.org/gatk/documentation/article.php?id=6483 # CONFIG VARIABLES: Update to match environment gatk=~/Genomics/gatk-18.104.22.168/gatk reference=~/Genomics/Reference/GRCh38_full_analysis_set_plus_decoy_hla.fa tmp_dir=/Volumes/External/tmp # Mark the Illumina adapters (if present. The sequencing lab should have removed them # prior to delivering the results.) $gatk --java-options "-Xmx8G" MarkIlluminaAdapters \ -I=$1.unmapped.bam \ -O=$1.markilluminaadapters.bam \ -M=$1.markilluminaadapters.metrics.txt \ -TMP_DIR=$tmp_dir # Perform a piped operation that trims the Illumina adapters from the reads, aligns them # to the target reference, and creates a clean BAM for variant discovery. $gatk --java-options "-Xmx8G" SamToFastq \ -I=$1.markilluminaadapters.bam \ -FASTQ=/dev/stdout \ -CLIPPING_ATTRIBUTE=XT -CLIPPING_ACTION=2 -INTERLEAVE=true -NON_PF=true \ -TMP_DIR=$tmp_dir | \ ~/Genomics/bwa/bwa mem -M -t 7 -p $reference /dev/stdin | \ $gatk --java-options "-Xmx16G" MergeBamAlignment \ -ALIGNED_BAM=/dev/stdin \ -UNMAPPED_BAM=$1.unmapped.bam \ -OUTPUT=$1.bwa.clean.bam \ -R=$reference -CREATE_INDEX=true -ADD_MATE_CIGAR=true \ -CLIP_ADAPTERS=false -CLIP_OVERLAPPING_READS=true \ -INCLUDE_SECONDARY_ALIGNMENTS=true -MAX_INSERTIONS_OR_DELETIONS=-1 \ -PRIMARY_ALIGNMENT_STRATEGY=MostDistant -ATTRIBUTES_TO_RETAIN=XS \ -TMP_DIR=$tmp_dir
Once the second script has been applied, the unmapped and adapter marked BAMs are no longer necessary and can be removed. You can always get the unmapped file back by applying the revert_bam.sh script.
Script 3: Recalibrate the read bases and generate a gVCF for the cleaned BAM
# USAGE: sh prepare_gvcf.sh <sample name> # CONFIG VARIABLES: Update to match environmentgatk=~/Genomics/gatk-22.214.171.124/gatk reference=~/Genomics/Reference/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa known=~/Genomics/Reference/GRCh38/Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz $gatk --java-options "-Xmx4G" \ MarkDuplicates -I=$1.bwa.clean.bam -O=$1.dedup.bam -METRICS_FILE=metrics.txt $gatk --java-options "-Xmx8G" \ BaseRecalibrator -R $reference -O recal_data.table -I $1.dedup.bam \ --known-sites $known $gatk --java-options "-Xmx8G" ApplyBQSR \ -R $reference \ -I $1.dedup.bam \ --bqsr-recal-file recal_data.table \ -O $1.recal.bam $gatk --java-options "-Xmx8G" HaplotypeCaller -R $reference \ -L chrY \ -L chrY_KI270740v1_random \ -O $1.b38.g.vcf.gz \ -I $1.recal.bam \ -ERC GVCF
All scripts are available at GitHubGist for convenience and collaboration.
This post outlines the base of haplogroup-r.org‘s new variant discovery workflow using the GATK toolkit. We use BBmap to ensure empty or misnamed reads from the vendor do not adversely impact the alignment or variant calling processes, since GATK doesn’t have equivalent tools well documented at this current time.
Joint genotyping of the gVCFs is not discussed at this time as the GenomicsDB alterations are still in testing. In the mean time you can reference the previous version of this workflow to use batches of gVCFs.
Prior versions of this workflow were able to create a callable loci report. GATK4 lacks the walker tool that generated the BED file. If this is a desired artifact in your analysis, it is recommended to run the GATK3 tool on the recalibrated BAM generated in Script 3. The GATK team has stated a new version of CallableLoci is in the works, so the Gist for Script 3 will be updated at that time.
Future articles will look at evaluating alternative pipelines for alignment and variant calling. Aligners such as Arioc and BarraCUDA offer optimizations to utilize GPU acceleration in the Smith-Waterman and Burrows-Wheeler algorithms. Freebayes offers a Bayesian variant detection implementation written in C++. The model eliminates GATK’s dependence on BaseRecalibrator preprocessing.