My earlier post detailed several scripts to reprocess vendor supplied BAMs using a methodology based on Broad Institute’s Best Practices. FTDNA Big Y customers attempting to follow this process could have several issues present themselves. Each has a solution, but most result in a small percentage of reads being eliminated. Ideally, FTDNA would release your full set of reads instead of filtering to prevent this.
Here are a list of common issues and the work-around needed to get them through the GATK workflow. This post will be a living document that is updated as new issues are brought to light.
Recipe 1: Fix Broken Read Names
If you see:
INFO 2018-06-15 05:06:19 RevertSam Discarded 10069697 out of 10069697 (100.000%) reads in order to sanitize output.
Then do this:
samtools bam2fq 0629-c0fc0db1-8e0e-4548-aa9c-2098bcfc4e3a.bam | gzip > interleaved_reads.fq.gz
bbmap/repair.sh in=interleaved_reads.fq.gz out1=1.fq out2=2.fq outs=singletons.fq repair
What it does:
The root issue here is that FTDNA has discarded part of the read pair name in their processing. We need to start over from the fundamentals by reverting their BAM to an interleaved FASTQ file. Once we have the raw read data we can use Bbmap’s repair module to reapply the missing /1 and /2 portion of the names. The first two output files are suitable for running into fastq_to_sam.sh. The singletons.fq file contains the unpaired reads caused by FTDNA filtering your BAM to only the chrY regions. BWA MEM will not be able to align them as accurately without their mate, and clean BAMs should not contain them under best practices.
Recipe 2: Fix Broken CIGAR Encoding
If you see:
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name SN7001368:263:HALJ0ADXX:1:2214:15631:92476, Read CIGAR M operator maps off end of reference
OR
htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name SN7001368:184:H9CKDADXX:1:2203:6512:94819, No M or N operator between pair of I operators in CIGAR
Then do this:
samtools bam2fq 3198-2f06198e-5c31-4ac1-aacf-ce7cce5cf646.bam | gzip > interleaved_reads.fq.gz
bbmap/repair.sh in=interleaved_reads.fq.gz out1=1.fq out2=2.fq outs=singletons.fq repair
What it does:
The original generation of Big Y BAMs, while containing all of your sequencing data, had an issue with CIGAR encoding in aengine. The reads are being encoded as falling off the reference or incorrectly tagging substitution and/or indel events. The solution is basically the same as before. Revert to the raw FASTQ files. The repair is likely not necessary but the overhead over simply splitting the interleaved file is minimal. Unfortuantely, GATK doesn’t seem to provide a command line option to simply ignore the error as the mapping information is being discarded by RevertSam anyway.