Recently I wrote about a new Excel workbook that I maintain, covering mostly R1b-P312 and downstream samples. The file is currently shared via Dropbox with any researcher interested in these clades. One question I’ve received is how the document might be useful to those who are included in it.
A simple example is a private variant analysis relative to the reference genome. Variants, including SNPs and INDELs, are represented by a cell with a green background and a positive number. The number is simply the number of times the position was read in sequencing. For comparison purposes, the matrix also includes ancestral reads from the other tests. These have a rose-colored background and a negative value.
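For readers who export the matrix and work outside Excel, the sign convention above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_call` is my own, and treating a zero as "no reads at this position" is an assumption rather than something the workbook states in this paragraph.

```python
def classify_call(value):
    """Map a signed read count to a call, following the sign convention above."""
    if value > 0:
        return "derived"    # green cell: variant seen, value = number of reads
    if value < 0:
        return "ancestral"  # rose cell: ancestral reads, stored as a negative value
    return "no call"        # assumption: zero means the position was not read


for count in (12, -7, 0):
    print(count, classify_call(count))
```

Ambiguous (amber) cells are left out on purpose, since they do not reduce to a single signed number.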
In some instances there is ambiguity in what was tested. Ambiguous reads are listed with an amber background and include the number of ancestral and derived reads for each A, C, G, or T value found. Reads coded this way represent sequencing or alignment errors. Interpreting them is beyond the scope of this post.
Step 1: Finding all variants
Step 2: Naive private variant selection
An interesting experiment is then to find someone who is at the same terminal position of the tree. Kits B5163 and 11283 are both R-FGC29071. If we filter 11283’s column to show only rows with a rose-colored background (negative calls), we see that B5163 has 32 variants not shared with 11283. Looking at those rows, though, we see a lot of green in other groups. These variants are either upstream and then back-mutated in 11283, or have arisen independently in the other branches. Either way, they are less than ideal candidates for private variants.
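The naive filter can be expressed as a short Python sketch. The toy matrix below is entirely hypothetical (the position names, read counts, and the `other_kit` column are made up for illustration), and it assumes the signed-count convention described earlier, with zero meaning the position was not read.

```python
# Hypothetical toy matrix: position -> {kit: signed read count}.
# Positive = derived reads, negative = ancestral reads, 0 = not read (assumed).
matrix = {
    "pos_A": {"B5163": 14, "11283": -9, "other_kit": -7},
    "pos_B": {"B5163": 8, "11283": -5, "other_kit": 11},
    "pos_C": {"B5163": 10, "11283": 6, "other_kit": -4},
    "pos_D": {"B5163": 5, "11283": 0, "other_kit": -3},
}


def naive_private(matrix, kit, match):
    """Positions derived in `kit` and read as ancestral in `match`."""
    return [
        pos
        for pos, calls in matrix.items()
        if calls[kit] > 0 and calls[match] < 0
    ]


naive = naive_private(matrix, "B5163", "11283")
print(naive)  # ['pos_A', 'pos_B']
```

Note that `pos_B` passes the filter even though `other_kit` is also derived there. That is the "green in other groups" problem described above.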
Step 3: Refining private variant selection
To improve on this, we can filter toward the same goal using the ‘positive’ column. Remove the color filter from 11283, scroll all the way to the right to find the ‘positive’ column, and select only the value 1. This is a ‘private’ mutation filter. We now see that B5163 has 56 private mutations not shared by anyone else to date.
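As a rough sketch of the same idea in Python, I assume here that the ‘positive’ column holds the number of kits with a derived (positive) call in that row, which is what selecting the value 1 suggests. The helper names and the toy data are hypothetical.

```python
# Hypothetical toy matrix: position -> {kit: signed read count}.
matrix = {
    "pos_A": {"B5163": 14, "11283": -9, "other_kit": -7},
    "pos_B": {"B5163": 8, "11283": -5, "other_kit": 11},
    "pos_C": {"B5163": 10, "11283": 6, "other_kit": -4},
    "pos_D": {"B5163": 5, "11283": 0, "other_kit": -3},
}


def positive_count(calls):
    """Assumed meaning of the 'positive' column: kits with a derived call."""
    return sum(1 for value in calls.values() if value > 0)


def private_to(matrix, kit):
    """Positions where `kit` is the only kit with a derived call."""
    return [
        pos
        for pos, calls in matrix.items()
        if calls[kit] > 0 and positive_count(calls) == 1
    ]


private = private_to(matrix, "B5163")
print(private)  # ['pos_A', 'pos_D']
```

Here `pos_D` counts as private even though 11283 has no reads at that position, so nothing confirms that 11283 is really ancestral there.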
Step 4: Final private variant analysis
There is an interesting observation here, though: 11283 has zeros, meaning no reads, at most of these locations. This is because McCarty was tested using Big Y and I was tested with Full Y, aka Y Elite 1.0. Full Y covers 60% of the Y chromosome and Big Y only about 40%, so we don’t know whether more of those private SNPs are really shared. To account for this, we restore the rose-background filter on 11283. This shows there are only 11 SNPs we can say for sure are private.
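Combining both filters separates the variants we can confirm from those left open by the coverage gap. The sketch below is a hypothetical Python illustration on made-up data, again assuming that a zero means the matching kit had no reads at that position.

```python
# Hypothetical toy matrix: position -> {kit: signed read count}.
matrix = {
    "pos_A": {"B5163": 14, "11283": -9, "other_kit": -7},
    "pos_B": {"B5163": 8, "11283": -5, "other_kit": 11},
    "pos_C": {"B5163": 10, "11283": 6, "other_kit": -4},
    "pos_D": {"B5163": 5, "11283": 0, "other_kit": -3},
}


def split_private(matrix, kit, match):
    """Split variants private to `kit` by whether `match` was actually read."""
    confirmed, unresolved = [], []
    for pos, calls in matrix.items():
        derived_kits = [k for k, value in calls.items() if value > 0]
        if derived_kits != [kit]:
            continue  # not derived in `kit`, or derived in another kit too
        if calls[match] < 0:
            confirmed.append(pos)   # `match` has ancestral reads here
        else:
            unresolved.append(pos)  # `match` has no reads (zero) here
    return confirmed, unresolved


confirmed, unresolved = split_private(matrix, "B5163", "11283")
print(confirmed)   # ['pos_A']
print(unresolved)  # ['pos_D']
```

The unresolved list corresponds to the rows that drop out when the rose-background filter is restored: they may be private, or they may be shared but simply not covered by the other test.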
That concludes how you can use the R-P312 Combo BAM Matrix to perform your own private variant analysis with Microsoft Excel. The steps use simple auto-filters. These narrow down the list of over 50,000 reported locations to just a small handful in most cases. Future articles will offer suggestions on what additional types of information can be learned using the data set.