Review: Full Genomes Corp third party analysis of Veritas Genetics raw WGS data

In this post, I will provide my review of Full Genomes Corp‘s service offering third party analysis of raw data produced by Veritas Genetics‘ $999 whole genome sequencing (Veritas myGenome). After I released my raw genome data to the public domain, FGC contacted me and offered to run my WGS data through their BAM processing pipeline at no cost. I naturally accepted and agreed to write a review.

This service from FGC includes three categories of analysis: mtDNA, YDNA, and autosomal ancestry. As of now, I have received my mtDNA and YDNA results; the autosomal analysis takes longer to produce and I will leave it out of scope for this review.

Getting Started

After creating an account on the FGC site, I needed to provide them with access to the BAM file that Veritas Genetics produced. My participation in the Personal Genome Project made this easy as I only had to give them the URL to my BAM file on the PGP public data repository.

A little bit more than two weeks later I received email reporting that I had results ready. When I logged back in to FGC a prominent link provided access to download all of my results in a single zip archive. This zip archive contained a readme file directing me to two PDF documents with further information: one focused on extracting private SNPs from YDNA results and the second describing the individual data files FGC returns, which I will get to below.

Mitochondrial DNA results

I have already had my full mitochondrial DNA sequenced by FamilyTreeDNA, so I did not expect to learn anything new from FGC’s data analysis, which produced two files. The first file contains a list of variants found in my mtDNA with respect to the Yoruba reference sequence by position. The second file contains my full mtDNA sequence in FASTA format.

The FASTA file took me by surprise, as they indicated a heteroplasmic length variant that FamilyTreeDNA had not come across (or had not informed me of) in their Sanger sequencing. FGC found a deletion at position 310, the loss of a T flanked by C repeats on both sides. I do not know if this information will turn out relevant for me, but who knows, I prefer to have it.
[ADDED 20170306: I should have updated this sooner. I contacted the FGC team shortly after receiving my results to ask for more information about this reported heteroplasmy. After reviewing my data in more detail, FGC determined that based on the reads in my BAM file, my mitochondrial DNA does not show any heteroplasmy, and this errant result should not have appeared in my report.]

YDNA results

FGC grouped my YDNA results into two folders: YSTR and YSNP.

YSTR

YSTR results consisted of two output files generated from lobSTR. The first file contains roughly 3000 lines of data reporting identified YSTRs according to NIST/lobSTR standards, with some additional markers FGC has added to lobSTR.

The second file contains a subset of the first file including only those YSTR markers which FamilyTreeDNA tests and reports, counted according to FamilyTreeDNA’s standards. Mine reported values for 95 FTDNA-style markers.

Prior to whole genome sequencing I had only FTDNA’s 67 marker YSTR results combined with 23andMe‘s v3 chip Y SNPs with which to determine my YDNA haplogroup, giving nothing more specific than the huge R1b M269 group. I have not yet found my YSTR results from FGC particularly useful as not very many males from my line appear to have taken YDNA testing, so I do not have many data points to compare to.  I do have several close matches on FTDNA’s 67 marker test sharing variants of my surname which have convinced me that I don’t need to consider non paternity events along my direct male line going back at least 400 years based on the known years when Paradis YDNA arrived in Canada from France.

Once more Paradis-descended men take YDNA tests like the Veritas myGenome, FGC Y-Elite, FTDNA Big Y or others, I expect this data to have more value in tracing drift across this line.

YSNP

YSNP results consisted of five separate files. Two described as variant discovery reports, two as variant genotyping, and one haplogroup classification report containing output from yKnot that identifies my sample’s place in the ISOGG tree.

Haplogroup Classification

I have provided below a portion of my yKnot file showing the placement of my YDNA on the ISOGG tree back to the R1b M343+ branch. For the moment, I sit on the S1217+/Z295+ branch (ISOGG, Big Tree). I do not match any subclades of S1217+/Z295+ yet identified, but I will follow developments in this area, and, having my genome already sequenced, can place myself on future revised trees without the need for any further SNP testing.

*Extras: Z1518+, Y4010+, 50f2(P)+, Z14907+, PH3244*, Y2550+, P80+, CTS1789+, CTS12019+, L1228+, M3629+, Z3327+, Z28+, FGC5628+, CTS12440+, PF2372+, M162_1*, FGC5085+, Z13028+, P266+, Z12253+, L798+, DYS257_2+, Z28771*, P27.2_2+, Y2252+, CTS616+, CTS2646*, M118+, M236+, Y2754+, FGC20667*, M141+, L665+, L588+, Z14350+, P34_5+, Z6859+, Z889+, Z13537*, Z6171+, Z1237+, FGC756+, BY451+,     P19_1*, P79*, PF2276+, Z16986+, M5220+, FGC1920+, Z12467+, Z1842+, V161.1+, V190+, CTS6911+, CTS2518+, FGC4872+, Y5185*, Y2986+, Z1101+, CTS32+, Z15165+, IMS-JST022457+, PF2779+, S730+, S504+, Z836*, Z14050+, IMS-JST029149+, M1994*, L990+, P198+, Z16208+, PF3126+, Z2182*
R1b1a2a1a2a1a1a
|Matches: S1217+, Z295+
|____R1b1a2a1a2a1a1
     |Matches: S230+, Z209+, S356+, Z220+
     |____R1b1a2a1a2a1a
          |Matches: Z272+
          |*No-calls: Z274?, S229?
          |____R1b1a2a1a2a1
               |Matches: Z195+, S227+
               |*No-calls: S355?, Z196?
               |____R1b1a2a1a2a
                    |Matches: DF27+, S250+
                    |____R1b1a2a1a2
                         |Matches: P312+, PF6547+, S116+
                         |____R1b1a2a1a
                              |Matches: L151+, PF6542+, L52+, PF6541+, P310+, PF6546+, S129+, P311+, PF6545+, S128+, PF6539+
                              |*No-calls: (being investigated as to placement: L11?, S127)?
                              |____R1b1a2a1
                                   |Matches: L51+, M412+, PF6536+, S167+
                                   |____R1b1a2a
                                        |Matches: L23+, PF6534+, S141+, L49.1+, S349.1+
                                        |____R1b1a2
                                             |Matches: M269+, CTS623+, CTS2664+, PF6454+, CTS3575+, PF6457+, CTS8728+, L1063+, PF6480+, S13+, CTS12478+, PF6529+, F1794+, PF6455+, L265+, PF6431+, L407+, PF6252+, L478+, PF6403+, L482+, PF6427+, L483+, L500+,   PF6481+, L773+, PF6421+, YSC0000276+, L1353+, PF6489+, YSC0000294+, M520+, PF6410+, PF6399+, S10+, PF6404+, PF6505+, YSC0000225+,   PF6409+, PF6411+, PF6425+, PF6430+, PF6432+, PF6434+, PF6438+, PF6475+, S17+, YSC0000269+, PF6482+, YSC0000203+, PF6485+, S3+, PF6494+, PF6495+, PF6497+, YSC0000219+, PF6500+, PF6507+, PF6509+, L150.1+, PF6274.1+, S351.1+
                                             |*No-calls: PF6443?
                                             |**Mismatches: CTS8591- (exp. +), CTS8665- (exp. +), FGC464- (exp. +), CTS10834- (exp. +), CTS11468- (exp. +), FGC49- (exp. +)
                                             |____R1b1a
                                                  |Matches: P297+, PF6398+, L320+
                                                  |____R1b1
                                                       |Matches: P25_3+, L278+, M415+, PF6251+
                                                       |**Mismatches: P25_1- (exp. +), P25_2- (exp. +)
                                                       |____R1b
                                                            |Matches: M343+, PF6242+

Variant Genotyping

The first variant genotyping file provides my results at a little over 54,000 known SNPs. The second variant genotyping file provides results for an additional 16,600 SNPs. The results provided include counts of each base called at the SNP position as identified in my BAM file data, the SNP position on the chromosome, and the build 37 reference sequence call at that position. I do not know the criteria used to place each SNP in each file. I consider these files more as an intermediate step in the data analysis, used to generate the other returned files, but I expect I will find some more direct use for them as well.

Variant Discovery

The two variant discovery reports provide the most detailed and useful information in my opinion, as they include quality rankings on variants as well as the specific details of variants such as SNPs and INDELs. Even more usefully, these files contain the results for the kits most similar to mine within FGC’s database, which can help in identifying private variants that originated in much more recent genealogical times. Because these files include data from others as well as my own, I cannot comfortably release them to the general public without redacting other individuals’ data. For public facing purposes if someone wanted to run comparisons against my detailed data I would most likely refer them to the Big Tree (if R1b) or advise that they pursue their own analysis with FGC directly. The how-to document FGC provides with this analysis (Reading the Full Genomes analysis reports) explains working with this data much better than I could in my own words. The inclusion of quality scores greatly simplifies the process of narrowing down on key SNPs, and I look forward to spending more time with this data — probably after more Paradis males have had next generation YDNA sequencing as my results appear rather distant from the nearest matching males in any database except for the one Paradis I’ve found with a Big-Y at FTDNA.

Data Sharing

It pleased me to see that FGC offers a very quick and easy method to share your results with any email address you provide. I took advantage of this to share my data with Alex Williamson for inclusion in the Big Tree to aid in reconstructing the phylogeny of the R1b tree under R P312. For now, my Big Tree entry sits in the R-Z295/S1217 paragroup, awaiting more submissions sharing SNPs with me to help identify a terminal SNP more recent than the estimated 3900 year old Z295. I don’t match any SNPs identified as downstream of Z295 on the FTDNA tree, the ISOGG tree, or the YFull tree. I encourage any other Z295 or Paradis/Pardy/Paradee/etc male to get your YDNA analyzed and shared with these projects so we can better place ourselves on the tree.

More Info

If this has interested you, I highly recommend you take a look at another review and description of FGC’s analysis.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s