
Review: Full Genomes Corp third party analysis of Veritas Genetics raw WGS data

In this post, I will provide my review of Full Genomes Corp’s service offering third party analysis of raw data produced by Veritas Genetics’ $999 whole genome sequencing (Veritas myGenome). After I released my raw genome data to the public domain, FGC contacted me and offered to run my WGS data through their BAM processing pipeline at no cost. I naturally accepted and agreed to write a review.

This service from FGC includes three categories of analysis: mtDNA, YDNA, and autosomal ancestry. As of now, I have received my mtDNA and YDNA results; the autosomal analysis takes longer to produce and I will leave it out of scope for this review.

Getting Started

After creating an account on the FGC site, I needed to provide them with access to the BAM file that Veritas Genetics produced. My participation in the Personal Genome Project made this easy as I only had to give them the URL to my BAM file on the PGP public data repository.

A little more than two weeks later, I received an email reporting that I had results ready. When I logged back in to FGC, a prominent link provided access to download all of my results in a single zip archive. This zip archive contained a readme file directing me to two PDF documents with further information: one focused on extracting private SNPs from YDNA results and the second describing the individual data files FGC returns, which I will get to below.

Mitochondrial DNA results

I have already had my full mitochondrial DNA sequenced by FamilyTreeDNA, so I did not expect to learn anything new from FGC’s data analysis, which produced two files. The first file contains a list of variants found in my mtDNA with respect to the Yoruba reference sequence by position. The second file contains my full mtDNA sequence in FASTA format.

The FASTA file took me by surprise, as it indicated a heteroplasmic length variant that FamilyTreeDNA had not come across (or had not informed me of) in their Sanger sequencing. FGC found a deletion at position 310, the loss of a T flanked by C repeats on both sides. I do not know if this information will turn out relevant for me, but I prefer to have it.
[ADDED 20170306: I should have updated this sooner. I contacted the FGC team shortly after receiving my results to ask for more information about this reported heteroplasmy. After reviewing my data in more detail, FGC determined that based on the reads in my BAM file, my mitochondrial DNA does not show any heteroplasmy, and this errant result should not have appeared in my report.]

YDNA results

FGC grouped my YDNA results into two folders: YSTR and YSNP.


YSTR results consisted of two output files generated from lobSTR. The first file contains roughly 3000 lines of data reporting identified YSTRs according to NIST/lobSTR standards, with some additional markers FGC has added to lobSTR.

The second file contains a subset of the first file including only those YSTR markers which FamilyTreeDNA tests and reports, counted according to FamilyTreeDNA’s standards. Mine reported values for 95 FTDNA-style markers.

Prior to whole genome sequencing I had only FTDNA’s 67 marker YSTR results combined with 23andMe’s v3 chip Y SNPs with which to determine my YDNA haplogroup, giving nothing more specific than the huge R1b M269 group. I have not yet found my YSTR results from FGC particularly useful: not very many males from my line appear to have taken YDNA testing, so I do not have many data points to compare to. I do have several close matches on FTDNA’s 67 marker test who share variants of my surname. Given the known years when Paradis YDNA arrived in Canada from France, these matches have convinced me that I don’t need to consider non-paternity events along my direct male line going back at least 400 years.

Once more Paradis-descended men take YDNA tests like the Veritas myGenome, FGC Y-Elite, FTDNA Big Y or others, I expect this data to have more value in tracing drift across this line.
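To make the idea of comparing YSTR results concrete, here is a minimal sketch of a step-wise distance between two haplotypes. It assumes each haplotype is a simple mapping of marker name to repeat count; the function name `str_distance` and the marker values are made up for illustration, and the simple sum of differences shown here is not necessarily how FTDNA or FGC compute genetic distance (multi-copy markers in particular need more careful handling).

```python
def str_distance(a, b):
    """Sum of absolute repeat-count differences over markers present in both haplotypes.

    Returns (distance, number_of_markers_compared).
    """
    shared = sorted(set(a) & set(b))
    distance = sum(abs(a[m] - b[m]) for m in shared)
    return distance, len(shared)


# Illustrative values only, not real results from any kit.
kit_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
kit_b = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10, "DYS385a": 11}

distance, compared = str_distance(kit_a, kit_b)
print(f"{distance} step(s) across {compared} shared markers")
```

Markers tested in only one kit are ignored, so the count of compared markers matters as much as the distance itself.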


YSNP results consisted of five separate files: two described as variant discovery reports, two as variant genotyping reports, and one haplogroup classification report containing output from yKnot that identifies my sample’s place in the ISOGG tree.

Haplogroup Classification

I have provided below a portion of my yKnot file showing the placement of my YDNA on the ISOGG tree back to the R1b M343+ branch. For the moment, I sit on the S1217+/Z295+ branch (ISOGG, Big Tree). I do not match any subclades of S1217+/Z295+ yet identified, but I will follow developments in this area, and, having my genome already sequenced, can place myself on future revised trees without the need for any further SNP testing.

*Extras: Z1518+, Y4010+, 50f2(P)+, Z14907+, PH3244*, Y2550+, P80+, CTS1789+, CTS12019+, L1228+, M3629+, Z3327+, Z28+, FGC5628+, CTS12440+, PF2372+, M162_1*, FGC5085+, Z13028+, P266+, Z12253+, L798+, DYS257_2+, Z28771*, P27.2_2+, Y2252+, CTS616+, CTS2646*, M118+, M236+, Y2754+, FGC20667*, M141+, L665+, L588+, Z14350+, P34_5+, Z6859+, Z889+, Z13537*, Z6171+, Z1237+, FGC756+, BY451+,     P19_1*, P79*, PF2276+, Z16986+, M5220+, FGC1920+, Z12467+, Z1842+, V161.1+, V190+, CTS6911+, CTS2518+, FGC4872+, Y5185*, Y2986+, Z1101+, CTS32+, Z15165+, IMS-JST022457+, PF2779+, S730+, S504+, Z836*, Z14050+, IMS-JST029149+, M1994*, L990+, P198+, Z16208+, PF3126+, Z2182*
|Matches: S1217+, Z295+
     |Matches: S230+, Z209+, S356+, Z220+
          |Matches: Z272+
          |*No-calls: Z274?, S229?
               |Matches: Z195+, S227+
               |*No-calls: S355?, Z196?
                    |Matches: DF27+, S250+
                         |Matches: P312+, PF6547+, S116+
                              |Matches: L151+, PF6542+, L52+, PF6541+, P310+, PF6546+, S129+, P311+, PF6545+, S128+, PF6539+
                              |*No-calls: (being investigated as to placement: L11?, S127)?
                                   |Matches: L51+, M412+, PF6536+, S167+
                                        |Matches: L23+, PF6534+, S141+, L49.1+, S349.1+
                                             |Matches: M269+, CTS623+, CTS2664+, PF6454+, CTS3575+, PF6457+, CTS8728+, L1063+, PF6480+, S13+, CTS12478+, PF6529+, F1794+, PF6455+, L265+, PF6431+, L407+, PF6252+, L478+, PF6403+, L482+, PF6427+, L483+, L500+,   PF6481+, L773+, PF6421+, YSC0000276+, L1353+, PF6489+, YSC0000294+, M520+, PF6410+, PF6399+, S10+, PF6404+, PF6505+, YSC0000225+,   PF6409+, PF6411+, PF6425+, PF6430+, PF6432+, PF6434+, PF6438+, PF6475+, S17+, YSC0000269+, PF6482+, YSC0000203+, PF6485+, S3+, PF6494+, PF6495+, PF6497+, YSC0000219+, PF6500+, PF6507+, PF6509+, L150.1+, PF6274.1+, S351.1+
                                             |*No-calls: PF6443?
                                             |**Mismatches: CTS8591- (exp. +), CTS8665- (exp. +), FGC464- (exp. +), CTS10834- (exp. +), CTS11468- (exp. +), FGC49- (exp. +)
                                                  |Matches: P297+, PF6398+, L320+
                                                       |Matches: P25_3+, L278+, M415+, PF6251+
                                                       |**Mismatches: P25_1- (exp. +), P25_2- (exp. +)
                                                            |Matches: M343+, PF6242+
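
A listing like the one above can be turned into structured data with a few lines of code. The sketch below is based only on the layout visible in this excerpt (leading spaces, an optional `|`, optional asterisks, a label, a colon, then a comma-separated list); it is not an official parser for yKnot output, and the names `parse_report` and `lineage` are hypothetical.

```python
import re

# indent, optional "|", optional "*" or "**", label, ":", comma-separated entries
LINE_RE = re.compile(r"^(?P<indent>\s*)\|?\**(?P<label>[A-Za-z-]+):\s*(?P<body>.*)$")


def parse_report(text):
    """Return a list of (indent_width, label, entries) tuples, one per recognised line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that does not look like a report line
        entries = [s.strip() for s in m.group("body").split(",") if s.strip()]
        rows.append((len(m.group("indent")), m.group("label"), entries))
    return rows


def lineage(rows):
    """First matched SNP at each level, in listing order (most specific branch first)."""
    return [entries[0].rstrip("+") for _, label, entries in rows if label == "Matches"]


SAMPLE = """\
|Matches: S1217+, Z295+
     |Matches: S230+, Z209+, S356+, Z220+
          |Matches: Z272+
          |*No-calls: Z274?, S229?
               |Matches: Z195+, S227+
"""

rows = parse_report(SAMPLE)
print(" > ".join(lineage(rows)))
```

In this excerpt deeper indentation means an older branch, so the printed chain runs from the most recent matched branch back toward the root.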

Variant Genotyping

The first variant genotyping file provides my results at a little over 54,000 known SNPs. The second variant genotyping file provides results for an additional 16,600 SNPs. The results provided include counts of each base called at the SNP position as identified in my BAM file data, the SNP position on the chromosome, and the build 37 reference sequence call at that position. I do not know the criteria used to place each SNP in each file. I consider these files more as an intermediate step in the data analysis, used to generate the other returned files, but I expect I will find some more direct use for them as well.
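As a rough illustration of how per-base read counts become a call at a SNP position, here is a minimal sketch. The input layout (a dict of counts per base plus the reference base) and the thresholds `min_depth` and `min_fraction` are assumptions chosen for the example; I do not know the criteria FGC actually applies.

```python
def call_base(counts, min_depth=4, min_fraction=0.9):
    """Return the dominant base, or None when coverage or agreement is too low."""
    depth = sum(counts.values())
    if depth < min_depth:
        return None
    base, n = max(counts.items(), key=lambda kv: kv[1])
    if n / depth < min_fraction:
        return None
    return base


def describe(counts, ref_base):
    """Summarise a position as 'no-call', 'reference', or 'variant REF>ALT'."""
    base = call_base(counts)
    if base is None:
        return "no-call"
    return "reference" if base == ref_base else f"variant {ref_base}>{base}"


# Illustrative read counts, not taken from the actual report files.
print(describe({"A": 0, "C": 31, "G": 0, "T": 1}, "T"))
print(describe({"A": 2, "C": 0, "G": 0, "T": 0}, "A"))
print(describe({"A": 0, "C": 0, "G": 12, "T": 0}, "G"))
```

Requiring near-unanimous reads suits the Y chromosome, where a male carries a single copy and a clean call should show essentially one base.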

Variant Discovery

The two variant discovery reports provide the most detailed and useful information in my opinion, as they include quality rankings on variants as well as the specific details of variants such as SNPs and INDELs. Even more usefully, these files contain the results for the kits most similar to mine within FGC’s database, which can help in identifying private variants that originated in much more recent genealogical times. Because these files include data from others as well as my own, I cannot comfortably release them to the general public without redacting other individuals’ data. For public-facing purposes, if someone wanted to run comparisons against my detailed data, I would most likely refer them to the Big Tree (if R1b) or advise that they pursue their own analysis with FGC directly. The how-to document FGC provides with this analysis (Reading the Full Genomes analysis reports) explains working with this data much better than I could in my own words. The inclusion of quality scores greatly simplifies the process of narrowing in on key SNPs, and I look forward to spending more time with this data, probably after more Paradis males have had next generation YDNA sequencing, as my results appear rather distant from the nearest matching males in any database except for the one Paradis I’ve found with a Big-Y at FTDNA.
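The logic of hunting for private variants can be sketched in a few lines: keep the variants that pass a quality cut-off and that none of the nearest kits share. Everything about the data layout here is an assumption for illustration (a numeric quality score where higher is better, variant names as plain strings, and the function name `private_variants`); FGC's how-to document describes the real report columns.

```python
def private_variants(mine, others, min_quality=2):
    """Variants of sufficient quality in `mine` that appear in none of the `others` kits.

    mine:   dict mapping variant name -> quality score (higher is better, assumed)
    others: dict mapping kit id -> set of variant names seen in that kit
    """
    seen_elsewhere = set().union(*others.values())
    return sorted(
        name
        for name, quality in mine.items()
        if quality >= min_quality and name not in seen_elsewhere
    )


# Placeholder names and scores, not real variants.
my_kit = {"var_a": 3, "var_b": 3, "var_c": 1, "var_d": 2}
nearest_kits = {"kit_1": {"var_a"}, "kit_2": {"var_a", "var_x"}}

print(private_variants(my_kit, nearest_kits))
```

As more closely related men test, variants drop out of the private list and become shared branch markers instead.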

Data Sharing

It pleased me to see that FGC offers a very quick and easy method to share your results with any email address you provide. I took advantage of this to share my data with Alex Williamson for inclusion in the Big Tree to aid in reconstructing the phylogeny of the R1b tree under R-P312. For now, my Big Tree entry sits in the R-Z295/S1217 paragroup, awaiting more submissions sharing SNPs with me to help identify a terminal SNP more recent than the estimated 3,900-year-old Z295. I don’t match any SNPs identified as downstream of Z295 on the FTDNA tree, the ISOGG tree, or the YFull tree. I encourage any other Z295 or Paradis/Pardy/Paradee/etc male to get their YDNA analyzed and shared with these projects so we can better place ourselves on the tree.

More Info

If this has interested you, I highly recommend you take a look at another review and description of FGC’s analysis.

Take my $1000 genome, please!

I have just released my whole genome sequence (WGS) to the public domain (CC0, no rights reserved), via the Harvard Personal Genome Project (PGP). I believe that my data represents both the first $1000 genome-with-analysis ever performed as well as the first $1000 genome released for public use. Thank you to both the PGP and to Veritas Genetics for making this possible. I would like to specifically thank Mirza Cifric, CEO of Veritas Genetics and also Christen Hart of Veritas for acting as my liaison and dealing with my frequent email requests for status updates. From my PGP profile page you can download my genome data (as a BAM file (17.8GB) or in VCF format (383MB)), as well as my 23andMe (v3, pre-FDA letter) SNP chip data and my full mitochondrial DNA sequence as tested by FamilyTreeDNA (since deposited in GenBank as accession ID KU530226).

Why would I do this?

Put simply, I wanted to make a contribution to science. Further, since working for a genomic drug development company in the 2000s where I met, then married, a bioinformatician, I’ve had an interest in the potential applications of genomics, from what some then referred to as the “pharmaceutically tractable genome” to today’s “precision medicine”. That employer spun off an early DNA sequencing platform (454 Life Sciences pyrosequencing, the first company to complete and make public an individual human genome), and I find it fitting that an ex-employee, and one from the IT staff, not even the scientific team, would release the first public $1000 genome.

I would like to see science make some good use of my genetic data. Only a relatively small number of whole genome sequences available for scientific research without privacy or intellectual property encumbrances exist. As a participant in the PGP, by making my genome available I hope not only to directly support scientific research but to aid the PGP’s other research goal to identify the risk and consequences of having one’s genetic data available to the public without any effort at de-identification or obfuscation. I have the benefit of living in one of the few states with genetic information laws that exceed the US Federal Genetic Information Nondiscrimination Act in placing restrictions on life insurance providers and others.

After my first blood labs with my current primary care doctor, she told me that I had the absolute worst blood levels of vitamin D that she had ever seen, along with the best HDL/LDL cholesterol levels she had seen. This comes from a genetic basis, not anything that I have pursued through diet or lifestyle. In fact my cholesterol should be, frankly, terrible, and though I live only a few miles south of the 45th parallel I get enough sun that lack of exposure can’t account for my vitamin D levels alone. My 23andMe data, when run through Promethease, reveals a train wreck throughout the vitamin D pathway, as well as matching many variants known to increase HDL cholesterol. With my whole genome sequence released for any imaginable use, I hope that researchers can either spot something unique enough on its own or work my data into genome wide association studies (GWAS) to tease out some drug targets or relevant alleles.

As a PGP participant I have filled out the PGP’s phenotype surveys to help associate phenotypes with my genotype. I have done the same at OpenHumans and remain willing to provide further phenotype data on request. I will attend the GET Conference and GET Labs 2016 at the end of April and get signed up with some other research studies.

You can also find my autosomal SNP chip data on GEDMatch as kit M205442, my YDNA data at ysearch under id CZVXU, and my full mitochondrial DNA sequence in GenBank as KU530226. Services report my mtDNA haplogroup as U2e1*, but I hope the next build of PhyloTree will note the mtDNA SNPs I carry extraneous to U2e1 and define a new haplogroup: with my deposition, several mtDNA sequence motifs now have three independent depositions, enough to justify naming a new U2e1* branch. I have much of my genealogy traced several generations back and several apparent triangulation groups worth of matches. Genealogy traces my surname back to the Paradis in Quebec but hits a brick wall in the mid 1800s, though my YDNA 67-STR results at FTDNA show close matches with other tested Paradis males who have traceable lineages back to Pierre Paradis of Mortagne-au-Perche, France (d. 1675), apparent patriarch of new world Paradis/Pardy lines. Several of my lines go back to early US colonials (Trowbridge provides my nexus to Charlemagne, though I’ve found no Mayflower descendants), as well as mixed ancestry (French/German/more) Creoles along the German Coast in Louisiana. I also have a bit of direct Scottish (Halcro) ancestry along with other Scots-Irish.

How can a security and privacy aware individual choose to release this data?

For me, the recognition that sequencing continues to fall in price and will eventually become ubiquitous to the point of banality, coupled with the fact that we shed DNA all day long, convinces me that any genetic privacy we may believe we have now exists only for a disappearing moment in history and only in the absence of a determined adversary willing to put some effort into collection. Setting aside the issue of disclosing one’s unique genetic signature to third parties, simply knowing what secrets sit in one’s own DNA empowers some individuals but makes others uneasy. Some people do not want to know if their genetics give them a high probability of Alzheimer’s, or a disposition to cancer. Some regulators believe they cannot trust the public to make responsible decisions once given knowledge of the forbidden fruit in their genetic code. Because science does not yet know enough about the complex interactions of all parts of the genome to determine the exact medical significance of every gene or non-gene variant, the interpretation of your static genome can and will change with the ongoing discovery of new genetic associations and with failures to replicate previously reported associations. By donating my sequence to an unencumbered public dataset I hope to help speed up this process and embolden others to take this step to share for science, with eyes wide open as to the limitations of data de-identification and possibilities of personalized medicine. Whether you share your genome through the PGP, your microbiome through uBiome, the next virus you catch through GoViral, your FitBit data through OpenHumans, your direct to consumer SNP chip results through OpenSNP, or any other data through any other platform, each of us has a unique chance to contribute to research to better lives today and our species tomorrow.

What does whole genome sequencing give a non-expert that SNP genotyping doesn’t?

Several years ago I took 23andMe’s genotyping test. As this occurred prior to the FDA sending 23andMe a nastygram barring them from reporting health-relevant results, I received a decent amount of information relevant to health issues. So why bother having a whole genome sequence done? To put it simply, a WGS has more long-term value than a genotyping SNP chip. As 23andMe V2 customers discovered, as time moves on and science learns more about genetic variants, and as new builds of the human genome get released, SNP results based on older data lose their relevance. New genome scaffolds render obsolete what we believed we knew about older SNPs. New SNPs get discovered with more meaningful disease associations than those believed to associate with diseases years ago during chip design. With my whole genome sequence in my pocket, I have better positioning for the future as I can look up newly-reported variants going forward whether or not the designer of the probes on a SNP chip foresaw the relevance of that genetic region. If I develop cancer in the future, I or my medical providers can compare the sequence of a tumor cell to my genome sequence, easing the process of identifying genes that may have gone haywire and caused cancer, and potentially informing the selection of anti-cancer drugs that could save my life. Further, by ordering and releasing my whole genome sequence, I enable scientists working with public datasets to perform more useful analyses than they could with my SNP chip data alone.
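Looking up a newly reported variant in a whole genome VCF file can be as simple as scanning for its chromosome and position. The sketch below assumes an uncompressed, tab-separated VCF with the standard leading columns (CHROM, POS, ID, REF, ALT); the function name `find_variant` and the toy records are made up for illustration. Note that a missing record does not by itself prove a reference genotype, since many VCF files list only variant sites and say nothing about coverage elsewhere.

```python
import io


def find_variant(vcf_lines, chrom, pos):
    """Return the first record at chrom:pos as a dict, or None if no record is listed."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete data line
        if fields[0] == chrom and int(fields[1]) == pos:
            return {"id": fields[2], "ref": fields[3], "alt": fields[4].split(",")}
    return None


# Toy records for illustration only.
TOY_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n"
    "2\t2500\trs_example\tC\tT,G\t99\tPASS\t.\n"
)

print(find_variant(io.StringIO(TOY_VCF), "2", 2500))
print(find_variant(io.StringIO(TOY_VCF), "1", 999))
```

For a real file of several hundred megabytes, an indexed lookup with a dedicated tool would be faster than a linear scan, but the principle is the same.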

Go use my data!


Mike Cariaso has graciously run Promethease against my WGS data. Results here. Unfortunately Promethease results expire after a number of days, rendering this report now inaccessible.