Tag Archives: Genealogy

Review: Full Genomes Corp third party analysis of Veritas Genetics raw WGS data

In this post, I will provide my review of Full Genomes Corp’s service offering third party analysis of raw data produced by Veritas Genetics’ $999 whole genome sequencing (Veritas myGenome). After I released my raw genome data to the public domain, FGC contacted me and offered to run my WGS data through their BAM processing pipeline at no cost. I naturally accepted and agreed to write a review.

This service from FGC includes three categories of analysis: mtDNA, YDNA, and autosomal ancestry. As of now, I have received my mtDNA and YDNA results; the autosomal analysis takes longer to produce and I will leave it out of scope for this review.

Getting Started

After creating an account on the FGC site, I needed to provide them with access to the BAM file that Veritas Genetics produced. My participation in the Personal Genome Project made this easy as I only had to give them the URL to my BAM file on the PGP public data repository.

A little more than two weeks later I received an email reporting that I had results ready. When I logged back in to FGC, a prominent link provided access to download all of my results in a single zip archive. This zip archive contained a readme file directing me to two PDF documents with further information: one focused on extracting private SNPs from YDNA results and the second describing the individual data files FGC returns, which I will get to below.

Mitochondrial DNA results

I have already had my full mitochondrial DNA sequenced by FamilyTreeDNA, so I did not expect to learn anything new from FGC’s data analysis, which produced two files. The first file contains a list of variants found in my mtDNA with respect to the Yoruba reference sequence by position. The second file contains my full mtDNA sequence in FASTA format.

The FASTA file took me by surprise, as it indicated a heteroplasmic length variant that FamilyTreeDNA had not come across (or had not informed me of) in their Sanger sequencing. FGC found a deletion at position 310, the loss of a T flanked by C repeats on both sides. I do not know if this information will turn out relevant for me, but who knows, I prefer to have it.
[ADDED 20170306: I should have updated this sooner. I contacted the FGC team shortly after receiving my results to ask for more information about this reported heteroplasmy. After reviewing my data in more detail, FGC determined that based on the reads in my BAM file, my mitochondrial DNA does not show any heteroplasmy, and this errant result should not have appeared in my report.]

YDNA results

FGC grouped my YDNA results into two folders: YSTR and YSNP.

YSTR

YSTR results consisted of two output files generated from lobSTR. The first file contains roughly 3000 lines of data reporting identified YSTRs according to NIST/lobSTR standards, with some additional markers FGC has added to lobSTR.

The second file contains a subset of the first file including only those YSTR markers which FamilyTreeDNA tests and reports, counted according to FamilyTreeDNA’s standards. Mine reported values for 95 FTDNA-style markers.

Prior to whole genome sequencing I had only FTDNA’s 67 marker YSTR results combined with 23andMe’s v3 chip Y SNPs with which to determine my YDNA haplogroup, giving nothing more specific than the huge R1b M269 group. I have not yet found my YSTR results from FGC particularly useful, as not very many males from my line appear to have taken YDNA testing, so I do not have many data points to compare to. I do have several close matches on FTDNA’s 67 marker test sharing variants of my surname. These have convinced me that I don’t need to consider non-paternity events along my direct male line going back at least 400 years, based on the known years when Paradis YDNA arrived in Canada from France.

Once more Paradis-descended men take YDNA tests like the Veritas myGenome, FGC Y-Elite, FTDNA Big Y or others, I expect this data to have more value in tracing drift across this line.
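
As a rough illustration of how YSTR results from two men can be compared, here is a minimal Python sketch of a simple stepwise genetic distance: the sum of repeat-count differences over the markers both men have tested. The marker names are common YSTR markers, but the repeat counts are made up for the example, and the function is my own illustration rather than anything FGC or FTDNA provides. Testing companies may treat some markers (multi-copy ones in particular) differently, so expect their reported distances to differ from this naive count.

```python
from typing import Dict


def genetic_distance(a: Dict[str, int], b: Dict[str, int]) -> int:
    """Sum of absolute repeat-count differences over markers present in both kits."""
    shared = a.keys() & b.keys()
    return sum(abs(a[marker] - b[marker]) for marker in shared)


# Hypothetical repeat counts, for illustration only (not real results).
kit_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
kit_b = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 10, "DYS385a": 11}

print(genetic_distance(kit_a, kit_b))  # 2
```

Markers tested in only one kit (DYS385a here) are ignored, which is why comparing kits with different marker panels needs some care.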

YSNP

YSNP results consisted of five separate files: two described as variant discovery reports, two as variant genotyping, and one haplogroup classification report containing output from yKnot that identifies my sample’s place in the ISOGG tree.

Haplogroup Classification

I have provided below a portion of my yKnot file showing the placement of my YDNA on the ISOGG tree back to the R1b M343+ branch. For the moment, I sit on the S1217+/Z295+ branch (ISOGG, Big Tree). I do not match any subclades of S1217+/Z295+ yet identified, but I will follow developments in this area, and, having my genome already sequenced, can place myself on future revised trees without the need for any further SNP testing.

*Extras: Z1518+, Y4010+, 50f2(P)+, Z14907+, PH3244*, Y2550+, P80+, CTS1789+, CTS12019+, L1228+, M3629+, Z3327+, Z28+, FGC5628+, CTS12440+, PF2372+, M162_1*, FGC5085+, Z13028+, P266+, Z12253+, L798+, DYS257_2+, Z28771*, P27.2_2+, Y2252+, CTS616+, CTS2646*, M118+, M236+, Y2754+, FGC20667*, M141+, L665+, L588+, Z14350+, P34_5+, Z6859+, Z889+, Z13537*, Z6171+, Z1237+, FGC756+, BY451+, P19_1*, P79*, PF2276+, Z16986+, M5220+, FGC1920+, Z12467+, Z1842+, V161.1+, V190+, CTS6911+, CTS2518+, FGC4872+, Y5185*, Y2986+, Z1101+, CTS32+, Z15165+, IMS-JST022457+, PF2779+, S730+, S504+, Z836*, Z14050+, IMS-JST029149+, M1994*, L990+, P198+, Z16208+, PF3126+, Z2182*
R1b1a2a1a2a1a1a
|Matches: S1217+, Z295+
|____R1b1a2a1a2a1a1
     |Matches: S230+, Z209+, S356+, Z220+
     |____R1b1a2a1a2a1a
          |Matches: Z272+
          |*No-calls: Z274?, S229?
          |____R1b1a2a1a2a1
               |Matches: Z195+, S227+
               |*No-calls: S355?, Z196?
               |____R1b1a2a1a2a
                    |Matches: DF27+, S250+
                    |____R1b1a2a1a2
                         |Matches: P312+, PF6547+, S116+
                         |____R1b1a2a1a
                              |Matches: L151+, PF6542+, L52+, PF6541+, P310+, PF6546+, S129+, P311+, PF6545+, S128+, PF6539+
                              |*No-calls: (being investigated as to placement: L11?, S127)?
                              |____R1b1a2a1
                                   |Matches: L51+, M412+, PF6536+, S167+
                                   |____R1b1a2a
                                        |Matches: L23+, PF6534+, S141+, L49.1+, S349.1+
                                        |____R1b1a2
                                             |Matches: M269+, CTS623+, CTS2664+, PF6454+, CTS3575+, PF6457+, CTS8728+, L1063+, PF6480+, S13+, CTS12478+, PF6529+, F1794+, PF6455+, L265+, PF6431+, L407+, PF6252+, L478+, PF6403+, L482+, PF6427+, L483+, L500+, PF6481+, L773+, PF6421+, YSC0000276+, L1353+, PF6489+, YSC0000294+, M520+, PF6410+, PF6399+, S10+, PF6404+, PF6505+, YSC0000225+, PF6409+, PF6411+, PF6425+, PF6430+, PF6432+, PF6434+, PF6438+, PF6475+, S17+, YSC0000269+, PF6482+, YSC0000203+, PF6485+, S3+, PF6494+, PF6495+, PF6497+, YSC0000219+, PF6500+, PF6507+, PF6509+, L150.1+, PF6274.1+, S351.1+
                                             |*No-calls: PF6443?
                                             |**Mismatches: CTS8591- (exp. +), CTS8665- (exp. +), FGC464- (exp. +), CTS10834- (exp. +), CTS11468- (exp. +), FGC49- (exp. +)
                                             |____R1b1a
                                                  |Matches: P297+, PF6398+, L320+
                                                  |____R1b1
                                                       |Matches: P25_3+, L278+, M415+, PF6251+
                                                       |**Mismatches: P25_1- (exp. +), P25_2- (exp. +)
                                                       |____R1b
                                                            |Matches: M343+, PF6242+
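
For anyone who wants to work with a listing like the one above programmatically, here is a minimal Python sketch that pulls out each branch name and its matched SNPs. It assumes only the layout visible in the excerpt (a branch label on its own line, optionally prefixed with `|____`, followed by a `|Matches:` line); the function name and the shortened sample are mine, and a full yKnot report may contain line types this does not handle.

```python
import re
from typing import List, Tuple

# Shortened sample in the same layout as the listing above.
SAMPLE = """\
R1b1a2a1a2a1a1a
|Matches: S1217+, Z295+
|____R1b1a2a1a2a1a1
     |Matches: S230+, Z209+, S356+, Z220+
     |____R1b1a2a1a2a1a
          |Matches: Z272+
          |*No-calls: Z274?, S229?
"""


def parse_branches(text: str) -> List[Tuple[str, List[str]]]:
    """Return (branch name, matched SNPs) pairs in the order they appear."""
    branches: List[Tuple[str, List[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("|Matches:"):
            snps = [s.strip() for s in line[len("|Matches:"):].split(",") if s.strip()]
            if branches:
                branches[-1][1].extend(snps)
        elif re.fullmatch(r"(\|_+)?[A-Z][0-9A-Za-z]*", line):
            # A bare label such as "R1b1a2" or "|____R1b1a2" starts a new branch.
            branches.append((line.lstrip("|_"), []))
        # No-call, mismatch, and extras lines are ignored in this sketch.
    return branches


for name, snps in parse_branches(SAMPLE):
    print(name, snps)
```

The first entry returned is the most specific branch, since the listing runs from the tip of the tree back toward the root.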

Variant Genotyping

The first variant genotyping file provides my results at a little over 54,000 known SNPs. The second variant genotyping file provides results for an additional 16,600 SNPs. The results provided include counts of each base called at the SNP position as identified in my BAM file data, the SNP position on the chromosome, and the build 37 reference sequence call at that position. I do not know the criteria used to place each SNP in each file. I consider these files more as an intermediate step in the data analysis, used to generate the other returned files, but I expect I will find some more direct use for them as well.
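
To show how per-base read counts at a SNP position can be turned into a call against the reference, here is a minimal Python sketch. I do not know how FGC actually makes its calls; the record layout, the thresholds (`min_depth`, `min_fraction`), and the function names are all assumptions chosen for illustration. Because the Y chromosome is haploid, a single majority base is a reasonable simplification.

```python
from typing import Dict, Optional


def call_base(counts: Dict[str, int], min_depth: int = 4,
              min_fraction: float = 0.9) -> Optional[str]:
    """Return the majority base, or None if coverage or agreement is too low."""
    depth = sum(counts.values())
    if depth < min_depth:
        return None
    base, n = max(counts.items(), key=lambda kv: kv[1])
    return base if n / depth >= min_fraction else None


def classify(reference: str, counts: Dict[str, int]) -> str:
    """Label a position as 'reference', 'variant', or 'no-call'."""
    called = call_base(counts)
    if called is None:
        return "no-call"
    return "reference" if called == reference else "variant"


# Hypothetical read counts at three positions (not taken from my files).
print(classify("G", {"A": 0, "C": 0, "G": 12, "T": 0}))  # reference
print(classify("G", {"A": 11, "C": 0, "G": 0, "T": 0}))  # variant
print(classify("G", {"A": 1, "C": 0, "G": 1, "T": 0}))   # no-call
```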

Variant Discovery

The two variant discovery reports provide the most detailed and useful information in my opinion, as they include quality rankings on variants as well as the specific details of variants such as SNPs and INDELs. Even more usefully, these files contain the results for the kits most similar to mine within FGC’s database, which can help in identifying private variants that originated in much more recent genealogical times. Because these files include data from others as well as my own, I cannot comfortably release them to the general public without redacting other individuals’ data. For public facing purposes, if someone wanted to run comparisons against my detailed data I would most likely refer them to the Big Tree (if R1b) or advise that they pursue their own analysis with FGC directly.

The how-to document FGC provides with this analysis (Reading the Full Genomes analysis reports) explains working with this data much better than I could in my own words. The inclusion of quality scores greatly simplifies the process of narrowing down on key SNPs, and I look forward to spending more time with this data. That will probably happen after more Paradis males have had next generation YDNA sequencing, as my results appear rather distant from the nearest matching males in any database except for the one Paradis I’ve found with a Big Y at FTDNA.

Data Sharing

It pleased me to see that FGC offers a very quick and easy method to share your results with any email address you provide. I took advantage of this to share my data with Alex Williamson for inclusion in the Big Tree to aid in reconstructing the phylogeny of the R1b tree under R-P312. For now, my Big Tree entry sits in the R-Z295/S1217 paragroup, awaiting more submissions sharing SNPs with me to help identify a terminal SNP more recent than the estimated 3,900-year-old Z295. I don’t match any SNPs identified as downstream of Z295 on the FTDNA tree, the ISOGG tree, or the YFull tree. I encourage any other Z295 or Paradis/Pardy/Paradee/etc male to get your YDNA analyzed and shared with these projects so we can better place ourselves on the tree.

More Info

If this has interested you, I highly recommend you take a look at another review and description of FGC’s analysis.

Take my $1000 genome, please!

I have just released my whole genome sequence (WGS) to the public domain (CC0, no rights reserved), via the Harvard Personal Genome Project (PGP). I believe that my data represents both the first $1000 genome-with-analysis ever performed as well as the first $1000 genome released for public use. Thank you to both the PGP and to Veritas Genetics for making this possible. I would like to specifically thank Mirza Cifric, CEO of Veritas Genetics and also Christen Hart of Veritas for acting as my liaison and dealing with my frequent email requests for status updates. From my PGP profile page you can download my genome data (as a BAM file (17.8GB) or in VCF format (383MB)), as well as my 23andMe (v3, pre-FDA letter) SNP chip data and my full mitochondrial DNA sequence as tested by FamilyTreeDNA (since deposited in GenBank as accession ID KU530226).

Why would I do this?

Put simply, I wanted to make a contribution to science. Further, since working for a genomic drug development company in the 2000s where I met, then married, a bioinformatician, I’ve had an interest in the potential applications of genomics, from what some then referred to as the “pharmaceutically tractable genome” to today’s “precision medicine”. That employer spun off an early DNA sequencing platform (454 Life Sciences pyrosequencing, the first company to complete and make public an individual human genome), and I find it fitting that an ex-employee, and one from the IT staff, not even the scientific team, would release the first public $1000 genome.

I would like to see science make some good use of my genetic data. Only a relatively small number of whole genome sequences available for scientific research without privacy or intellectual property encumbrances exist. As a participant in the PGP, by making my genome available I hope not only to directly support scientific research but to aid the PGP’s other research goal to identify the risk and consequences of having one’s genetic data available to the public without any effort at de-identification or obfuscation. I have the benefit of living in one of the few states with genetic information laws that exceed the US Federal Genetic Information Nondiscrimination Act in placing restrictions on life insurance providers and others.

After my first blood labs with my current primary care doctor, she told me that I had the absolute worst blood levels of vitamin D that she had ever seen, along with the best HDL/LDL cholesterol levels she had seen. This comes from a genetic basis, not anything that I have pursued through diet or lifestyle. In fact my cholesterol should be, frankly, terrible, and though I live only a few miles south of the 45th parallel I get enough sun that lack of exposure can’t account for my vitamin D levels alone. My 23andMe data, when run through Promethease, reveals a train wreck throughout the vitamin D pathway, as well as matching many variants known to increase HDL cholesterol. With my whole genome sequence released for any imaginable use, I hope that researchers can either spot something unique enough on its own or work my data into genome wide association studies (GWAS) to tease out some drug targets or relevant alleles.

As a PGP participant I have filled out the PGP’s phenotype surveys to help associate phenotypes with my genotype. I have done the same at OpenHumans and remain willing to provide further phenotype data on request. I will attend the GET Conference and GET Labs 2016 at the end of April and get signed up with some other research studies.

You can also find my autosomal SNP chip data on GEDMatch as kit M205442, my YDNA data at ysearch under id CZVXU, and my full mitochondrial DNA sequence in GenBank as KU530226. Though services report my mtDNA haplogroup as U2e1*, I hope the next build of PhyloTree will note the mtDNA SNPs I carry beyond those that define U2e1 and define a new haplogroup: with my deposition, several mtDNA sequence motifs now have three independent depositions, enough to justify naming a new U2e1* branch.

I have much of my genealogy traced several generations back and several apparent triangulation groups’ worth of matches. Genealogy traces my surname back to the Paradis in Quebec but hits a brick wall in the mid 1800s, though my YDNA 67-STR results at FTDNA show close matches with other tested Paradis males who have traceable lineages back to Pierre Paradis of Mortagne-au-Perche, France (d. 1675), apparent patriarch of New World Paradis/Pardy lines. Several of my lines go back to early US colonials (Trowbridge provides my nexus to Charlemagne, though I’ve found no Mayflower descendants), as well as mixed ancestry (French/German/more) Creoles along the German Coast in Louisiana. I also have a bit of direct Scottish (Halcro) ancestry along with other Scots-Irish.

How can a security and privacy aware individual choose to release this data?

For me, the recognition that sequencing continues to fall in price and will eventually become ubiquitous to the point of banality, coupled with the fact that we shed DNA all day long, convinces me that any genetic privacy we may believe we have now exists only for a disappearing moment in history and only in the absence of a determined adversary willing to put some effort into collection.

Setting aside the issue of disclosing one’s unique genetic signature to third parties, simply knowing what secrets sit in one’s own DNA empowers some individuals but makes others uneasy. Some people do not want to know if their genetics give them a high probability of Alzheimer’s, or a disposition to cancer. Some regulators believe they cannot trust the public to make responsible decisions once given knowledge of the forbidden fruit in their genetic code.

Because science does not yet know enough about the complex interactions of all parts of the genome to determine the exact medical significance of every gene or non-gene variant, the interpretation of your static genome can and will change with the ongoing discovery of new genetic associations and with failures to replicate previously reported associations. By donating my sequence to an unencumbered public dataset I hope to help speed up this process and embolden others to take this step to share for science, with eyes wide open as to the limitations of data de-identification and possibilities of personalized medicine. Whether you share your genome through the PGP, your microbiome through uBiome, the next virus you catch through GoViral, your FitBit data through OpenHumans, your direct to consumer SNP chip results through OpenSNP, or any other data through any other platform, each of us has a unique chance to contribute to research to better lives today and our species tomorrow.

What does whole genome sequencing give a non-expert that SNP genotyping doesn’t?

Several years ago I took 23andMe’s genotyping test. As this occurred prior to the FDA sending 23andMe a nastygram barring them from reporting health-relevant results, I received a decent amount of information relevant to health issues. So why bother having a whole genome sequence done? To put it simply, a WGS has more long-term value than a genotyping SNP chip. As 23andMe V2 customers discovered, as time moves on and science learns more about genetic variants, and as new builds of the human genome get released, SNP results based on older data lose their relevance. New genome scaffolds obsolete what we believed we knew about older SNPs. New SNPs get discovered with more meaningful disease associations than those believed to associate with diseases years ago during chip design. With my whole genome sequence in my pocket, I have better positioning for the future as I can look up newly-reported variants going forward whether or not the designer of the probes on a SNP chip foresaw the relevance of that genetic region. If I develop cancer in the future, I or my medical providers can compare the sequence of a tumor cell to my genome sequence, easing the process of identifying genes that may have gone haywire and caused cancer, and potentially informing the selection of anti-cancer drugs that could save my life. Further, by ordering and releasing my whole genome sequence, scientists working with public datasets can perform more useful analyses than those available simply from releasing my SNP chip data.

Go use my data!

Updates

Mike Cariaso has graciously run Promethease against my WGS data. Results here. Unfortunately Promethease results expire after a number of days, rendering this report now inaccessible.

How to get started with genetic genealogy

This is a departure from what I usually write about, but technically it’s also about databases: GEDCOMs and genetic ones. This post will cover a general strategy to get started doing your own genetic genealogy work. I appreciate any comments you may have. If anyone is interested, I may write future posts on suggested tools and other tips.

Briefly, genetic genealogy is the act of supplementing traditional paper genealogy with genetic information. By doing so you can extend your family tree further, find distant (sometimes extremely distant) relatives and help confirm the details found in your genealogy research. If you were adopted or have known non-paternity events (NPEs) in your line back a few generations, this may be the only way to track down your real ancestors.

Overview

  1. Do as much genealogy as you can on paper
  2. Get yourself, and possibly other close relatives tested, by one of the well known companies whose tests enable this work
  3. Make contact with your matches as identified by those companies
  4. Compare family trees with your matches
  5. Share your genetic ancestry data in other places to broaden the scope of potential matches
  6. Extend your tree with the results of research done by your matches on your shared lines
  7. Make more contacts and use your previously confirmed ancestors to triangulate on your unknown matches

Step 1: Do Genealogy

So many others have written so much about getting started with and getting better at genealogy that I’m not going to cover this step in very much detail here. Do a few web searches, read what others have to say, and check for “how to” articles on any commercial genealogy sites you join.

The best way, in my opinion, to get started with genealogy is to stand on the shoulders of giants. Someone in your family, maybe a grandparent or second cousin, probably already does genealogy research and would be happy to share their data. But in case you can’t find someone like that or just want to get started on your own, here’s a little advice.

Make an account on ancestry.com. They simply have one of the best, easiest to use archives of vital records, wills, immigrant entries, military records, newspaper articles and so on. You can start with a 14 day free full access subscription and try to nail down as much as possible, then choose to subscribe or not depending on how much progress you’re making.

The mid term goal of this genealogy work is to produce a GEDCOM file, which is a database of people, their relationships, and source citations back to primary documents that confirm the relationship claims made in the file. You will then upload this file to various sites to share your research and help others find their match to you. You can optionally privatize the file so that people born after 1900 have their names hidden to avoid revealing information about other people that may not share your enthusiasm for finding your roots.
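
To make the privatizing step concrete, here is a minimal Python sketch that hides the name of anyone born after a cutoff year in a GEDCOM file. It relies only on the basic GEDCOM line layout (level number, tag, value) with `INDI`, `NAME`, `BIRT` and `DATE` tags; the sample people are invented and the function names are mine. Most genealogy programs have a built-in export option that does this more thoroughly, so treat this as an illustration of the idea, not a replacement for those tools.

```python
import re
from typing import List, Optional

# Tiny invented GEDCOM fragment for illustration.
SAMPLE_GEDCOM = """\
0 HEAD
1 CHAR UTF-8
0 @I1@ INDI
1 NAME Jean /Example/
1 BIRT
2 DATE 12 MAR 1850
0 @I2@ INDI
1 NAME Marie /Example/
1 BIRT
2 DATE ABT 1950
0 TRLR
"""


def birth_year(record: List[str]) -> Optional[int]:
    """Return the first four-digit year found under a record's BIRT tag."""
    in_birth = False
    for line in record:
        if line.startswith("1 "):
            in_birth = line.startswith("1 BIRT")
        elif in_birth and line.startswith("2 DATE"):
            match = re.search(r"\b(\d{4})\b", line)
            if match:
                return int(match.group(1))
    return None


def privatize(gedcom_text: str, cutoff_year: int = 1900) -> str:
    """Replace NAME values for individuals born after cutoff_year."""
    records: List[List[str]] = []
    for line in gedcom_text.splitlines():
        if line.startswith("0 ") or not records:
            records.append([])  # each level-0 line starts a new record
        records[-1].append(line)

    out: List[str] = []
    for record in records:
        year = birth_year(record)
        if record[0].endswith(" INDI") and year is not None and year > cutoff_year:
            record = ["1 NAME Private" if l.startswith("1 NAME ") else l for l in record]
        out.extend(record)
    return "\n".join(out) + "\n"


print(privatize(SAMPLE_GEDCOM))
```

Note the limits of this sketch: people with no recorded birth date are left untouched, and sub-tags under `NAME` (such as `GIVN` and `SURN`), dates, and places are not removed, so a real privatization pass needs to handle those too.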

While you work on your genealogy, proceed with DNA testing, the next step, because it takes a while and you’ll be spending a while waiting for your results.

Step 2: Get tested

You have several choices for testing. The big three companies are 23andMe, FamilyTreeDNA and AncestryDNA, but several other options exist for specialized use. I highly recommend 23andMe, for reasons I’ll explain below, but I’ll give some information about each. All three are based in the USA so the longer your family has been in the US, the more matches you will find (see digression below).

23andMe

Simply your best choice. For the same price, $99, you will receive genetic information about your health at the same time you receive information useful for genetic genealogy. 23andMe has busy community forums covering health, ancestry and genealogy, but the best part for our purposes is that they test more markers than the other options (since the other companies specifically do not test anything implicated in human health) and you can download your raw genetic data and have it processed by FamilyTreeDNA for a lower fee than having FTDNA test you directly.

The 23andMe test is a saliva test. They will send you a kit including a tube, into which you spit about a teaspoon of saliva, close the top, snap the paraffin seal to release the stabilization/lysing buffer solution and then send it back in a prepaid package. Totally painless unless you have trouble producing saliva or you are trying to test an infant.

FamilyTreeDNA

The strong point of FTDNA is 23andMe’s weak point. You only sign up for FTDNA if you are interested in genealogy, but many of 23andMe’s users are only there for health information and have zero interest in genealogy or helping you to research yours. The other strong point is their “transfer family finder” service which allows you to upload your 23andMe data file to FTDNA for a better price than testing directly with them. You’ll still receive all the same matches and benefits as if you had tested there directly.

Further, FTDNA has some test offerings the others don’t provide. While 23andMe will test enough single nucleotide polymorphisms (SNPs) on your Y-DNA and mitochondrial DNA to assign a high level haplogroup, FTDNA provides full mitochondrial sequencing and Y-DNA short tandem repeat (STR) testing. The Y-DNA test can help confirm genealogy along your direct male ancestor line, but the mitochondrial sequence is relatively useless for this kind of genealogy. I’ve had a 67-marker Y-DNA STR test done along with a full mitochondrial sequence, plus the family finder transfer of my 23andMe data.

FamilyTreeDNA does provide a way for you to download your raw test results. Their test is done by scraping a cotton swab on the inside of your cheek.

AncestryDNA

AncestryDNA is the newest provider of these tests. I have not used their testing service so I have no first hand knowledge of it. As I understand it they will scan your tree to find genealogical matches with your DNA matches and simplify the process of identifying your common ancestors. This sounds great, and it may be the best choice for those who can’t invest much time in this work, but the downside is that Ancestry has many users who aren’t as careful about validating and sourcing the data in their trees as a serious genealogist needs to be. You really have to double-check your matches’ work more carefully than on other sites. Being new, their database is currently the smallest of the big three, but it is growing rapidly.

AncestryDNA does support user download of their raw test result data file. As with 23andMe, their test is performed with a saliva sample.

A Digression

The quality and number of matches you will find on any of these sites depends significantly on your family background and the backgrounds of others who have elected to test. The majority of users on these sites are American, so if you are the second generation of an immigrant family, new to the US, you will find only a few matches. But if you can trace your lines to ancestors in the early US, you’re going to have hundreds or even thousands of matches. Or if you come from a highly endogamous population like the Ashkenazi Jews, you will have a lot of matches but they will be so far back in time you’ll have a lot of difficulty finding on-paper genealogical links.

Step 3: Make contact

I should call this step “wait”, since no matter which company you use, it will take a few weeks or months to get your results back. Use this time to work on your family tree some more.

Once you do receive your results, the fun starts. If you don’t check your email very frequently or your results have been in for a while, you may already have matches starting to contact you. FTDNA contacts are generally made directly through email to the address you share when signing up. For 23andMe users, you can send or receive a “sharing request”, which if accepted allows you and your match to compare your results to each other and your other matches with whom you have an accepted sharing request.

How do you find your matches? On FTDNA you go to the Family Finder Matches tool and review the list of names, their family trees, and the significance of your match. I’ll cover significance later. On 23andMe you go to the DNA Relatives tool and do the same thing, except most of your matches will have chosen not to reveal their name and family tree, so you’ll need to send them one of the sharing requests I mentioned and hope they accept. I imagine the process on AncestryDNA is both similar to and different from the way it works elsewhere.

Discuss your background with your matches and find out what surnames, locations, or other details your families may have in common. You may find a connection immediately, or there may be nothing obvious. File all this information away for later because you never know when you or they will update their family tree and your connection will suddenly be staring you in the face.

What does a match mean anyway?

The simple answer is that they share a portion of your DNA, based on both of you having inherited that portion from a common ancestor. The significance of the match is generally evaluated in terms of four variables:

  1. How many segments? A person with whom you match five segments on five different chromosomes is likely to be a much closer relative than someone with whom you match one segment on one chromosome.
  2. How long is the match? You measure the length of a match by examining the start and end positions on the chromosome where a segment matches. A match may be, for example, from position 16 million to position 50 million on chromosome 12. The longer the match, the closer it generally is, but see below.
  3. How densely tested is the match region? This is reported as a SNP count, the number of consecutive polymorphisms you share with your match on a segment. The more SNPs tested on a matching segment, the closer it generally is, but see below.
  4. How variable is the genomic region where your matching segment exists? Fortunately you don’t have to calculate this yourself. 23andMe and FTDNA will give you a number to represent this value for your matches. The variability of the region, the length of a match, and the number of tested SNPs all combine to give you a number of centiMorgans (cM) representing the significance of your match. Researchers disagree on how many cM a matching segment should have to be useful for genealogy, but bigger is definitely better. 5cM and 7cM are common minimum cutoffs. Anything larger than 10cM is quite useful in my opinion.
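
As a small illustration of applying these variables, here is a Python sketch that filters a list of matching segments by a minimum cM cutoff and summarizes what is left. The `Segment` fields and the sample numbers are hypothetical (they do not mirror any company's export format), and the 7 cM default is just one of the common cutoffs mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    chromosome: str
    start: int      # start position on the chromosome
    end: int        # end position on the chromosome
    cm: float       # segment size in centiMorgans
    snps: int       # number of SNPs tested within the segment


def summarize(segments: List[Segment], min_cm: float = 7.0) -> Dict[str, float]:
    """Count and total the segments at or above the cM cutoff."""
    kept = [s for s in segments if s.cm >= min_cm]
    return {
        "segments": len(kept),
        "total_cm": sum(s.cm for s in kept),
        "longest_cm": max((s.cm for s in kept), default=0.0),
    }


# Hypothetical segments shared with one match.
shared = [
    Segment("12", 16_000_000, 50_000_000, 31.5, 7200),
    Segment("3", 1_000_000, 4_000_000, 5.5, 600),
    Segment("7", 80_000_000, 92_000_000, 10.5, 1500),
]

print(summarize(shared))
# {'segments': 2, 'total_cm': 42.0, 'longest_cm': 31.5}
```

Raising or lowering `min_cm` is a trade-off: a lower cutoff keeps more distant true matches but also lets in more chance matches.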

Long Technical Digression

The detailed answer is much more complex. Feel free to skip this part. I’m skipping over some details but what I’ve described below is accurate enough for genetic genealogy.

Each of our DNA sequences is unique, unless you have an identical twin. Our nuclear DNA is organized into 23 pairs of chromosomes (except in cases like trisomy, where an individual has a third copy of a chromosome). One copy of each chromosome is inherited from your father and the other copy is inherited from your mother. Pairs 1-22 are the autosomes, while the 23rd pair is the sex chromosomes. Women have two copies of the X sex chromosome, designated XX, while men have one copy of the X and one copy of the Y chromosome, designated XY.

Now, when you inherit one copy of each autosome from each parent, you don’t inherit an exact copy of either of that parent’s two chromosomes. When egg and sperm cells form, the two copies of each autosome cross over and recombine. To give an example, you have two copies of chromosome 1, one from each parent. The chromosome 1 you pass to one child may take one third of its sequence from your father’s copy and two thirds from your mother’s copy, while the chromosome 1 you pass to another child may take one third from your mother’s copy and two thirds from your father’s. Which recombined copy your child inherits will determine how much they received on chromosome 1 from your mother versus your father. Repeat this over many generations, and sequences break up and rejoin repeatedly over time.

Because of this, the fundamental unit of genetic genealogy is the “half IBD segment”, which means “half identical by descent”. The half signifies that half of the segment (the half from one of your chromosomes, but not the other) is identical to one of someone else’s chromosomes, and that the segments are identical because both of you inherited them from a common ancestor. The alternative is an “IBS”, or “identical by state” segment, in which case you and this other individual happened to inherit sequences that match by chance, but did NOT get them from a common ancestor. You can’t easily identify these false positives in advance, so some proportion of your matches will be type 1 errors like this. You won’t ever find a common ancestor for such a match.
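
To make "half identical" concrete, here is a minimal Python sketch that compares two people's unphased genotypes SNP by SNP. Two genotypes are compatible with a half-identical segment when they share at least one allele; the run breaks where the two people are opposite homozygotes (for example TT against CC). The genotype lists are invented, the function names are mine, and `min_snps=3` is only for the toy data: real comparisons require runs of hundreds of SNPs plus a cM cutoff. The sketch also cannot tell IBD from IBS, which is exactly the false-positive problem described above.

```python
from typing import List, Tuple


def shares_allele(g1: str, g2: str) -> bool:
    """True if two unphased genotypes (e.g. 'AG', 'GG') have an allele in common."""
    return bool(set(g1) & set(g2))


def half_identical_runs(a: List[str], b: List[str],
                        min_snps: int = 3) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of compatible SNPs."""
    runs: List[Tuple[int, int]] = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if shares_allele(a[i], b[i]):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_snps:
        runs.append((start, n - 1))
    return runs


# Invented genotypes at eight consecutive SNPs for two people.
person_a = ["AA", "AG", "CC", "GT", "TT", "CC", "AG", "AA"]
person_b = ["AG", "GG", "CT", "TT", "CC", "CT", "AA", "AC"]

print(half_identical_runs(person_a, person_b))  # [(0, 3), (5, 7)]
```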

It gets even more complicated though. The commercial testing companies generally do not phase your genetic data. Instead they report the results of your SNP test at each position from both copies of your chromosomes, but they cannot tell whether a given sequence of consecutive SNPs came from copy A or copy B of your chromosome. This also contributes to false positive matches. There are ways around this, and if you phase your data you will have much better results with genetic genealogy. To phase your data, ideally you have both of your parents tested with the same test you take. That allows comparison of your father’s DNA to yours, and your mother’s to yours, and gives you a much more accurate picture of your DNA. There are tools online to automate the process for you (such as GEDMatch), but you need to have at least one parent tested. Two are even better.
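Here is a rough sketch of the logic behind phasing with parental data, again in Python. It assumes unphased genotypes as pairs of alleles, and the names phase_site and _could_pass are my own inventions for illustration, not the method any online tool necessarily uses. At each SNP it checks which assignment of the child's two alleles to father and mother is consistent with what the parents carry; a parent who was not tested is passed as None.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]  # two unphased alleles at one SNP


def _could_pass(parent: Optional[Genotype], allele: str) -> bool:
    """An untested parent (None) could have passed on anything."""
    return parent is None or allele in parent


def phase_site(
    child: Genotype,
    father: Optional[Genotype],
    mother: Optional[Genotype],
) -> Optional[Tuple[str, str]]:
    """Return (paternal_allele, maternal_allele) for the child at one SNP,
    or None when the site is ambiguous or inconsistent with the parents."""
    first, second = child
    options = set()
    # Try both ways of assigning the child's alleles to the two parents.
    for paternal, maternal in ((first, second), (second, first)):
        if _could_pass(father, paternal) and _could_pass(mother, maternal):
            options.add((paternal, maternal))
    if len(options) == 1:
        return options.pop()
    return None  # zero options: likely a test error; two options: ambiguous


# Both parents tested: the A must come from the father, the G from the mother.
print(phase_site(("A", "G"), ("A", "A"), ("G", "G")))
# Only the father tested: still resolvable at this site.
print(phase_site(("A", "G"), ("A", "A"), None))
# Everyone heterozygous: this site cannot be phased from the trio alone.
print(phase_site(("A", "G"), ("A", "G"), ("A", "G")))
```

The second and third calls show why one tested parent is enough to get started but two are better: every additional genotype removes some of the ambiguous sites.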

Unlike the autosomes, the Y chromosome is inherited nearly unchanged, passing from father to son. With detailed Y-DNA testing you can trace your direct male line back thousands of years. My Y-DNA test helped confirm that my male line descends from Pierre Paradis (1604 – 1675), of Mortagne-au-Perche, France, who immigrated to Quebec in 1651, even though my genealogy on that line hits a brick wall with my fifth great grandfather Henry H Paradis, born around 1847 in Rivière-du-Loup, Quebec. See this link on Paradis history if you’re interested in the line.

For various reasons, particularly the fact that women inherit one X from their mother and one X from their father, the X chromosome is not as useful for genetic genealogy as the Y chromosome. It does not travel an unbroken line of the same sex like the Y does.

Mitochondrial DNA, on the other hand, is passed only along the maternal line. Whether male or female, you inherited it from your mother. Unfortunately mitochondrial DNA changes so slowly that even if someone has an exact match to your full mitochondrial sequence, the common ancestor could still be 20 generations back and extremely difficult to find. My mitochondrial haplogroup, U2e1*, points to early European ancestry and then further back to the Indian subcontinent, but this is somewhere along the lines of 5000+ years ago and not useful for what I’m trying to do.

Complicating this further, we’re all related to each other somewhere. The hope is that you find people related closely enough that you can identify your genealogical link. But if, for example, you are of European descent, there’s a better than 95% chance that you descend from Charlemagne, probably along several lines (he was my 38th, 39th, and 40th great grandfather — yours too). Or if you trace back to early Quebec settlers, then you are probably related to 95% of French-Canadians.
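The arithmetic behind this is worth a quick look. A 38th great grandfather sits 40 generations back (a parent is 1 generation back, a grandparent 2, an nth great grandparent n + 2), and the number of ancestor slots in your tree doubles every generation. The helper names in this small Python sketch are just for illustration.

```python
def generations_back(nth_great_grandparent: int) -> int:
    """A parent is 1 generation back, a grandparent 2, a great grandparent 3."""
    return nth_great_grandparent + 2


def ancestor_slots(generations: int) -> int:
    """Number of positions in your pedigree at a given generation."""
    return 2 ** generations


gens = generations_back(38)
print(gens, ancestor_slots(gens))
```

That comes to over a trillion slots at generation 40, far more than the number of people alive at the time, so the same individuals have to fill many slots at once. That is why one person can be your 38th, 39th, and 40th great grandfather all at the same time.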

Step 4: Compare family trees

I believe AncestryDNA does this for you automatically, which is a huge point in their favor. Otherwise you need to review your matches’ surname lists and compare them to yours to find your common link. Sometimes this is easy, if you’ve both done a lot of genealogy work, and sometimes it’s difficult, like if one of you was adopted, has large gaps in their tree, or simply hasn’t done much genealogical research. There are some third-party ways to simplify this process which I will get to later.
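If you keep your surname list in a text file or spreadsheet, the manual comparison is easy to script. This is only a sketch under my own assumptions: the normalize and shared_surnames helpers are hypothetical, it says nothing about how AncestryDNA does its matching, and it will not catch spelling variants, only differences in case, accents, and stray spaces.

```python
import unicodedata
from typing import Iterable, List


def normalize(surname: str) -> str:
    """Lowercase, trim, and strip accents so 'Côté ' and 'cote' compare equal."""
    decomposed = unicodedata.normalize("NFKD", surname)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


def shared_surnames(mine: Iterable[str], theirs: Iterable[str]) -> List[str]:
    """Return the normalized surnames that appear in both lists, sorted."""
    return sorted({normalize(s) for s in mine} & {normalize(s) for s in theirs})


my_list = ["Paradis", "Pelletier", "Beecher"]
match_list = ["PELLETIER", "Prindle", "paradis "]
print(shared_surnames(my_list, match_list))
```

A shared surname is only a lead, of course. You still have to check that the two trees actually meet at the same people.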

Step 5: Share your ancestry information

The easiest thing to do here is make sure you fully fill out your user profile on the testing site you use. This will help your matches to do some of the matching work for you, and make them more likely to get in contact with you.

The best thing you can do, though, is upload your raw data to GEDMatch. This is a third party tool run by volunteers for free (they accept donations if you find it useful) that allows users from 23andMe, FTDNA and AncestryDNA to all put their data in one place so that you can compare across vendors. Otherwise you can never be sure if this one guy on FTDNA that you match also matches this one woman from 23andMe and so on.

I can’t emphasize enough how useful GEDMatch is, and how much you’ll help other genetic genealogists by uploading your data there. The service they provide is in many ways superior to that offered by the commercial testing companies. They also support uploading your GEDCOM and doing the family tree matching for you, but that feature is unavailable for now due to the huge influx of data submitted recently. It will be back someday. Once you’ve used it, it is tough to do this work without it.

Step 6: Extend your tree

If you’re lucky you’ve been able to identify common ancestors with some of your matches by now. Look through their trees, and if they have any details about your ancestors that you don’t, add them to your tree. If they have the line traced back farther, extend the line in your tree. Add the other descendants of your common ancestors to your tree. You’re related to them, if only distantly, and having those surnames in your tree may help you track down your other matches.

I’ve confirmed via paper genealogy matches as close as third cousins and as far back as ninth cousins. I have documented ancestors going back to early New World settlers so that means I have a LOT of matches and finding the link with other people that have old confirmed lineages eventually gets quite easy. But there are many more people who descend from these early settlers than there are people that can document their ancestry back to them, so sometimes it can be frustrating.

My easiest matches go back to colonial days in the US, particularly some of the early Connecticut settlers like Eleazer Beecher and Phebe Prindle. Early Quebec settlers like Nicholas Pelletier and Jean de Vouzy are another great source for confirmed matches. I also have some large clusters from early French settlers in Louisiana, as well as Quebec French who immigrated to Louisiana later.

As a reference point, I am sharing with nearly 100 matches on 23andMe. I have confirmed genealogical ancestry with somewhere around ten of them. Your results will vary. One of my most recent matches had a detailed family tree and I found our ancestors in 1780s Louisiana after only about ten minutes of work. I was the first person she shared with, so while I only have a 10% success rate she’s at 100%.

Step 7: Triangulate!

The only way to do this is to share with as many people as possible on 23andMe, manually collate your matches from FTDNA or use GEDMatch. Share with people even if you see no obvious connection besides your matching segment. As you accumulate matches, you will eventually discover multiple people that you match in the same region of the same chromosome.

Once you have a list of two or more people you match in the same region, compare them to each other. If you match person A at a particular region, and you match person B at the same spot, compare A to B. If they match each other at the same spot, congratulations. All three of you very likely share a common ancestor. If A and B do not match each other, then most likely you match A on the copy of the chromosome you inherited from your mother and you match B on the other copy, inherited from your father, so that can help you track down the common ancestor you have with each, even though A and B are not related.
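The decision described above can be written down in a few lines of Python. This is a sketch with hypothetical names (Segment, overlap, triangulate) and made-up positions. It assumes you have already pulled the segment you share with A, the segment you share with B, and the list of segments A and B share with each other out of whatever tool you use.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    chromosome: str
    start: int  # start position in base pairs
    end: int    # end position in base pairs


def overlap(s1: Segment, s2: Segment) -> Optional[Segment]:
    """Return the region two segments have in common, or None."""
    if s1.chromosome != s2.chromosome:
        return None
    start, end = max(s1.start, s2.start), min(s1.end, s2.end)
    if start >= end:
        return None
    return Segment(s1.chromosome, start, end)


def triangulate(me_a: Segment, me_b: Segment, a_b: List[Segment]) -> str:
    """Classify a pair of matches that overlap on my chromosome."""
    shared = overlap(me_a, me_b)
    if shared is None:
        return "no overlap"  # A and B match me in different places
    if any(overlap(shared, seg) is not None for seg in a_b):
        return "triangulated"  # all three likely share a common ancestor
    return "opposite copies"  # likely one match per parental chromosome


me_a = Segment("7", 10_000_000, 25_000_000)
me_b = Segment("7", 18_000_000, 30_000_000)
print(triangulate(me_a, me_b, [Segment("7", 17_000_000, 26_000_000)]))
print(triangulate(me_a, me_b, []))
```

The "opposite copies" result is still a guess. A and B could also fail to match each other because one of your two matches is an IBS false positive, which is one more reason to keep collecting matches in the same region.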

Where it gets really interesting is when you have a cluster of several people that all match you and each other but stubbornly resist identification. Then you find a new match who matches all of them, and you find your common ancestor with this new match based on the quality of their genealogical research. That allows you to positively assign a spot in history to the rest of your cluster and may help with future identification. This was the case for me with the recent Louisiana match I mentioned. This match was on a cluster including a woman in Italy that had only one known ancestor who went to the US. We were quite sure our match was somewhere along this American immigrant’s line, but since my new match places a portion of this segment in 1788 Louisiana, that means my match with the Italian woman is older than that, likely somewhere in France, Germany or Luxembourg in the 1600s or earlier, based on the ancestors of this specific Louisiana settler family.

I’m planning another blog post later on ways to leverage the clusters you’ve identified using 23andMe’s Ancestry Finder tool and GEDMatch. The method will be obvious to anyone who has done this a while, but I haven’t seen anybody write it up yet.

Additional Resources

Here are links to the companies and sites I’ve mentioned along with a few other reference materials on genetic genealogy.

Disclaimer

Other than the 23andMe referral link, I have no employment relationship with any of the sites mentioned or linked, nor have I received any compensation for this post. I am a happy user/member/reader of many of the sites and I will get only the indirect benefit of having your DNA tested and potentially matched to mine.