
Tuesday, June 24, 2014

Biases in Genetic Models that are generally overlooked


Before continuing with the Y chromosome C haplogroup and its peculiar global range, I want to dedicate today's post to "Models" in general (and those used in genetics in particular), and how they shape the way we believe things work.


As an engineer I have a clear notion that a model is a simplification of the real world: we make assumptions and find that models allow us to make predictions that are reasonably close to what actually happens. This allows us to build a bridge that will not collapse and yet uses a minimum amount of steel, or to shoot a cannonball from A and hit a target at B. The models can be more complex (relativity is taken into account to make your GPS work correctly, and quantum mechanics in all things electronic), but all of them are just that: a "model", an approximation to reality, not reality itself.


Models and Genetics: a critical overview


Let's review some causes of errors and the hidden biases in genetic models and methods:


Populations


Some studies use strange populations such as "MXL, Mexican Ancestry from Los Angeles USA" or "ACB, African Caribbeans in Barbados" [10], which are far less informative than a native aboriginal individual in his or her homeland, such as an "Alakaluf from Magallanes, Chile".


What conclusions can be reached on the peopling of America by looking at the genome of a person of Mexican ancestry living in LA? Mexico has several native groups, later overlaid by Spanish conquistadors (the Spanish are themselves a mixture of many ethnic groups: aboriginal Iberians, Basques, Celts, Romans, Carthaginians, Arab invaders, Goths, etc.), the African slaves they brought over to work in their plantations, and a touch of other European, Southern and Mesoamerican groups. In other words, Mexicans are a very admixed population. For this question the MXL are actually irrelevant, and so are the descendants of former slaves of African origin living in Barbados!


Then we have studies that ignore the New World completely. Samplings cover populations in Africa and Eurasia (sometimes including Australia and PNG), seldom the Americas. This of course limits the usefulness of these studies and maybe conceals interesting findings that will remain ignored until American samples are contrasted with Old World ones.


The Sampling within the populations


Once the populations have been identified (with all the caveats mentioned above), their ancestry is studied by means of a sample of individuals that in theory (but not in practice) is drawn from each of them at random.


In other words, a sample of size "n" is taken from a population of size "N". In most populations "N" is several orders of magnitude larger than "n" (imagine a sample of 200 Italians from Tuscany out of the 61 million people living in Italy). This implies that we may have a sampling bias, and leave out some unique or even critical genetic sequences that appear at low frequencies in a given population.
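To put a rough number on this, here is a minimal sketch in Python. It assumes simple random sampling with replacement (a fair approximation when "N" is far larger than "n"), and the 0.5% frequency is a made-up figure for illustration, not one taken from any study.

```python
def prob_variant_missed(frequency: float, sample_size: int) -> float:
    """Chance that a lineage carried by a fraction `frequency` of the
    population does not show up even once in a random sample."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size cannot be negative")
    # Each sampled individual independently lacks the lineage
    # with probability (1 - frequency).
    return (1.0 - frequency) ** sample_size


# Hypothetical lineage present in 0.5% of the population, sample of 200.
print(round(prob_variant_missed(0.005, 200), 2))
```

Under these assumptions a lineage carried by one person in two hundred would be missed entirely in roughly a third of such samples, and real samples are usually less random than this idealized one.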


In other cases the sample is not a random one (Ascertainment bias); for instance, in the case of small tribes: the sample "n" is small because "N" is also small and perhaps encompasses a clan or family group. Thus diversity is low and, as we will see below, calculations based on these samples will be affected by this sampling bias.


Ascertainment bias also arises when researchers take samples from databases in which some populations are missing or some regions are underrepresented, or from volunteers (e.g. a university class), who are definitely not a random sample.


The Typing of Haplogroups in those samples


Once we have a sample of our population, we test the Y chromosomes (or, in the case of mtDNA, the mitochondrial DNA) for known markers (remember this: known, we will get back to it later) and type the individuals based on these markers. We don't read the whole sequence of some 60 million base pairs in a Y chromosome and compare them all. Instead we choose certain positions which we believe are markers and base our analysis on them. This introduces another ascertainment bias: the choice of markers is not random.


Look at it this way: we have a million books printed in different languages. We take a sample of fifty from the pile and compare them based on a few of the words printed in them. If we find the word "Chapter" we place the book in the English group, "Capítulo" in the Spanish one, "Hauptstück" in the German one. All the other words are ignored. We may even find a book printed in Armenian that by chance carries the word "Chapter"; we will place it in the "English" group and never learn that an "Armenian" group exists. We may not have sampled a Chinese book at all, and so remain unaware of its existence too.


Yes, the analogy is faulty (well, it is a model after all): the markers we look for in genetics sit at specific locations within a chromosome, and we compare the markers at those positions with each other to define similarity. In the books example the analogy breaks down because the word "Chapter" can appear on any page, not on a specific one. Yet what I wanted to point out is that even though there are millions of base pairs in a chromosome, we only use a handful to define haplogroups.

Let's look at this in depth:


Y chromosome SNPs, the Haplogroups


First of all, the Y chromosomes are sequenced and the Single Nucleotide Polymorphism (SNP) that identifies a given haplogroup (hg) is identified. An example of an SNP is shown below:


  • AAGCCTA - ancestral
  • AAGCTTA - derived

The fifth nucleotide, the "C" in the ancestral variant, mutated to a "T" in the derived variant. This SNP occurs at a specific location. These mutations are assumed to be rare and, once they happen, they remain in the DNA and are inherited by all descendants of the individual in whom the mutation arose. This allows us to build trees based on which SNPs are present in the populations.
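As a small illustration, the comparison between the two variants can be sketched in Python. The function name is my own, and the sketch assumes two already aligned sequences of equal length, which real data seldom hands you so neatly.

```python
from typing import List, Tuple


def find_snps(ancestral: str, derived: str) -> List[Tuple[int, str, str]]:
    """Return (position, ancestral base, derived base) for every site
    where two aligned sequences differ. Positions are 1-based."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position, old, new)
        for position, (old, new) in enumerate(zip(ancestral, derived), start=1)
        if old != new
    ]


print(find_snps("AAGCCTA", "AAGCTTA"))  # [(5, 'C', 'T')]
```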


The SNP that identifies the whole Hg C is marker RPS4Y711. The different haplotypes also have their own markers: M8 (for C1), M38 (for C2), M217 (for C3), M93 (for C3a), and so on.


But, as I mentioned in a previous post, it is possible that these SNP mutations revert (in that post I mention the Motala remains sequenced for a Q haplotype which lacked a key marker but had all the others), so this may also introduce additional errors.


Haplotypes and Paragroups


So, in our sample, which may already have a sampling bias, we will find that most individuals belong to a given haplogroup (e.g. Y chromosome hg C) and to a given haplotype (e.g. if they tested positive for marker M38 we will be confident that they belong to the C2 haplotype).


But... some cases will not test positive for any of the known markers (e.g. M8 excluding C1, M217 excluding C3, M347 excluding C4 or M356 excluding C5), so we will assume that they form a paragroup that includes all C-other lineages, and we will identify it as C* (with an asterisk).
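The typing logic can be sketched in a few lines of Python. This is a simplification of my own: it uses only the markers named in this post, treats the subclades as a flat list (ignoring nested ones such as C3a), and the marker "M999" in the usage lines is invented to stand in for a mutation that is not on the panel.

```python
from typing import Set

# Marker -> subclade, as listed in this post.
SUBCLADE_MARKERS = {
    "M8": "C1",
    "M38": "C2",
    "M217": "C3",
    "M347": "C4",
    "M356": "C5",
}
HG_C_MARKER = "RPS4Y711"


def classify_hg_c(positive_markers: Set[str]) -> str:
    """Assign a sample to a C subclade from the markers it tested positive for."""
    if HG_C_MARKER not in positive_markers:
        return "not C"
    for marker, subclade in SUBCLADE_MARKERS.items():
        if marker in positive_markers:
            return subclade
    # Positive for hg C but for none of the known subclade markers.
    return "C*"


print(classify_hg_c({"RPS4Y711", "M38"}))   # C2
print(classify_hg_c({"RPS4Y711"}))          # C*
print(classify_hg_c({"RPS4Y711", "M999"}))  # C* ("M999" is hypothetical)
```

The last two samples end up with the same C* label even though one of them carries a mutation the other lacks: the panel simply has no way of telling them apart.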


So these paragroups clump together a large variety of haplotypes for which we have not yet identified the specific "unique" markers that would set them apart as new haplotypes.


Hidden diversity


This is not a trivial matter. For instance, paragroup C* is found all across the eastern edge of the Old World, from Australia, across Indonesia, China, Japan and India, to Bering, at frequencies that range from 0.3% to 10% of the population (some are even higher: C3* among Mongols reaches a frequency of 18% [6]).


To assume that they all belong to the same group is erroneous: the paragroup masks subclades (haplotypes) which have not yet been discovered. Maybe C* in Australia is a not yet identified C8 hg, while C* in Central Asia is a yet undiscovered C9 hg, for example. In other words, there is hidden diversity out there waiting to be discovered.


Bias in the choice of the markers


The SNPs are not chosen in a random manner; they are defined by geneticists to type haplogroups. This means that the real distribution of polymorphisms in a population may differ from the one shown in a study. The reason is simple: the genotyping arrays (or chips) used to identify the markers contain biased sets of pre-ascertained SNPs. These SNPs tend to be older than the majority of the SNPs in a given population, and are found in many populations besides the one being studied. These hand-picked SNPs act as sieves, classifying the samples and causing alterations such as "[shifts in] allele frequency distributions ... towards intermediate frequency alleles"; furthermore, "estimates of linkage disequilibrium are modified" [11].


In other words, this increases the frequency of the most commonly polymorphic loci and eliminates other markers (loci that are less polymorphic in the screening panel). This ascertainment bias in the SNP arrays strongly skews the estimates of genetic diversity by ignoring those that are not included in them.


Variety within the Haplotypes


Once the haplogroup and haplotype have been identified, we can look at the microsatellites to uncover even further diversity. Microsatellites are short motifs (2 to 6 nucleotides long) repeated "n" times in a row (n = 5 to 100). An example: the sequence "AT" repeated 25 times in a row, which is written as (AT)25.


These microsatellites are found across species and mutate quicker than single point mutations (SNPs) and for this reason they are used as markers to define the subclades within haplotypes. An example would be a mutation in (AT)25 to (AT)24 or to (AT)26.
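To make the (AT)25 notation concrete, here is a minimal Python sketch that counts how many times a motif repeats back to back. The flanking bases are made up, and counting the longest uninterrupted run is my own simplification of how an allele would be called.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(match.group(0)) // len(motif) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Made-up flanking sequence around the repeat.
allele_25 = "GGC" + "AT" * 25 + "CCA"
allele_24 = "GGC" + "AT" * 24 + "CCA"

print(longest_repeat_run(allele_25, "AT"))  # 25
print(longest_repeat_run(allele_24, "AT"))  # 24
```

A mutation from (AT)25 to (AT)24 shows up here as nothing more than a change in that single count, which is all that an STR marker records.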


In the case of our Y chromosome, we use microsatellites known as Short Tandem Repeats (STRs). These are named "DYS" plus a number (e.g. DYS393), which identifies the position of the STR. The repeated unit of an STR is a short motif; it is not always a dinucleotide, and many commonly typed Y-STRs, DYS393 among them, repeat a three- or four-nucleotide motif.


Below is a real example, for haplogroup C, for four individuals: two from Colombia, one Korean and one Khalkh (from Mongolia):


genetic sequence
A real set of STRs for four individuals. Copyright © 2014 by Austin Whittall

We can see in the image above that some DYS markers differ in the quantity of repeats (they are shaded pale blue and yellow).


The Evolutionary sequence


Now come the questions: Does the Korean derive from the Colombians or is it the other way round? What about the Mongolian? Note the differences at DYS392 and DYS393 between Koreans and Colombians (in yellow) and the difference between Colombians and Mongolians at the other DYSs (pale blue).


Without an a priori theory you can't answer those questions just by looking at the repeats. Some additional assumptions are necessary, and computer programs are used to build the phylogenetic trees that link these individuals.


We could assume that humans came from Asia and peopled America, so the Americans are more recent than Asians. And as Koreans are Asians, they predate the Colombians so they accumulated mutations, passing from 11 to 12 in DYS392 and 13 to 15 in DYS393. So far so good, but then we have a Khalkh from Mongolia, also an Asian, with fewer accumulated mutations than the Colombians or the Koreans. This way of comparing STRs is faulty.


Building a Phylogenetic Tree


The individuals are placed on phylogenetic trees using other assumptions. These also rest on the differences between individuals at the various STRs, but follow a different reasoning process from the one used above.


Distorting elements


Nevertheless we must remember that each DYS may have its own mutation rate, so if there are several DYS that differ, then they all have to be considered to calculate the "distance" between individuals.
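One way of taking per-locus mutation rates into account is sketched below in Python. It is only one of many possible weightings, chosen by me for illustration: squared repeat differences divided by the rate of each locus, averaged over loci. The repeat counts for DYS392 and DYS393 echo the 11 vs. 12 and 13 vs. 15 contrasts mentioned in this post; the DYS19 values and all three mutation rates are invented.

```python
from typing import Dict


def rate_scaled_distance(
    hap_a: Dict[str, int],
    hap_b: Dict[str, int],
    mutation_rate: Dict[str, float],
) -> float:
    """Mean over shared loci of (repeat difference squared / locus mutation rate).

    A one-repeat difference at a slow locus counts for more than the
    same difference at a fast one.
    """
    shared_loci = sorted(set(hap_a) & set(hap_b))
    if not shared_loci:
        raise ValueError("the haplotypes share no loci")
    total = 0.0
    for locus in shared_loci:
        difference = hap_a[locus] - hap_b[locus]
        total += difference * difference / mutation_rate[locus]
    return total / len(shared_loci)


# Hypothetical haplotypes and hypothetical per-generation mutation rates.
individual_1 = {"DYS392": 11, "DYS393": 13, "DYS19": 15}
individual_2 = {"DYS392": 12, "DYS393": 15, "DYS19": 15}
rates = {"DYS392": 0.0005, "DYS393": 0.001, "DYS19": 0.002}

print(rate_scaled_distance(individual_1, individual_2, rates))
```

Change the assumed rate of a single locus and the "distance" between the same two individuals changes with it, so the tree built from such distances inherits every uncertainty in the rates.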


An additional complication is that the "real" evolutionary history of any given set of individuals may differ from the "inferred" evolutionary history, as can be seen in the following image, where only three (3) mutations out of twelve (12) are detected during analysis. The other nine (9) remain undetected. This affects the estimates of divergence: since the number of mutations is underestimated, so is the split time between the ancestor and its descendants:


missing mutations
How mutations are Underestimated. Adapted from Fig. 2 in [4]

So "corrections" are introduced, such as the Jukes-Cantor Model [4], which fiddles with the equations used in the models to make them fit reality better.
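For reference, the standard Jukes-Cantor correction turns the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). A minimal Python sketch follows; the two short sequences are made up, and the model itself assumes that all four bases are equally frequent and that every substitution is equally likely.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal in length and non-empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site, correcting for hidden repeat hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Made-up aligned sequences: 2 visible differences in 20 sites.
observed = p_distance("AAGCCTAAGCCTAAGCCTAA", "AAGCTTAAGCCTAAGCATAA")
print(observed, jukes_cantor_distance(observed))
```

The corrected value is always at least as large as the observed one, which is the whole point: it tries to add back the mutations that left no visible trace. How well it does so depends on assumptions that real sequences need not satisfy.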


So, how are the differences calculated?


Computer algorithms add even more assumptions


Algorithms are used. They are run on computers and compare the individuals in a pair-wise manner. There are several algorithms, each with their pros and cons. The two basic classes are:


  • Distance-Based Methods, such as Neighbor Joining (NJ), which starts from an unresolved star-like tree and compares the sums of branch lengths in a pairwise manner. It then groups as "close" the pair with the minimum sum. This pair is linked in a branch and the process begins again, iterating until all individuals have been grouped.
  • Maximum Parsimony methods, which use certain features (substitutions in the sequences) to work out the simplest evolutionary relationship among individuals. They build the tree that requires the fewest substitutions along the chain from the common ancestor to the individuals being placed on it. So each possible tree is scored and the one that minimizes the number of mutations is kept.
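The pair-selection step of Neighbor Joining can be sketched as follows in Python. This covers only the first iteration and uses the usual textbook criterion, Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, rather than the code of any program cited in this post. The five labels and the distance matrix are toy data.

```python
from typing import List, Sequence, Tuple


def nj_pick_pair(names: Sequence[str], dist: List[List[float]]) -> Tuple[str, str]:
    """Return the pair of taxa that Neighbor Joining would join first."""
    n = len(names)
    if n < 3 or any(len(row) != n for row in dist) or len(dist) != n:
        raise ValueError("need a square distance matrix for at least 3 taxa")
    row_sums = [sum(row) for row in dist]
    best_q = None
    best_pair = (names[0], names[1])
    for i in range(n):
        for j in range(i + 1, n):
            # Low Q means: close to each other AND far from everybody else.
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_q = q
                best_pair = (names[i], names[j])
    return best_pair


# Toy symmetric distance matrix for five hypothetical individuals.
labels = ["A", "B", "C", "D", "E"]
toy_distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_pick_pair(labels, toy_distances))  # ('A', 'B')
```

Note that D and E have the smallest raw distance (3), yet the algorithm joins A and B first. The choice is driven by the criterion built into the method, which is one more assumption layered on top of the data.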

The trees are then rooted by comparing them with some outgroup species (e.g. chimpanzees are used for human and hominin comparisons). Of course this requires the assumption that molecular clocks are valid and that the divergence date with the outgroup species is well known (more on this below, see "The Clock that ticks out of time").


An example: Batwing


Batwing [8] (acronym which stands for Bayesian Analysis of Trees with Internal Node Generation), is a widely used computer program for analysis of genetic data. It has some implicit assumptions that I list below which are not mentioned in the papers that use it, but which impact on the outcome of the program's analysis. By the way, the authors of the program clearly point out that "Natural populations are unlikely to satisfy BATWING's modelling assumptions" [8].


  • The data is a random sampling from the population (we have seen above it is not usually the case)
  • The population is panmictic (not frequent in human groups)
  • Splitting between populations is instantaneous (actually it takes plenty of time)
  • There are no subsequent migration events between populations (there is always later admixture due to migrations between populations that have split)

Batwing uses different mutation models, but the default setting is the Stepwise Mutation Model or SMM, which we analyse in detail below.


A comment: the TMRCA (time to most recent common ancestor), Ť, is calculated under the simple SMM using the expression: Ť = Δ ⁄ μ.


Where Δ is the average squared difference in the number of repeats between all sampled Y chromosomes and the founder haplotype, averaged over STR loci, and μ is the simple SMM mean mutation rate per generation, averaged over loci. But if, as we will see below, the SMM is not very reliable, then how can clade age estimates be reliable?
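The Ť = Δ ⁄ μ expression can be written out in Python as a minimal sketch. Everything numeric here is invented for illustration: the founder haplotype, the four sampled chromosomes and the mutation rates. In practice the founder haplotype is itself an inference, not an observation.

```python
from typing import Dict, List


def tmrca_generations(
    sample: List[Dict[str, int]],
    founder: Dict[str, int],
    mutation_rate: Dict[str, float],
) -> float:
    """T = delta / mu, in generations.

    delta: squared repeat difference to the founder, averaged over the
           sampled chromosomes and then over loci.
    mu:    per-generation mutation rate averaged over the same loci.
    """
    if not sample or not founder:
        raise ValueError("need at least one chromosome and one locus")
    loci = sorted(founder)
    per_locus = []
    for locus in loci:
        squared = [(hap[locus] - founder[locus]) ** 2 for hap in sample]
        per_locus.append(sum(squared) / len(squared))
    delta = sum(per_locus) / len(per_locus)
    mu = sum(mutation_rate[locus] for locus in loci) / len(loci)
    return delta / mu


# Hypothetical data.
founder_haplotype = {"DYS392": 11, "DYS393": 13}
sampled = [
    {"DYS392": 11, "DYS393": 13},
    {"DYS392": 12, "DYS393": 13},
    {"DYS392": 11, "DYS393": 14},
    {"DYS392": 11, "DYS393": 15},
]
assumed_rates = {"DYS392": 0.002, "DYS393": 0.002}

generations = tmrca_generations(sampled, founder_haplotype, assumed_rates)
print(generations, "generations")
```

With these invented numbers Δ is 0.75 and the estimate comes out at about 375 generations. Halve the assumed mutation rate and the clade becomes twice as old; turning generations into years then needs a generation length, which is yet another assumption.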


The Stepwise Mutation Model


This Stepwise Mutation Model (SMM) [2] was proposed in 1973 by Ohta and Kimura and has been widely adopted as the model for microsatellite evolution:


Microsatellites are believed to evolve neutrally: natural selection does not influence the number of repeats so, the SMM premise is: "In one generation the repeat number can only increase or decrease by one, and the probability is equal".
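The premise quoted above is simple enough to simulate in a few lines of Python. This is a sketch of the idealized model only; the starting length, the mutation rate, the seed and the floor of one repeat (added so the count cannot drop to zero) are all choices of mine.

```python
import random
from typing import List


def simulate_smm(
    start_repeats: int,
    generations: int,
    mutation_rate: float,
    rng: random.Random,
) -> List[int]:
    """Follow one lineage under the strict stepwise model.

    Each generation the locus mutates with probability `mutation_rate`;
    a mutation adds or removes exactly one repeat with equal probability.
    """
    repeats = start_repeats
    trajectory = [repeats]
    for _ in range(generations):
        if rng.random() < mutation_rate:
            step = rng.choice((-1, 1))
            # Floor at one repeat: a sketch-only assumption, not part of the SMM.
            repeats = max(1, repeats + step)
        trajectory.append(repeats)
    return trajectory


rng = random.Random(2014)  # fixed seed so the run can be repeated
lineage = simulate_smm(start_repeats=25, generations=1000, mutation_rate=0.002, rng=rng)
print(lineage[0], lineage[-1])
```

Nothing in this loop knows about the length of the repeat, its motif, or jumps of more than one unit. Whatever the model leaves out, the simulation leaves out as well.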


But this assumption does not hold exactly, for several reasons:

  • Actually, the probability of mutation is larger for longer microsatellites [1][3].
  • A long set of repeats (n larger than 20) may cause physical instability in the microsatellite and hamper its further growth, actually leading to its contraction [3][2].
  • Some microsatellites are interrupted and have lower mutation rates.
  • The repeat unit also influences mutation rates: dinucleotides mutate slower than tetranucleotides.
  • The motif of the dinucleotide (e.g. TG vs. TA) also plays a role: microsatellites with certain motifs tend to be much longer than others.
  • Multi-step mutations (those that change the repeat number by more than 1) are not uncommon and happen about 15 to 22% of the time [3]; in other words, the model ignores a big chunk of mutations.

Add to this that insertions or deletions next to the microsatellites also influence their lengths [3].


Even the "neutrality" of microsatellites is questionable, since some repeats sit in promoter regions and may influence protein building [3] and thus be subject to natural selection. Some microsatellite repeats have been linked to certain diseases (myotonic dystrophy, Huntington's disease, etc.), making their neutrality doubtful too.


There are also "point mutations" that interrupt a repeat. An example: (AT)18 may suffer a chance point mutation where "A" mutates to "G" in the tenth repeat, causing the new sequence to be (AT)9 GT (AT)8. This transformation may go undetected in sequence analysis, altering the mutation rate estimates.


Panmictic populations


Last but not least, the SMM assumes that the individuals are a random sample from a single panmictic population of constant size "N", and this is not the case [5]. Applying it to expanding populations, or to those with mixing due to migration, may give different results.


A panmictic population allows random mating without any restrictions of any kind (due to age, genes, behavior, social, environment, etc.), which is seldom the case in human populations, past or present.


Migrations


When an ancestral population splits into two groups, they are subjected to two opposing processes (see image below):


drift and mutation in splitting populations
How genetic drift and migration affect allele frequencies. Copyright © 2014 by Austin Whittall

  • Genetic Drift. It arises because Ne (the effective number of breeders), the individuals who contribute their genes to the next generation, is smaller than the total population. Only their genes are passed on and, since this is a random sampling process, the frequency of these genes will differ from that of the previous generation. The smaller Ne, the larger the drift. This effect accumulates with each successive generation and separates the diverging subpopulations as time passes.

  • Migration. Exchanges between the populations as they diverge will limit the drift, keeping them similar. The larger the proportion of migrants "m", the higher its impact on stability.
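The "smaller Ne, larger drift" relationship can be illustrated with the textbook expectation for an idealized Wright-Fisher population, in which diversity shrinks by a factor of (1 - 1/k) per generation, k being the number of gene copies in the breeding pool. The Python sketch below assumes constant size, random mating, and no mutation, selection or migration, which are precisely the kind of simplifications this post is questioning. The pool sizes and the starting diversity are invented.

```python
def expected_heterozygosity(h0: float, gene_copies: int, generations: int) -> float:
    """Expected diversity left after drift alone in an idealized population.

    gene_copies is the number of gene copies among the effective breeders
    (for a Y chromosome, roughly the number of breeding males).
    """
    if gene_copies < 1:
        raise ValueError("gene_copies must be at least 1")
    if generations < 0:
        raise ValueError("generations cannot be negative")
    return h0 * (1.0 - 1.0 / gene_copies) ** generations


# Hypothetical comparison: a small band versus a large population,
# both starting with the same diversity, after 100 generations.
for pool in (50, 5000):
    print(pool, round(expected_heterozygosity(0.8, pool, 100), 3))
```

Migration is not modelled here at all; adding an exchange of migrants between two such pools would pull their frequencies back together and slow the divergence.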

Comment on drift and lack of migration: The "Beringian Standstill" was invented to justify the strangely unique American haplogroups, completely absent in the purported Asian homeland of the Native Americans.
The Standstill theory first suggested by Bonatto & Salzano (1997) and perfected by Tamm et al., 2007, is based on a one-in-a-million "Founding effect" that isolates a group of "founding fathers" in Beringia, cut off from their Asian relatives and from the vast empty Americas by ice sheets for about 15,000 years. During this long period of time they mutated their Asian mtDNA and NRY haplogroups into new ones and then in a quick wave covered America swiftly so as not to allow new diversity to arise.
Furthermore, their Asian relatives all died off, leaving no trace on the Asian side of Beringia.
Yes, I know it sounds improbable, yet even though the odds are against this kind of event, several papers apart from Tamm et al, support the theory.


But let's get back to the Stepwise Mutation Model: it is quite weak, to put it mildly.


Summary: SMM is unreliable


All these phenomena make the SMM a very rough approximation to reality, yet it is used as if it were 100% reliable!


Just one example of this lack of reliability is the quote below (Nebel A., et al., 2001) [7]:


"the behaviour of DYS388 appears to be inconsistent with the SMM, as was shown in two populations of Middle Eastern origin. Additionally, another widely used microsatellite, DYS392, has recently been demonstrated to deviate from the SMM" [7]


The Clock that ticks out of time


I have posted on the useless genetic clocks in the past, so I will not bore you, just highlight my previous objections to clocks:


Divergence from Chimps. Scientists devise clocks to calculate mutation rates. To do so we estimate the divergence dates of the human line from the chimpanzee line. But the date of this event is uncertain, and has been increasing since the 1970s from an estimated 5 Mya to 6 - 7 Mya, and in June 2014, to 13 Mya [9], this recent change should surely impact on the dating of human origins!


Assumptions are also made regarding the duration of a generation (what can we know about how long a generation was 50 kya? Did females mature earlier or later? What about males? Was it 27 years or 35?). Population sizes and their trends (expansion, migration, admixture with other groups as well as bottlenecks and founder effects) should also be factored in.


When those clocks are calibrated against real mutation rates measured in families (again a discrete sample) over the last few hundred years, strong discrepancies arise: these family (pedigree) mutation rates usually differ from the former (evolutionary) ones. But these differences remain unexplained in the papers. They merely show both figures but avoid explaining the causes (e.g. that the mutational clock does not tick at a regular pace).


The mutation rates are also calibrated against the estimated dates for the peopling of certain regions based on the information provided by archaeology, e.g. 40 kya for Australia or 17-20 kya for America. However, just by looking at the published error margins we can see the uncertainty involved in these calculations.


Diversity is taken as an indicator of antiquity, so if Region X has a large variety of haplotypes while Region Z has fewer, the population in Z is assumed to be younger. But what actually happens is that if the people in "Z" are a subset of the population from Region "X", the very fact that they are a subset means that they will have less diversity than the original group. This does not mean that they are more recent; it means that they left certain genes behind. Add to this the pressure of natural selection (and chance, i.e. genetic drift) and the genes of certain individuals within the subset at "Z" will get lost too. So if we measure "X" against "Z" by their diversity we would incorrectly judge "Z" to be more recent, when they are really just as ancient as "X".


Frequency, Migrations and antiquity


Often the current distributions of haplogroups occur at differing frequencies in certain territories. What does this mean? That the less frequent "A" hg is a recent arrival of a small group carrying it, entering the territory of the prevailing "B" hg.? Did "A" exist in the same population as "B", but in very low frequencies, and those have been maintained or even decreased?


Or is "A" an ancient colonizer that suffered attrition over thousands of years and has been gradually losing ground to better equipped newcomers with hg. "B"? Maybe "A" and "B" were found in equal proportions in the original colonizers but "B" grew due to genetic drift or natural selection...


Questions like those are seldom asked or answered in mainstream papers. It is clear that the choice of the correct answer requires an in depth analysis which is not found in the academic literature (I have read tens of papers and these matters are not even addressed).


The low diversity among Amerindians is invariably attributed to a founder effect or a bottleneck during the peopling of America. The massive death of millions of Natives (virtually a genocide) during the discovery and conquest of the New World between 1492 and 1560 is ignored. Disease and war acted selectively, wiping out tribes without leaving a trace of them, but this issue is simply ignored and the "lack of diversity" is assumed to be due to the original peopling event some 15 kya.


Dogma

The unidirectional migratory route from Africa to the World has some inconsistencies which can only be explained by back-migrations. These "into Africa" migrations are reluctantly accepted by orthodoxy but, fortunately, are gradually altering the OoA (Out of Africa) picture with a more parsimonious explanation. Sometimes I get the feeling that OoA is supported because it is politically correct and assuages the guilt complex of the Western world for the tragic crimes of slavery and colonialism perpetrated against Africa.


Two issues requiring a serious review are: the East Siberian void of putative ancestors to the Amerindians, which remains unexplained, and the "Beringian standstill" justification for Amerindian uniqueness, which also requires a critical analysis due to its improbability.


Closing comments


What I have tried to express in today's post is that there are many assumptions underlying the "facts" expressed by mainstream geneticists regarding human diversity and evolution.


Models are simple representations of reality and not reality itself. They should be taken as such and not as truth written in stone.


Algorithms and simulations run on models are only as reliable as the models they are based on. And we have seen the flaws in some of these models and programs: flaws that introduce errors in their output yet are not explicitly mentioned in the papers that base themselves on them.


The complexities in the statistical assumptions mentioned in papers (those pages or paragraphs, full of equations that you skip when reading a paper) mask some very evident biases that skew the results and produce patterns that do not correctly reflect reality, which is richer and much more varied than what these papers show us.


Sources


[1] Esra Ruzgar and Kayhan Erciyes, Phylogenetic Tree Construction for Y-DNA, Haplogroups.
[2] Amke Caliebe et al., (2010). A Markov chain description of the stepwise mutation model: Local and global behaviour of the allele process. Journal of Theoretical Biology 266 (2010) 336–342
[3] Peter Calabrese and Raazesh Sainudiin, (2004) Models of Microsatellite Evolution
[4] Yan Li, Phycs498BIO Assignment 2, How to Build a Phylogenetic Tree
[5] Valdes, Ana M., Slatkin M. and Freimer N., (1993). Allele Frequencies at Microsatellite Loci: The Stepwise Mutation Model Revisited. Genetics 133: 737-749 March 1993
[6] Boris Malyarchuk, et al., (2010). Phylogeography of the Y-chromosome haplogroup C in northern Eurasia. Annals of Human Genetics (2010) 00,1–8 doi: 10.1111/j.1469-1809.2010.00601.x
[7] Nebel A., et al.,(2001). Haplogroup-specific deviation from the stepwise mutation model at the microsatellite loci DYS388 and DYS392. Eur J Hum Genet. 2001 Jan;9(1):22-6
[8] Ian Wilson, David Balding and Mike Weale, (2003), Batwing User Guide. See pt. 1.2.
[9] Oliver Venn et al., (2014). Strong male bias drives germline mutation in chimpanzees. Science 13 June 2014: Vol. 344 no. 6189 pp. 1272-1275 DOI: 10.1126/science.344.6189.1272
[10] www.1000genomes.org.
[11] Lachance J, Tishkoff SA. et al., (2013). SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it. Bioessays. 2013 Sep;35(9):780-6. doi: 10.1002/bies.201300014. Epub 2013 Jul 9.


Patagonian Monsters - Cryptozoology, Myths & legends in Patagonia Copyright 2009-2014 by Austin Whittall © 
