Lecture Notes on Gene Genealogies
Alan R. Rogers
February 6, 2023
©2009, 2010, 2013 Alan R. Rogers. Anyone is allowed to make verbatim copies of this document and also to
distribute such copies to other people, provided that this copyright notice is included without modification.
Contents
1 Descriptive Statistics for DNA Sequences
1.1 DNA sequence data
1.2 Statistics
1.3 Data analysis
2 The Method of Maximum Likelihood
2.1 Maximum likelihood exercises with genetics problems
3 Genetic Drift
3.1 The four causes of evolutionary change
3.2 What is genetic drift?
3.3 The Wright-Fisher model
3.4 Classical theory of homozygosity and heterozygosity
4 Gene Genealogies
4.1 Preliminaries
4.2 Coalescence time in a sample of two genes
4.3 Coalescence times in a sample of K genes
4.4 The depth of a gene tree
4.A A more detailed treatment (optional)
4.B The mean of an exponential random variable (optional)
5 Relating Gene Genealogies to Genetics
5.1 The number of mutations on a gene genealogy
5.2 The model of infinite sites
5.3 The number of segregating sites
5.4 The mean pairwise difference
5.5 Theta and Two Ways to Estimate It
5.6 Example
5.A The probability that a nucleotide site is polymorphic within a sample
5.B When you assume the model of infinite sites, how wrong are you likely to be? (optional)
6 The Site Frequency Spectrum
6.1 The empirical site frequency spectrum
6.2 The expected spectrum under neutrality and constant population size
6.3 Human site frequency spectra
6.4 Exercises
7 The Mismatch Distribution
7.1 The observed mismatch distribution
7.2 The expected mismatch distribution under neutral evolution with constant population size
7.3 Coalescent theory in a population of varying size
7.4 The coalescent as an algorithm for computer simulations
7.5 Stepwise models of population history
7.6 Simulations of stationary populations
7.7 Simulations of expanded populations
7.A Point estimators for expanded populations (optional)
7.B Statistical properties of point estimates (optional)
A Mean, Variance and Covariance
A.1 The mean
A.2 Variance
A.3 Covariances
B Answers to Exercises