Big Data in Biomedical Science
Week 1
Lecture: what is data
Volume, Variety, Velocity and Value
Big data is getting bigger and cheaper: more data is available and computing power is growing fast, which
makes working with big data attractive. Thus: there is more data, the accumulated data needs to be analysed,
and hypothesis-driven research has its limitations. This is why big data is getting more and more popular.
Week 2
Lecture 3: Genetics
In the past decades there was a rapid change with new technologies, methods, large scale collaborations and
novel disease insights.
There still are issues in human genetics.
- Relative influence of G(enes) and E(nvironment) still under debate.
o We need more reliable estimates + overview across age, sex, population
- Nature of G influence still under debate
o Do they all act in an additive way, or are there non-additive effects?
- Determining causal mechanisms
o This is a challenge to polygenic traits. We still have questions like how do we detect and
interpret the findings?
Twin studies have been widely used to determine the relative influence of genes and environment. Two types of twins are used:
- MZ (monozygotic): share 100% of genes, 100% of shared environment, 0% non-shared environment
- DZ (dizygotic): share average of 50% of genes, 100% of shared environment, 0% of non-shared
environment
Let's say a trait is...
100% heritable and additive: similarity in MZ twins is twice as high as in DZ twins:
o rMZ = 2 × rDZ
100% determined by environmental influences: similarity in MZ and DZ twins is
more or less the same (because their environment was 100% the same):
o rMZ = rDZ
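The two extreme cases above can be checked with a small sketch. It assumes Falconer's formulas for the classical twin design, which these notes do not name explicitly: h² = 2(rMZ - rDZ), c² = 2rDZ - rMZ and e² = 1 - rMZ. The function name falconer and the example correlations are made up for illustration.

```python
def falconer(r_mz: float, r_dz: float) -> dict:
    """Rough variance components from twin correlations (Falconer's formulas).

    Assumes purely additive genetic effects and equal shared environment
    for MZ and DZ twins; this is a sketch, not a full twin model.
    """
    h2 = 2 * (r_mz - r_dz)   # additive genetic influence
    c2 = 2 * r_dz - r_mz     # shared environment
    e2 = 1 - r_mz            # non-shared environment (plus measurement error)
    return {"h2": h2, "c2": c2, "e2": e2}

# Fully heritable, additive trait: MZ correlation is twice the DZ correlation
print(falconer(1.0, 0.5))   # h2 = 1, c2 = 0, e2 = 0

# Trait driven entirely by shared environment: MZ and DZ correlate equally
print(falconer(1.0, 1.0))   # h2 = 0, c2 = 1, e2 = 0
```

Real traits fall between these extremes, and published twin studies usually fit more elaborate models than this back-of-the-envelope calculation.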
Posthuma wanted to do a meta-analysis to combine all the various twin studies done between 1900-2012. They
extracted the sample size (N) and twin correlation (r) for MZ and DZ twins, as well as the estimated influences of
genes (h²) and shared environment (c²). Where possible, they extracted these separately for males/females, four age
groups and populations. They then standardized the trait classification, because studies defined the same
trait (e.g. IQ) in different ways.
Almost 3000 twin studies done from 1958 until 2012.
Reported 17 804 traits
Total sample size was 14.5 million twin pairs (partly dependent)
They combined all of the traits together. The question they wanted to answer was: how heritable is any
measurable trait in a person?
Averaged over all measured traits, the twin correlation was 0.64 in MZ twins and 0.34 in DZ
twins.
For any trait that we can measure, the estimated heritability is almost 50%.
So about half of the differences between individuals on a trait can be attributed to differences in their
genes. Only 17% was explained by shared environment.
Main conclusions:
- All traits are heritable to some extent
- Influence of c (shared environment) is relatively small
- Majority of traits are consistent with a model where all genetic variance is additive.
Heritability
- Proportion of trait variance attributable to genetic variance
- The extent to which observed individual differences can be traced back to genetic differences
- Causes genetically related individuals to correlate on a trait
- Suggests that variations in genes underlie trait differences between individuals
Two important discoveries in genome:
- Structure of DNA (1953)
- Sequencing of the human genome (3×10^9 base pairs) (draft 2001, completed 2003).
DNA facts
23 pairs of chromosomes in each cell
Chromosomes are transmitted from parent to offspring via meiosis
Each single chromosome is a DNA molecule
A DNA molecule consists of nucleotides (A, C, G, T)
DNA is a double helix in which A pairs with T and G pairs with C
Codon: three bases. Via transcription and translation, codons are turned into amino acids.
o Multiple codons code for the same amino acid: the genetic code is robust against errors.
Genes: the codons between a start and a stop site; they provide blueprints for proteins
One chromosome consists of non-genic (90%) and genic regions (10%)
Humans have around 22-24 thousand genes
Not every gene is expressed in a cell
The specific set of genes expressed in a cell determines its cell type
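The robustness that comes from several codons coding for the same amino acid can be illustrated with a small sketch. The dictionary below is only a partial excerpt of the standard genetic code (written as DNA codons), and the helper name is_silent is made up for this example.

```python
# Partial excerpt of the standard genetic code (DNA coding-strand codons).
CODON_TO_AA = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "CCT": "Pro", "CCC": "Pro", "CCA": "Pro", "CCG": "Pro",
    "ATG": "Met", "TGG": "Trp",
}

def is_silent(codon_before: str, codon_after: str) -> bool:
    """True if the change leaves the encoded amino acid unchanged."""
    return CODON_TO_AA[codon_before] == CODON_TO_AA[codon_after]

# Change in the third base: both codons still encode leucine
print(is_silent("CTT", "CTC"))

# Change in the second base: leucine becomes proline
print(is_silent("CTT", "CCT"))
```

Changes in the third base of a codon are often silent, while changes in the first or second base usually alter the amino acid.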
We share 87.5% of our DNA with a mouse, 99% with a chimpanzee and 99.9% with another human. A million sites
in our DNA differ between individuals, which results in phenotypic differences.
Genetic variants such as SNPs (single-nucleotide polymorphisms) can occur:
o In gene: protein coding, regulatory region, exonic, intronic
o Outside genes: regulatory or of unknown function.
They can be:
o Harmless (small, no harmful change in phenotype)
o Harmful
o Latent: dependent on another factor
o Silent
Causes of variations:
o Mutation: level of base pairs
o Recombination: level of parts of the chromosome
o Segregation: level of combination of chromosomes
Monogenic vs polygenic disorders
Monogenic: influenced by one gene. Most genetic causes are already known.
Polygenic disorders/traits: influenced by more than one gene, of which the causes are mostly unknown. They are
often very complex, because they are caused by multiple genetic and environmental factors with
possible interactions.
Why do we want to find genetic variants linked to disease?
- Novel biological insights → clinical advances: therapeutic targets, biomarkers, prevention
- Improved measures of individual aetiological processes → personalized medicine: diagnostics,
prognostics, therapeutic optimization.
Association study design: we have a control group in which 5% carry variant X, and a case group in which 21%
carry variant X. The variant is more frequent in cases, so it is associated with the disease.
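As a rough sketch of how such a difference could be quantified, the example below computes an odds ratio and a 1-df chi-square test on a 2x2 table. The group sizes (1000 cases, 1000 controls) are hypothetical; the notes only give the percentages. The function name two_by_two is also made up.

```python
import math

def two_by_two(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Odds ratio and 1-df chi-square test for a 2x2 carrier table."""
    a, b = case_carriers, case_total - case_carriers   # cases: carriers, non-carriers
    c, d = ctrl_carriers, ctrl_total - ctrl_carriers   # controls: carriers, non-carriers
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return odds_ratio, chi2, p_value

# Hypothetical group sizes: 1000 cases (21% carriers), 1000 controls (5% carriers)
odds_ratio, chi2, p_value = two_by_two(210, 1000, 50, 1000)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.1f}, p = {p_value:.1e}")
```

With these made-up counts the odds of carrying variant X are about five times higher in cases than in controls. Real variants for polygenic traits show far smaller effects.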
Candidate gene studies (until 1990): preselect several genes based upon knowledge and convenience. Then you
test for association with a trait.
Genome Wide Association Studies:
Microarrays can contain more than 1 million tagging SNPs covering the genome in high density.
Strategy:
- Genotype a large set of individuals (cases + controls) on ~1 million SNPs.
o We don’t need to genotype all 3×10^9 base pairs, since nearby SNPs are often correlated
- For each SNP: compare allele frequency across cases and controls, conduct a statistical test for a
difference in frequency.
The first outcome of GWAS is a Manhattan plot.
- Every dot: association test of one single SNP in a genome
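The per-SNP strategy can be sketched as a loop over SNPs, each with its own allele-count test. The allele counts and SNP names below are invented for illustration. The cutoff of 5e-8 is the conventional genome-wide significance threshold, which is an addition here and not taken from the lecture. The value -log10(p) is what a Manhattan plot puts on its y-axis.

```python
import math

def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of a 1-df chi-square test on allele counts (cases vs controls)."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref)
snps = {
    "snp_A": (1200, 2800, 1000, 3000),   # moderate frequency difference
    "snp_B": (1010, 2990, 1000, 3000),   # almost no difference
    "snp_C": (1500, 2500, 1000, 3000),   # large frequency difference
}

GENOME_WIDE = 5e-8   # conventional threshold (assumption, not from the notes)

for name, counts in snps.items():
    p = allelic_test(*counts)
    # One dot per SNP: height in the Manhattan plot is -log10(p)
    print(name, round(-math.log10(p), 1), "significant" if p < GENOME_WIDE else "not significant")
```

In this toy data only snp_C passes the threshold. snp_A shows a clear frequency difference but would not count as a hit at this sample size.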
Advantages:
- May identify several possible loci, as it spans the whole genome
- Relationships between loci may identify new biological pathways
- Results from multiple studies can be integrated, aiding prioritization of genes for replication and
increasing statistical confidence
Disadvantages
- Increased likelihood of false positives
- Population stratification
- Large number of samples needed
- Vast amounts of data are analysed and produced, requiring computer clusters.
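The false-positive problem follows from simple arithmetic. The sketch below assumes about 1 million tests that are treated as independent and a per-test significance level of 0.05. The Bonferroni correction used here is one common way to compensate and is an addition to these notes.

```python
ALPHA = 0.05            # per-test significance level
N_TESTS = 1_000_000     # roughly one test per genotyped SNP

# If no SNP were truly associated, this many would still pass p < 0.05 by chance
expected_false_positives = ALPHA * N_TESTS

# Bonferroni: divide the significance level by the number of tests
bonferroni_threshold = ALPHA / N_TESTS

print(f"expected false positives at p < {ALPHA}: {expected_false_positives:.0f}")
print(f"Bonferroni-corrected threshold: {bonferroni_threshold:.1e}")
```

Such a strict threshold is one reason why GWAS needs very large samples to detect small effects.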
The initial sample sizes for GWAS were 5 to 10 thousand individuals, whereas hundreds of thousands or
millions are needed to detect genetic variants with very small effects.
Schizophrenia:
- N= 3300 cases/3500 controls: 2 risk genes found in whole genome
- N= 9000 cases/12 000 controls: 7 risk genes found in whole genome
- N= 14 000 cases/18 000 controls: 22 risk genes found in whole genome
- N= 40 000 cases/50 000 controls: 131 risk genes found (currently the largest GWAS of schizophrenia)
o However: these genes explain <3% of the liability of schizophrenia
o However: the risk associated with 8300 genetic variants explains ~32% of liability; this includes
variants tested in the study that were not statistically significant
Huge sample sizes were needed and have now been reached. GWAS results have proven very reliable, and many
genes have now been discovered. The GWAS of their own group included 1 million individuals.
Many complex traits are highly heritable. GWAS has detected some genetic variants but, as noted, these explain
only a very small portion of the total genetic variance (<2%).
- Effect sizes are small, so large samples are needed
- No functional information is obtained: most findings lie outside genes.
The majority of human complex traits are probably caused by thousands of genes of very small effect.
How do we biologically interpret GWAS results, i.e. gain mechanistic insight?
- GWAS provides associated variants and genes, but not the function of the variant
- Current GWAS results are difficult to interpret, complicating the formulation of hypotheses that can be
tested in functional experiments
There are 4 issues with GWAS hits for polygenic traits:
1. They are mostly outside genes or in non-coding genic regions, with likely regulatory functions that are
currently unknown
2. They have small effects
3. SNPs are correlated, which complicates pinpointing the causal SNP
4. There are 100s of genes involved in polygenic traits: a single gene will not provide the whole picture
What about the SNPs that are correlated?
The first output of a GWAS is a Manhattan plot. A locus is a peak in the plot. When we zoom in on a locus, there are
multiple SNPs in that locus that are all statistically significant. The top SNP is shown as a diamond in the plot, but
the most significant SNP is not necessarily the causal one. The colour of the other SNPs denotes how strongly they
are correlated with the top SNP. SNPs located close to the top SNP have correlated genotypes and therefore also
show very small p-values. In the area there are many genes. If there had been only one, it would have been easier
to say which gene is associated with the phenotype. However, this is not the case: because multiple genes and SNPs
are found in the locus, it is hard to tell which one is most likely to cause the phenotype.
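The correlation between SNPs that colours such a locus plot can be sketched as the squared Pearson correlation (r²) between genotype vectors, with genotypes coded as 0, 1 or 2 copies of an allele. The genotypes of the eight individuals below are invented, and the names top_snp, near_snp and far_snp are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)

# Hypothetical genotypes of eight individuals at three SNPs
top_snp  = [0, 1, 2, 1, 0, 2, 1, 0]
near_snp = [0, 1, 2, 1, 0, 2, 1, 1]   # differs from the top SNP in one individual
far_snp  = [2, 0, 1, 1, 0, 0, 2, 1]   # largely unrelated to the top SNP

print("near:", round(r_squared(top_snp, near_snp), 2))
print("far: ", round(r_squared(top_snp, far_snp), 2))
```

A SNP with high r² to the top SNP carries almost the same association signal, which is why statistics alone cannot tell which of the correlated SNPs is causal.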