LARGE SCALE ANALYSIS OF
BIOMEDICAL DATA
1. LESSON 1
1.1 DEFINE GOALS AND OBJECTIVES
Data science process
- Collection
- Cleaning
- Exploratory data analysis
- Model building
- Model deployment
GOALS
e.g.
- Development and implementation of protein-based assay to diagnose a specific cancer type
- Evaluation of a novel drug to treat colon cancer
- Prediction of cardiovascular disease based on blood counts and genomic data
(DATA-MINING) OBJECTIVE
Focus on the data-mining objective: a specific goal or question that you want to answer using
data-mining techniques.
e.g.
- Which proteins are differentially expressed in healthy vs diseased tissue?
- Which regulatory pathways are affected upon drug treatment in cell lines?
- Does treatment A result in more pronounced tumor shrinkage in mice compared to conventional
therapy?
- Comparative analysis of the blood cell count in two patient groups
SMART:
- Specific = who and what
- Measurable = by how much
- Achievable = how
- Relevant = why
- Time-bound = when
Know which type of research you will perform: exploratory vs descriptive
- Exploratory: look for differences between, e.g., healthy and unhealthy tissue
- Descriptive: you know specifically what you want to check
Then you can start getting your data.
1.2 EXPERIMENTAL/STUDY DESIGN: SAMPLE NUMBERS, CONFOUNDING, …
1. How will your study be designed?
Garbage in = garbage out
(Start with which kind of data you need, and what kind and how many samples you need.)
Data type -> samples -> experiment
2. What datatype is needed to answer the question?
Clinical data
Imaging data
Transcriptomics, genomics data
Flow cytometry
Proteomics
3. Will you generate your own data or repurpose published data?
Sometimes a database is already available and you don't need to go to the lab.
A lot of data is available online to test your hypotheses = repurposing data
4. What sample will be used?
Prospective
A prospective study watches for outcomes, such as the development of a disease, during the study
period and relates this to other factors such as suspected risk or protection factor(s). The study
usually involves taking a cohort of subjects and watching them over a long period. The outcome of
interest should be common; otherwise, the number of outcomes observed will be too small to be
statistically meaningful (indistinguishable from those that may have arisen by chance). All efforts
should be made to avoid sources of bias such as the loss of individuals to follow-up during the study.
Prospective studies usually have fewer potential sources of bias and confounding than retrospective
studies. Prospective investigation is required to make precise estimates of either the incidence of an
outcome or the relative risk of an outcome based on exposure.
Retrospective
A retrospective study looks backwards and examines exposures to suspected risk or protection factors
in relation to an outcome that is established at the start of the study. Many valuable case-control
studies were retrospective investigations. Most sources of error due to confounding and bias are more
common in retrospective studies than in prospective studies. For this reason, retrospective
investigations are often criticized. If the outcome of interest is uncommon, however, the size of
prospective investigation required to estimate relative risk is often too large to be feasible. In
retrospective studies the odds ratio provides an estimate of relative risk. You should take special care
to avoid sources of bias and confounding in retrospective studies.
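The relation between the odds ratio and the relative risk can be illustrated with a small 2x2 table. This is a minimal sketch: the function names and the counts are made up for illustration and do not come from the lecture. It shows that for a rare outcome the odds ratio is close to the relative risk.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: exposed with outcome,   b: exposed without outcome
    c: unexposed with outcome, d: unexposed without outcome
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk; only meaningful when the rows are true cohorts."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical rare outcome: 1% of exposed vs 0.5% of unexposed are affected.
table = dict(a=10, b=990, c=5, d=995)
rr = relative_risk(**table)
oratio = odds_ratio(**table)
print(round(rr, 3), round(oratio, 3))
```

In a case-control study the investigator fixes the number of cases and controls, so the row totals are not true cohorts and only the odds ratio can be computed directly.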
Case-control studies
Case-control studies are usually but not exclusively retrospective. The following notes relate
case-control to cohort studies:
- outcome is measured before exposure/biomarker test
- controls are selected on the basis of not having the outcome
- good for rare outcomes
- quicker to complete
- prone to selection bias
- prone to recall/retrospective bias (participants do not remember previous events)
Cohort studies
Cohort studies are usually but not exclusively prospective. The following notes relate cohort to
case-control studies:
- outcome is measured after exposure/biomarker test
- yields true incidence rates and relative risks
- may uncover unanticipated associations with outcome
- best for common outcomes
- takes a long time to complete
- prone to attrition bias (unequal loss of participants)
- prone to the bias of change in methods over time
WHAT IS EXPERIMENTAL DESIGN? HOW YOU ORGANISE YOUR EXPERIMENT AND GENERATE
THE DATA TO LEARN ABOUT AN A PRIORI DEFINED HYPOTHESIS OR ANSWER THE BIOLOGICAL
QUESTION OF INTEREST
Define the factors you are interested in
- different treatments or combinations to test
- concentrations of a compound
Be aware of confounding factors: they influence the result, but you are not interested in them.
Confounding = the inability to distinguish the effect of one factor from the effect of another:
interesting vs nuisance
Effects can be biological or technical:
- batch (tissue from a different city = different moment, hospital, machine)
- plate effect
Example: imagine a study investigating the effect of a new drug on blood pressure. If a
confounding variable (like age) is not controlled for, older patients, who may have higher blood
pressure and be more likely to be prescribed the drug, might skew the results, making the drug
appear more effective or less effective than it actually is.
Example: see slide 25
- layout of a 96-well plate
- organisation of mice in cages
- batches of materials used
Batches:
How to know whether you have batches, e.g. in an RNA-seq experiment:
- Were all RNA isolations performed on the same day?
- Were all library preparations performed on the same day?
- Did the same person perform the RNA isolation/library preparation for all samples?
- Did you use the same reagents for all samples?
- Did you perform the RNA isolation/library preparation in the same location?
If any of the answers is no, you have batches.
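The checklist above can be run against the sample metadata. The sketch below assumes each sample is a record with one field per processing step; the field names (`isolation_date`, `operator`, `reagent_lot`) and the function name are hypothetical. Any field with more than one distinct value marks a batch variable.

```python
def find_batch_variables(samples, processing_keys):
    """Return the processing variables that differ between samples.

    samples: list of dicts, one per sample (hypothetical metadata records).
    processing_keys: the metadata fields to check, e.g. isolation date.
    """
    varying = {}
    for key in processing_keys:
        values = {sample[key] for sample in samples}
        if len(values) > 1:  # more than one value means more than one batch
            varying[key] = sorted(values)
    return varying


samples = [
    {"id": "s1", "isolation_date": "day1", "operator": "A", "reagent_lot": "L1"},
    {"id": "s2", "isolation_date": "day1", "operator": "A", "reagent_lot": "L1"},
    {"id": "s3", "isolation_date": "day2", "operator": "A", "reagent_lot": "L1"},
]
keys = ["isolation_date", "operator", "reagent_lot"]
batches = find_batch_variables(samples, keys)
print(batches)  # {'isolation_date': ['day1', 'day2']}
```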
How bad is confounding?
Complete confounding = impossible to fix after the experiment
This occurs when the association between the exposure and the outcome is completely distorted
by the confounder. In other words, if the confounder is not controlled for, there appears to be a
strong relationship between the exposure and the outcome that is in reality entirely due to the
confounder.
With incomplete confounding you can try to work around it in the analysis, but statistical power often suffers.
This occurs when the confounder distorts the association between the exposure and the outcome,
but not completely. The confounder has an influence, but part of the association is still not
explained by the confounder.
Some confounding is worse than others; it depends on the effect of the confounding factor.
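The difference between complete and incomplete confounding can be checked from the design alone, before any measurement is taken. The sketch below is an illustration under one assumption: it treats a design as completely confounded when every batch contains a single condition, and as balanced when every batch has the same mix of conditions. The function name and the three labels are made up for this example.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def confounding_status(conditions, batches):
    """Classify how strongly `conditions` is confounded with `batches`.

    Both arguments are equal-length lists with one entry per sample.
    """
    per_batch = defaultdict(Counter)
    for condition, batch in zip(conditions, batches):
        per_batch[batch][condition] += 1

    # Complete: every batch holds a single condition, so a batch difference
    # and a condition difference cannot be told apart.
    if all(len(counts) == 1 for counts in per_batch.values()):
        return "complete"

    # Balanced: every batch has the same proportion of each condition.
    levels = sorted(set(conditions))
    proportions = {
        tuple(Fraction(counts[level], sum(counts.values())) for level in levels)
        for counts in per_batch.values()
    }
    return "balanced" if len(proportions) == 1 else "partial"


print(confounding_status(["c", "c", "t", "t"], [1, 1, 2, 2]))  # complete
print(confounding_status(["c", "t", "c", "t"], [1, 1, 2, 2]))  # balanced
print(confounding_status(["c", "c", "c", "t", "c", "t", "t", "t"],
                         [1, 1, 1, 1, 2, 2, 2, 2]))            # partial
```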
How to find confounding
- Be wary of unexpectedly good separation between groups
- Make plots to visualize all factors in the experiment
Slide 30: design 2 is better than design 1. With design 1, if you see a difference, you don't
know whether it is because of the treatment or the line.
Solution
It is easiest to avoid confounding in the design:
o Exclude nuisance factors if possible, balance biological factors if possible + randomise if
possible
o Full randomization = a lot of work, but an acceptable compromise often exists
o Think well about what you are going to do
To avoid confounding in a mouse experiment:
- Ensure the animals in each condition are all the same sex, age, litter and batch
- If that is not possible, split the animals equally between conditions
Avoid batch effects
Design the experiment in a way that avoids batches, if possible.
If you are unable to avoid batches:
o Do not confound your experiment by batch.
o Do split replicates of the different sample groups across batches; the more replicates the
better (the effect of the treatment must not be confused with the batch).
o Do include batch information in your experimental metadata; this helps later in the analysis.
If the design is not confounded and we have that information, we can regress out the variation
due to batch during the analysis so that it doesn't affect our result.
o Important = randomization
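Why recorded batch information helps can be shown with a toy data set. The sketch below does not fit a regression model; it uses a simpler stand-in, comparing treated and control within each batch and averaging those differences. All names and values are hypothetical: the data are constructed with a treatment effect of +2 and a batch effect of +10, and the groups are split unequally over the batches.

```python
from collections import defaultdict
from statistics import mean


def naive_effect(records):
    """Treated mean minus control mean, ignoring batch."""
    treated = [r["value"] for r in records if r["group"] == "treated"]
    control = [r["value"] for r in records if r["group"] == "control"]
    return mean(treated) - mean(control)


def batch_adjusted_effect(records):
    """Average of the treated-minus-control differences computed per batch."""
    by_batch = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_batch[r["batch"]][r["group"]].append(r["value"])
    diffs = []
    for batch, groups in by_batch.items():
        if not groups["treated"] or not groups["control"]:
            # A batch with only one group is completely confounded.
            raise ValueError(f"batch {batch!r} lacks one group: confounded")
        diffs.append(mean(groups["treated"]) - mean(groups["control"]))
    return mean(diffs)


def rec(batch, group, value):
    return {"batch": batch, "group": group, "value": value}


records = (
    [rec("A", "control", 10.0)] * 3 + [rec("A", "treated", 12.0)]
    + [rec("B", "control", 20.0)] + [rec("B", "treated", 22.0)] * 3
)
print(naive_effect(records))           # 7.0, inflated by the batch effect
print(batch_adjusted_effect(records))  # 2.0, the constructed treatment effect
```

If a batch contains only one group, the within-batch comparison is impossible and the function raises an error, which is the completely confounded case that cannot be fixed in the analysis.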
Three important things to know to avoid batch effects:
- No shuffling of an effect: leads to uncorrectable confounding
- Randomization
- Blocking of an effect: balanced design
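Randomization and blocking can be combined when assigning samples to batches (or wells, or cages). The sketch below is one possible approach, with hypothetical sample names: it shuffles the samples within each condition and then deals them out over the batches in turn, so every batch receives the same number of samples per condition whenever the group sizes divide evenly.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Randomly assign samples to batches, blocking on condition.

    samples: list of (sample_id, condition) tuples.
    Returns a dict mapping sample_id -> batch index (0 .. n_batches - 1).
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_condition = defaultdict(list)
    for sample_id, condition in samples:
        by_condition[condition].append(sample_id)

    assignment = {}
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)  # randomization within each condition
        for position, sample_id in enumerate(ids):
            assignment[sample_id] = position % n_batches  # blocking
    return assignment


samples = [(f"c{i}", "control") for i in range(6)]
samples += [(f"t{i}", "treated") for i in range(6)]
layout = assign_batches(samples, n_batches=3, seed=1)
for batch in range(3):
    members = sorted(s for s, b in layout.items() if b == batch)
    print(batch, members)
```

Recording the seed together with the layout in the experimental metadata makes the assignment reproducible.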
Decide on the type and number of replicates
- genuine replication vs pseudoreplication
- Genuine replicates increase the sample size
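The difference between genuine replication and pseudoreplication shows up when counting the sample size. In the sketch below, repeated measurements on the same animal are averaged into one value, so the sample size equals the number of animals, not the number of measurements. The mouse identifiers and values are hypothetical, and averaging is only one simple way to handle repeated measurements.

```python
from collections import defaultdict
from statistics import mean


def collapse_technical_replicates(measurements):
    """Average repeated measurements per biological unit.

    measurements: list of (unit_id, value) tuples, where unit_id names the
    independent biological unit (here a hypothetical mouse identifier).
    """
    per_unit = defaultdict(list)
    for unit_id, value in measurements:
        per_unit[unit_id].append(value)
    return {unit_id: mean(values) for unit_id, values in per_unit.items()}


measurements = [
    ("mouse1", 4.0), ("mouse1", 4.2), ("mouse1", 3.8),
    ("mouse2", 5.0), ("mouse2", 5.4), ("mouse2", 5.2),
]
per_mouse = collapse_technical_replicates(measurements)
print(len(measurements), "measurements,", len(per_mouse), "genuine replicates")
```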