LARGE SCALE ANALYSIS OF BIOMEDICAL DATA
Data-mining workflow
Machine learning and data visualisation
Research data management
Real world data
Generative AI
Encryption
DATA-MINING WORKFLOW
FROM QUESTION TO DATA
DEFINE PROJECT AIM – QUESTION – HYPOTHESIS – OBJECTIVES
Goal: broad – long-term outcome – vision – impact
Broad and visionary
Aim: purpose – overall objective – research aim
Focused and general
Research question: central scientific question
Precise and interrogative
Hypothesis: testable statement – predicting relationship between variables
Predictive and testable
Objectives: specific measurable steps
Concrete and actionable
o Define which proteins are differentially expressed in healthy versus diseased tissues
o Identify the regulatory pathways that are affected upon drug treatment in cell lines
o Determine whether treatment A results in more pronounced tumour shrinkage in mice
compared to conventional therapies
o Compare the blood cell counts in patient group 1 versus patient group 2
! exploratory research: tentative – little is known yet
descriptive research: conclusive – explore and explain a situation
EXPERIMENTAL AND STUDY DESIGN
= how you organise your experiment and generate the data to test an
a priori defined hypothesis or answer the biological question of interest
FACTORS OF INTEREST
What experiments will you set up – what samples/material will you analyse – collection
e.g. concentrations of compound
Prospective versus retrospective
Prospective: watches for outcome + relates to other factors
o Take a cohort of subjects
o Watch over a long period
o Minimise bias and loss to follow-up
! mostly cohort studies
- Outcome is measured after exposure/test
- Yields true incidence and relative risks
- May uncover unanticipated associations
- Best for common outcomes
- Takes a long time to complete
- Prone to attrition bias
- Prone to the bias of change in methods over time
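The point that a prospective cohort yields true incidence and relative risks can be made concrete with a small calculation. The sketch below is illustrative only: the function name `relative_risk` and all counts are hypothetical and not taken from these notes.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (incidence_exposed, incidence_unexposed, relative_risk) for a cohort."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    return incidence_exposed, incidence_unexposed, incidence_exposed / incidence_unexposed


# Hypothetical cohort: 100 exposed and 100 unexposed subjects followed over time.
inc_exp, inc_unexp, rr = relative_risk(30, 100, 10, 100)
print(f"incidence exposed={inc_exp:.2f}, unexposed={inc_unexp:.2f}, RR={rr:.2f}")
```

Incidence can be computed here because the whole cohort is followed forward from exposure, so the denominators are the true numbers at risk.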
Retrospective: looks backwards
+ examine exposure to risk or protection factors
o minimise bias and confounding
! mostly case-control
- outcome is measured before exposure/test
- controls: selected on not having the outcome
- good for rare outcomes
- quicker to complete
- prone to selection bias
- prone to recall/retrospective bias
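Because a case-control study fixes how many cases and controls are sampled, incidence cannot be read off the data; the odds ratio is the measure usually reported instead. These notes do not name it, so treat the following as a supplementary sketch with a hypothetical function name and made-up counts.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls


# Hypothetical case-control sample: 100 cases and 100 controls.
or_estimate = odds_ratio(40, 60, 20, 80)
print(f"odds ratio = {or_estimate:.2f}")
```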
CONFOUNDING
= factors that influence the result but are not of interest themselves
e.g. layout of 96-well plate – organisation of mice in cages – batches of materials used
Batches: performed on different days – by different people – different reagents – different location
! not all batch effects are confounding: some only add random noise
Result: inability to distinguish the effect of one factor (interesting) from the effect of another (confounding)
Severity
o Complete confounding: impossible to fix after the experiment
o Incomplete confounding: work around it in the analysis – but statistical power suffers
! dependent on the effect of the confounding factor
Detection
o Possible warning sign: unexpectedly good separation between groups
o Visualise factors in experiment: replicates next to each other (instead of underneath)
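One way to inspect a design for confounding is to cross-tabulate the factor of interest against batch in the sample metadata. The sketch below is a simplified illustration: the function `confounding_status`, its three labels, and the rule used for "complete" (every batch contains only one condition) are assumptions made for this example, not definitions from these notes.

```python
from collections import Counter


def confounding_status(samples):
    """Classify a design from (condition, batch) pairs as complete, partial or balanced."""
    counts = Counter(samples)
    conditions = sorted({condition for condition, _ in counts})
    batches = sorted({batch for _, batch in counts})
    # Which conditions were actually processed in each batch?
    conditions_per_batch = {
        b: {c for c in conditions if counts[(c, b)] > 0} for b in batches
    }
    # Assumed rule: if no batch mixes conditions, batch and condition cannot be separated.
    if len(conditions) > 1 and all(len(cs) == 1 for cs in conditions_per_batch.values()):
        return "complete"
    # Every condition appears equally often in every batch.
    if len({counts[(c, b)] for c in conditions for b in batches}) == 1:
        return "balanced"
    return "partial"


# Hypothetical metadata: (condition, processing day)
complete = [("treated", "day1")] * 3 + [("control", "day2")] * 3
balanced = [("treated", "day1"), ("control", "day1"), ("treated", "day2"), ("control", "day2")]
partial = [("treated", "day1")] * 2 + [("control", "day1")] + [("treated", "day2")] + [("control", "day2")] * 2

for name, design in [("complete", complete), ("balanced", balanced), ("partial", partial)]:
    print(name, "->", confounding_status(design))
```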
Solution
o Avoid confounding during planning phase
- Exclude nuisance factors if possible
- Balance biological factors if possible
- Randomise if possible and relevant
o Include batch information in experimental metadata
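The planning-phase advice above (balance, randomise, record the batch) can be sketched as a blocked randomisation: samples are shuffled within each condition and then dealt evenly over the batches. The function `assign_batches`, the sample names and the batch count are hypothetical; this is one possible scheme, not the only valid one.

```python
import random
from collections import Counter


def assign_batches(samples_by_condition, n_batches, seed=0):
    """Shuffle samples within each condition, then deal them round-robin over batches."""
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    assignment = {}
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        for i, sample in enumerate(shuffled):
            # Store the batch next to the condition so it ends up in the metadata.
            assignment[sample] = (condition, i % n_batches)
    return assignment


# Hypothetical experiment: 6 treated and 6 control samples, 3 processing batches.
design = {
    "treated": [f"T{i}" for i in range(1, 7)],
    "control": [f"C{i}" for i in range(1, 7)],
}
assignment = assign_batches(design, n_batches=3)
layout = Counter(assignment.values())

for (condition, batch), n in sorted(layout.items()):
    print(f"batch {batch}: {n} x {condition}")
```

Which sample lands in which batch depends on the seed, but every batch always receives the same number of samples from each condition, so batch is not confounded with treatment.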
SAMPLE NUMBERS – REPLICATES
Replicates
Types
o Genuine replicate: increases sample size N
Biological replicate: often but not always equivalent to genuine replicate
= use different biological samples of the same condition to measure the biological
variation between samples
o Pseudoreplicate: does not increase sample size N
Technical replicate: often but not always equivalent to pseudoreplicate
= use the same biological sample to repeat the technical or experimental steps in
order to accurately measure technical variation and remove it during analysis
! pseudoreplication happens when observations share some important factor
e.g. same batch of reagents – treatment x all from the same litter – …
! depends on the research question: genuine replication on one level becomes pseudoreplication on a higher level
e.g. learning about a lung cancer cell line: each replicate within the cell line increases N
learning about lung cancer: each replicate within a particular cell line is pseudo
Effect: pseudoreplicates don’t contain the same amount of info as genuine replicates
= treating them as genuine falsely shrinks uncertainty estimates and yields p-values that are too low (spuriously significant)
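The shrinking of uncertainty can be seen with a few made-up numbers. In the sketch below, three hypothetical mice are each measured four times; the data, the names and the choice to average technical replicates per animal are all assumptions for illustration (averaging is one common remedy, not the only one).

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(values):
    """Standard error of the mean, treating every value as independent."""
    return stdev(values) / sqrt(len(values))


# Hypothetical data: 3 mice (biological replicates), 4 technical replicates each.
measurements = {
    "mouse_1": [9.9, 10.1, 9.9, 10.1],
    "mouse_2": [11.9, 12.1, 11.9, 12.1],
    "mouse_3": [13.9, 14.1, 13.9, 14.1],
}

# Naive analysis: pool everything and pretend N = 12.
pooled = [x for reps in measurements.values() for x in reps]
naive_se = standard_error(pooled)

# Analysis at the level of the genuine replicate: one mean per mouse, N = 3.
animal_means = [mean(reps) for reps in measurements.values()]
correct_se = standard_error(animal_means)

print(f"naive   N={len(pooled)}, SE={naive_se:.3f}")
print(f"correct N={len(animal_means)}, SE={correct_se:.3f}")
```

The pooled analysis reports a smaller standard error only because it counts repeated measurements of the same animal as if they were new animals.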