Data Science in Biomedicine
25/9/23 -- Introduction
Data science – translate large data sets to something you can understand and discuss.
Data sources – patient (e.g. blood pressure, BP), biomedical (electronic recorder data, omics). You can do
personalized health data analysis.
Numerical, textual, categorical, imaging, clinical, genomic data.
Genomic data – DNA, genes, proteins, RNA, SNPs, ncRNA, splice variants, RNA expression
levels (expression of genes).
How is gene expression measured? The data come from next-generation sequencing (NGS),
the successor to Sanger sequencing. Illumina dominates the NGS market. Third-generation
sequencing such as Nanopore: put a drop of sample on the device and it starts sequencing DNA.
How much data is produced with NGS? A lot: 6 Tb (terabases).
How big is the human genome? 3 billion base pairs, i.e. 6 billion bases (double-stranded DNA).
How many protein-coding genes are encoded in the genome? Approx. 20,000. They
encode more proteins than that because splice variants are also involved.
Pipeline to analyze data.
ChatGPT’s Code Interpreter! Renamed to Advanced Data Analysis.
You need to upload your data to use it, but if the data is not published you should not upload it! Be aware.
R – retrieve data from a database, apply statistical analyses, visualize data. If the data is in
the wrong format, reformat it with Python or ChatGPT!
R vs Python -> R is dedicated to statistics, is popular in research and has many good libraries
(genomics, GWAS, transcriptomics). R is more a statistical scripting tool than a
general-purpose programming language. The visualization tools of R are compatible with Python.
Python is easier and much better at handling text files and text-based data files. Both R and
Python are slower than C++. Python has many AI libraries and tools that you can use.
Excel vs R -> in Excel you load data by opening a file or copy-pasting a data table. In R
you load it with a line of code.
You can also log-transform data with log(). This often gives a better view of data that spans several orders of magnitude.
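A minimal sketch of a log transform, written in Python rather than R. The expression values and the function name `log2_transform` are made up for illustration. Note that R's `log()` is the natural logarithm by default; log2 is used here because it is a common choice for expression data, and the pseudocount of 1 is an assumption to avoid taking the log of zero.

```python
import math

# Hypothetical expression values spanning several orders of magnitude.
expression = [0, 9, 99, 999, 9999]


def log2_transform(values, pseudocount=1.0):
    """Return log2(value + pseudocount) for each value.

    The pseudocount (an assumption of this sketch) keeps zeros defined.
    """
    return [math.log2(v + pseudocount) for v in values]


transformed = log2_transform(expression)
for raw, logged in zip(expression, transformed):
    print(f"{raw:>6} -> {logged:.2f}")
```

After the transform the values are spread evenly instead of being dominated by the largest one, which makes plots easier to read.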
In R it is easy to plot a subset!
Multiple graphics -> to show multidimensional data, the graphs can be plotted together as a matrix.
To combine all the graphs and get the information out, use Machine Learning algorithms.
26/9/23 -- Basic Statistics
Statistics -> interpretation of results, determine the significance of results, how large the
difference between results should be. When is something different; good or wrong? Use statistics
as a tool. P-value, impact of risk, identify the problem, where does the data come from? Which data and
conclusions are trustworthy? How reliable is the p-value?
Variation: what is causing the difference between people, e.g. in weight? On top of that,
errors are made in the measurement itself.
Factors that can cause variation -> instrumental precision and calibration, human error,
inconsistencies in measurement technique, variability in body composition, clothing,
positioning, health status, time of measurement, growth and sample size and diversity.
Important to know what is causing variation.
Measurement -> always define your experiments properly: what is the
main source of variation? Re-think your experiment. After standardization, do we always get
exactly the same value?
This is all very logical but very important - measurement shows variation!!! Do experiment,
do statistics, see variation, redo experiment.
p-value => take the distribution curve of the measurements; beyond a certain threshold value you
decide that an observation differs more from the average than expected. The p-value is how likely
a measurement at least as extreme as the observed data point is, if the null hypothesis is true.
What is unlikely?? 5%? P-value = 0.05.
p-value = 0.05 is often used as cutoff. It means: if the null hypothesis is true, there is a 5%
chance of seeing a difference this large by chance alone. It is not the chance that the
null hypothesis is true, and not 95% certainty that there is a difference!
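To make the cutoff concrete, here is a minimal Python sketch. It assumes the measurements follow a normal distribution with a known mean and standard deviation; the function name and the example numbers are illustrative and not from the lecture.

```python
from statistics import NormalDist


def two_sided_p_value(x, mean, sd):
    """Probability of a value at least as far from the mean as x,
    assuming the measurements follow a normal distribution N(mean, sd)."""
    z = abs(x - mean) / sd
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical example: population weight with mean 70 kg and SD 10 kg.
p = two_sided_p_value(95, mean=70, sd=10)
print(f"p = {p:.4f}, below the 0.05 cutoff: {p < 0.05}")
```

A measurement about 1.96 standard deviations from the mean lands almost exactly on p = 0.05, which is where the usual cutoff sits on the distribution curve.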
The impact of risk is also important: with a 4% chance of bike failure or a 4% chance of plane
failure, you would still not take the plane. It is not possible to set one standard p-value
cutoff. You should also think about the validity of the p-value.
Same statistics, same p-value, different impact of risk!!!
Issue with statistics: calculate p-value but it never tells you if it is good or bad. In BMS often
an ethical discussion: risk for patient, risk of not treating patient.
*A p-value cutoff of 0.05 is a good starting point but always evaluate this assumption!
How are p-values calculated?
Generating data - a statistician wants a well-designed study that answers the question,
trustworthy data, and many replicates. Statisticians know how to analyse data and calculate
p-values. They often don’t know the theoretical background of the data, the impact of risk
(how to choose the threshold) and potential pitfalls. That is what you know, so think about
statistics! You often need other parameters (e.g. not only looking at length but also female/male
etc.; samples are not equal just because one variable is the same, also consider the others!).
*Change the setup of experiment to really check what causes the variation.
Pitfalls: biased samples, overgeneralization, causality, incorrect analysis choices, violation of the
assumptions of an analysis, data mining (multiple testing correction, rare events, FDR).
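As an illustration of multiple testing correction, the sketch below uses the Benjamini-Hochberg procedure, one common way to control the FDR. The notes do not say which method the course uses, so this choice is an assumption, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from testing four genes at once.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"p = {p:.3f} -> adjusted = {q:.3f}")
```

The more tests you run, the more small p-values appear by chance alone, so the adjusted values are never smaller than the raw ones.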
Basic statistics -> T-tests, linear regression, permutation testing, FDR testing, Fisher’s
exact test, Chi-squared test, Pearson’s vs Spearman’s correlation, PCA.
Data awareness: how much data do you need? You need replicates. Statistics are needed to
support your data with numbers. There can be some pitfalls (outliers).
T-test => t-statistic (Student’s t-test), compares 2 data sets and tells you if their means are
significantly different from each other (e.g. drug treated and placebo).
Types: independent samples (compares the means of two independent groups),
paired samples (compares the means of the same group, e.g. at different time points),
one sample (tests the mean of a single group against a known mean (standard or
reference!)).
Comparing students from two universities on their genetics skills. -> Independent
Comparing the grades of students between two semesters. -> Paired
Comparing the condition of a group of people before and after a summer training camp. -> Paired
Check if the alcohol consumption of students is higher than average (of the whole population). -> One sample
Check if the alcohol consumption of bachelor students is higher than that of master students. -> Independent
Two blood pressure measurements on the same person using different equipment. -> Paired
Knee MRI costs at two different hospitals. -> Independent
Are the COVID-19 infection rates higher for BSc students or MSc students? -> Independent
Paired T-test => before and after treatment in the same mice, so do a paired test (paired
samples). The null hypothesis is that the mean pairwise difference between the two measurements is
zero (H0: µA-µB = µd = 0). Is there a significant difference before and after treatment?
Define the threshold and then calculate the p-value.
For the standard T-test you need a normal distribution!!!
If you assume the data is normally distributed then use the
formula! Know by heart - calculate the t-value!!!
ΣD = sum of the differences (after − before: the control group is the before group and the
treated (target) group is the after group. For the conclusion the order doesn’t matter, only the
sign of t flips, but normally take target − control!!! T-C)
N = number of samples (pairs)
Be aware that ΣD² (sum of the squared differences) and (ΣD)² (square of the summed differences) are different.
Degrees of freedom DF = number of samples – 1. Take the p-value cutoff of 0.05. Find the critical
t-value in the T-distribution table (two tails), using the p-value and DF. If the absolute value of
the calculated t-value is above the critical t-value from the table, reject the null hypothesis! You
can also do it in R. *Smaller p, more sure samples are not equal.
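The paired calculation can be sketched in Python. This assumes the formula the notes refer to is the standard paired t formula, t = (ΣD / N) / sqrt((ΣD² − (ΣD)² / N) / ((N − 1) · N)). The function name and the before/after values are made up for illustration.

```python
import math


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples.

    Differences are taken as after - before (target - control).
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    n = len(before)
    diffs = [a - b for a, b in zip(after, before)]
    sum_d = sum(diffs)                      # ΣD
    sum_d_sq = sum(d * d for d in diffs)    # ΣD², not the same as (ΣD)²
    variance_term = (sum_d_sq - sum_d ** 2 / n) / ((n - 1) * n)
    if variance_term == 0:
        raise ValueError("all differences are identical; t is undefined")
    return (sum_d / n) / math.sqrt(variance_term), n - 1


# Hypothetical measurements in the same four mice before and after treatment.
before = [10, 12, 9, 11]
after = [12, 15, 10, 14]
t, df = paired_t_statistic(before, after)

# 3.182 is the two-tailed critical value for p = 0.05 and DF = 3
# in a standard t-distribution table.
critical_t = 3.182
print(f"t = {t:.2f}, DF = {df}, reject H0: {abs(t) > critical_t}")
```

Here t is about 4.70 with DF = 3, which is above the table value of 3.182, so the null hypothesis is rejected at the 0.05 cutoff.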
Independent samples T-Test => compare the means of 2 data sets. Assumptions:
Independence: you need two independent, categorical groups, e.g. “males” and “females”.
Normality: the dependent variable should be approximately normally distributed (on a continuous scale).
Homogeneity of Variance: variances should be equal.
You can have different numbers of samples. The formula is different!!! There is no pairwise
difference between the samples to calculate, because they are independent. But also in this case
calculate the t-value and then use the table again. Degrees of freedom here:
DF = (nA-1) + (nB-1) = nA + nB – 2. Be aware of the squares! The formula also uses the group means.
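A minimal Python sketch of the independent samples calculation. It assumes the equal-variance (pooled) Student's t formula, t = (meanA − meanB) / sqrt(sp² · (1/nA + 1/nB)) with sp² = (SSA + SSB) / (nA + nB − 2), which matches the assumptions and degrees of freedom listed above. The function names and data are made up for illustration.

```python
import math


def sum_of_squares(values):
    """Σx² - (Σx)² / n: squared deviations from the group mean."""
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n


def independent_t_statistic(group_a, group_b):
    """Return (t, degrees_of_freedom) using the pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = (n_a - 1) + (n_b - 1)
    pooled_var = (sum_of_squares(group_a) + sum_of_squares(group_b)) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    return (mean_a - mean_b) / std_error, df


# Hypothetical groups with different numbers of samples.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6]
t, df = independent_t_statistic(group_a, group_b)

# 2.447 is the two-tailed critical value for p = 0.05 and DF = 6
# in a standard t-distribution table.
critical_t = 2.447
print(f"t = {t:.2f}, DF = {df}, reject H0: {abs(t) > critical_t}")
```

Here t is about −0.79 with DF = 6, well inside the table value of 2.447, so the null hypothesis of equal means is not rejected.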