
Summary Data Science & Biostatistics

Pages: 18
Uploaded on: 12-12-2025
Written in: 2024/2025

Summary of all the course material for the exam. With it I passed the course on the first attempt with an 8!












BIOSTATISTICS
Statistics is the science of collecting, analysing, presenting and interpreting data. Many disciplines make use of statistics, for example medical research, which follows these steps:
1. Ask question
   o Is extended antibiotic treatment (9x24h) better than short antibiotic treatment (3x24h) for the treatment of haematological patients with iatrogenic neutropenia and fever of unknown origin?
2. Formulate hypothesis
   o The percentage of patients with fever recurrence within 28 days does not differ between short and extended antibiotic treatment.
3. Collect data
   o Randomize 200 patients to receive either short or extended antibiotic treatment and count the number of patients in each group with fever recurrence within 28 days.
4. Analyse data
   o 12 patients receiving short and 9 patients receiving extended antibiotic treatment had fever recurrence within 28 days; the difference (extended minus short) is -3%, with a 95% confidence interval of (-11.5%; 5.5%).
5. Formulate answer
   o No statistical evidence for a benefit of extended antibiotic treatment, because the confidence interval includes 0.
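The interval in step 4 can be reproduced with a normal-approximation confidence interval for a difference of two proportions. The sketch below is an illustration only: the function name `diff_proportion_ci` is made up, and it assumes 100 patients per group, which the notes do not state explicitly (they only give the total of 200).

```python
import math

def diff_proportion_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Return (difference, lower, upper) for p_b - p_a, normal approximation."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_b - p_a
    # Standard error of a difference of two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

# Assumption: 100 patients per arm; group a = short, group b = extended treatment.
diff, lower, upper = diff_proportion_ci(12, 100, 9, 100)
print(f"difference = {diff:.1%}, 95% CI = ({lower:.1%}; {upper:.1%})")
# prints: difference = -3.0%, 95% CI = (-11.5%; 5.5%)
```

Under that assumption the result matches the interval quoted in the notes, and since it contains 0 the data are compatible with no difference between the treatments.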


DESCRIPTIVE STATISTICS

There are several common types of study:
• Cross-sectional: data collected at one point in time
• Prospective: subjects included 'at baseline', outcome assessed in the future/over time
   o randomized controlled trial (RCT)
   o longitudinal/observational study
• Retrospective: outcome has already been assessed, looking back in time
With these types of studies, different kinds of data can be collected:
• Binary data: gender, HPV status (infected/not infected), myocardial infarction (yes/no)
• Categorical data: alcohol consumption (none/moderate/heavy), clinical T-stage (1/2/3/4), water source (river/pond/spring)
• Continuous data: cholesterol, triglyceride concentration, quality of life
• Time-to-event data (survival): time to death, time to recurrence after treatment, time to get employed
   o Difficult, because there are always cases that drop out of the study before the event is observed
Descriptive statistics summarize and describe important features of the data and concern the sample. This can
be shown with graphics, such as histogram, boxplot, scatter plot, etc., or with numerical summary measures,
such as the mean, median, standard deviation (SD), percentage, etc. Inferential statistics are used to draw a
conclusion beyond the data sample, using effect size, confidence interval, hypothesis testing, etc. The
distribution of the data can have different shapes.
[Figure: three distribution shapes: symmetrical and bell-shaped; positively skewed (to the right); negatively skewed (to the left)]




There are several measures of centre. These are the mean ($\bar{x} = \frac{1}{n}\sum x_i$), median (middle value), or mode (most frequent value). If the distribution is right-skewed, the mean is typically larger than the median.
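As a small illustration of the three measures of centre, the sketch below uses Python's standard `statistics` module on a made-up, right-skewed sample (the values and the variable name `stay_days` are hypothetical, not data from the course).

```python
import statistics

# Hypothetical right-skewed sample, e.g. length of hospital stay in days.
stay_days = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

mean = statistics.mean(stay_days)      # sum of the values divided by n
median = statistics.median(stay_days)  # middle value of the sorted data
mode = statistics.mode(stay_days)      # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
# prints: mean = 6, median = 4.0, mode = 3
```

The single large value (21) pulls the mean above the median, which is the pattern expected for a right-skewed distribution.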



There are also measures of spread. These are the standard deviation ($SD = \sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}$, where $\bar{x}$ is the mean), the variance ($SD^2$), the range (max - min), or the interquartile range (IQR = Q3 - Q1). The common practice in medical articles for symmetric distributions is to report the mean and SD. For skewed distributions, the median and IQR are reported, and for proportions, the n and % are reported. A scatter plot can be made to look at the association between two continuous variables. A Pearson correlation of r = +1 shows a perfectly positive linear association, r = -1 shows a perfectly negative linear association, and r close to 0 shows no linear association.
[Figure: three scatter plots with cor = 0.98, cor = -0.02 and cor = -0.96]
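The measures of spread and the Pearson correlation can be sketched with the standard library as follows. All data are hypothetical, and `pearson_r` is a helper written for this illustration. Note that `statistics.pstdev` divides by n, whereas `statistics.stdev` divides by n - 1 (the usual choice for a sample), and that other quartile conventions can give a slightly different IQR than `statistics.quantiles`.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

sd = statistics.pstdev(values)            # divides by n; statistics.stdev divides by n - 1
variance = statistics.pvariance(values)   # SD squared
value_range = max(values) - min(values)   # max - min
q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points Q1, Q2, Q3
iqr = q3 - q1

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2 * v + 1 for v in x])     # exact increasing line
r_down = pearson_r(x, [10 - 3 * v for v in x])  # exact decreasing line

print(f"SD = {sd}, variance = {variance}, range = {value_range}, IQR = {iqr}")
print(f"r (increasing line) = {r_up:.2f}, r (decreasing line) = {r_down:.2f}")
```

Points that lie exactly on an increasing or decreasing straight line give r = +1 and r = -1 respectively; real data, as in the scatter plots, fall somewhere in between.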




CONFIDENCE INTERVALS

Inferential statistics allow conclusions to be drawn about a population based on a sample. The effect size is used for estimation, the confidence interval for uncertainty, and the p-value for hypothesis testing. The central limit theorem states that under certain conditions, the distribution of the average of a large number of independent, identically distributed random variables tends to be approximately normally distributed, regardless of the original distribution of the variables. This result is particularly important because it allows statisticians to make inferences about population parameters using sample data.
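A quick simulation can make the central limit theorem more tangible. The sketch below is only an illustration with arbitrary choices (exponential data, samples of 50, 2000 repetitions, a fixed seed): it draws many samples from a strongly right-skewed distribution and looks at how the sample means behave.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

SAMPLE_SIZE = 50   # observations per sample (arbitrary choice)
N_SAMPLES = 2000   # number of repeated samples (arbitrary choice)

# Each observation comes from an exponential distribution with mean 1 and SD 1,
# which is strongly right-skewed.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(SAMPLE_SIZE)  # SD / sqrt(n)

# Under approximate normality, about 95% of the sample means should lie
# within 1.96 standard errors of the population mean (here 1.0).
within = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in sample_means)
coverage = within / N_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"SD of sample means:   {sd_of_means:.3f} (SD / sqrt(n) = {theoretical_se:.3f})")
print(f"share within 1.96 SE: {coverage:.3f}")
```

Although the individual observations are far from normal, the sample means cluster around the population mean with a spread close to SD / √n, which is the quantity used as the standard error.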

What is the mean FEV1 in a population of children aged 7 to 10 years?
Sample: N = 636 children, mean = 1.59 L, SD = 0.30 L. The uncertainty is quantified by the standard error (SE): $SE_{mean} = \frac{SD}{\sqrt{n}} = \frac{0.30}{\sqrt{636}} = 0.012$. The 95% confidence interval (CI) is $(mean - 1.96 \times SE_{mean};\ mean + 1.96 \times SE_{mean}) = mean \pm 1.96 \times SE_{mean}$. 95% CI: $1.59 \pm 1.96 \times 0.012 = [1.57; 1.61]$.
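The FEV1 calculation can be checked with a few lines of Python. This is a minimal sketch using the sample values quoted in the text; the function name `mean_ci` is chosen for this illustration.

```python
import math

def mean_ci(mean, sd, n, z=1.96):
    """Return (SE, lower, upper) of a normal-approximation CI for a mean."""
    se = sd / math.sqrt(n)  # standard error of the mean
    return se, mean - z * se, mean + z * se

# Values from the FEV1 example: N = 636, mean = 1.59 L, SD = 0.30 L
se, lower, upper = mean_ci(1.59, 0.30, 636)
print(f"SE = {se:.3f}, 95% CI = [{lower:.2f}; {upper:.2f}]")
# prints: SE = 0.012, 95% CI = [1.57; 1.61]
```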

The standard error tells how certain you are of an estimated mean. It is used to calculate confidence intervals. If the study were repeated many times, about 95% of the confidence intervals calculated in this way would contain the actual population mean. These CI formulas depend on the assumption that the distribution of the mean/proportion is approximately normal. This is often reasonable, especially as n grows. However, this is not the case for many other statistics.


DIAGNOSTIC TESTING

The sensitivity shows how many of the people who have the condition are correctly identified by the test as having it. It is calculated with $\frac{TP}{TP + FN}$. The specificity shows how many of the people who do not have the condition are correctly identified as not having it. It is calculated with $\frac{TN}{TN + FP}$. Here TP, FN, TN and FP are the numbers of true positives, false negatives, true negatives and false positives. The sensitivity and specificity do not directly inform on the predictive value of positive or negative tests. This is done with the positive and negative predictive value (PPV and NPV). The PPV is the probability that a person who has a positive test result truly has the disease. It is calculated with $\frac{TP}{TP + FP}$. A high PPV means that if the test result is positive, there's a high chance the person actually has the condition. The NPV is the probability that a person who has a negative test result truly does not have the disease. It is calculated with $\frac{TN}{TN + FN}$. A high NPV means that if the test result is negative, there's a high chance the person does not have the condition. The prevalence has a large impact on the PPV and NPV. The prevalence is the percentage of cases in the population.

Example: sensitivity = 90%, specificity = 95%, and N = 1000.
Case 1: prevalence = 10% (100 cases, 900 controls)
PPV = test-positive cases / number of positive tests = 90/135 = 2/3 = 66.7%
NPV = test-negative controls / number of negative tests = 855/865 = 98.8%
Case 2: prevalence = 30% (300 cases, 700 controls)
PPV = test-positive cases / number of positive tests = 270/305 = 88.5% (> 66.7%)
NPV = test-negative controls / number of negative tests = 665/695 = 95.7% (< 98.8%)
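The two cases can be recomputed by building the 2x2 table from the sensitivity, specificity, prevalence and N. The function `predictive_values` below is a sketch written for this illustration, not a library routine.

```python
def predictive_values(sensitivity, specificity, prevalence, n):
    """Return (PPV, NPV) from the expected 2x2 table for n tested people."""
    diseased = prevalence * n
    healthy = n - diseased
    tp = sensitivity * diseased  # true positives
    fn = diseased - tp           # false negatives
    tn = specificity * healthy   # true negatives
    fp = healthy - tn            # false positives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

for prevalence in (0.10, 0.30):
    ppv, npv = predictive_values(0.90, 0.95, prevalence, 1000)
    print(f"prevalence = {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
# prints: prevalence = 10%: PPV = 66.7%, NPV = 98.8%
# prints: prevalence = 30%: PPV = 88.5%, NPV = 95.7%
```

With the same test, a higher prevalence raises the PPV and lowers the NPV, as in the two cases.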


The 95% confidence interval for a proportion p is $p \pm 1.96 \times SE_p$, with $SE_p = \frac{\sqrt{p(1-p)}}{\sqrt{n}}$, where n is the number of observations the proportion is based on. For the PPV: $SE_p = \frac{\sqrt{p \cdot (1-p)}}{\sqrt{n}} = \frac{\sqrt{0.667 \cdot 0.333}}{\sqrt{305}} = 0.027$, so the 95% CI is $0.667 \pm 1.96 \times 0.027 = [0.614; 0.720]$. It is similar for the sensitivity, specificity and NPV.
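The same arithmetic in code, as a minimal sketch. The function name `proportion_ci` is chosen for this illustration, and the inputs p = 0.667 and n = 305 are simply the figures used in the text.

```python
import math

def proportion_ci(p, n, z=1.96):
    """Return (SE, lower, upper) of a normal-approximation CI for a proportion."""
    se = math.sqrt(p * (1 - p)) / math.sqrt(n)  # standard error of the proportion
    return se, p - z * se, p + z * se

# Figures as used in the text: p = 0.667, n = 305
se, lower, upper = proportion_ci(0.667, 305)
print(f"SE = {se:.3f}, 95% CI = [{lower:.3f}; {upper:.3f}]")
# prints: SE = 0.027, 95% CI = [0.614; 0.720]
```

For a PPV, n is the number of positive tests; for a sensitivity it would be the number of diseased people, and so on for the other measures.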


HYPOTHESIS TESTING
The null hypothesis states that there is no difference between the means of the groups that are compared. The alternative hypothesis states that there is a difference between the means of the 2 groups:
• H0: mean FEV1 girls = mean FEV1 boys
• Ha: mean FEV1 girls ≠ mean FEV1 boys


CONFIDENCE INTERVALS AND P-VALUE

The effect size is the mean difference between both groups, e.g. 1.66 - 1.54 = 0.12 (boys vs. girls). However, this does not immediately mean that the null hypothesis can be rejected. There are 2 approaches to decide between the hypotheses: confidence intervals and p-values. To calculate the confidence interval, the standard error of the mean difference is calculated: $SE_{diff} = \sqrt{\frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{n_1 + n_2 - 2}} \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$. With this standard error, the confidence interval can be calculated with: $mean_{diff} \pm 1.96 \times SE_{diff}$. If 0 does not fall in the confidence interval, the null hypothesis is rejected in favour of the alternative hypothesis.
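The pooled standard error and the resulting confidence interval can be sketched as below. Only the two means (1.66 and 1.54) come from the text; the group sizes and standard deviations are hypothetical values chosen for illustration, and `mean_diff_ci` is a name made up for this sketch.

```python
import math

def mean_diff_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Return (difference, SE, lower, upper) using the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    )
    se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff, se_diff, diff - z * se_diff, diff + z * se_diff

# Means from the text (boys 1.66, girls 1.54).
# Assumption: n = 318 and SD = 0.30 in each group (hypothetical, not given in the notes).
diff, se_diff, lower, upper = mean_diff_ci(1.66, 0.30, 318, 1.54, 0.30, 318)
print(f"difference = {diff:.2f}, SE = {se_diff:.4f}, 95% CI = [{lower:.3f}; {upper:.3f}]")

rejects_h0 = not (lower <= 0 <= upper)  # True when 0 lies outside the interval
print("0 outside the interval:", rejects_h0)
```

With these assumed inputs the interval lies entirely above 0; with smaller groups or larger standard deviations the standard error grows and the interval may include 0.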
Assume that in the population the mean difference is 0, which means that the null hypothesis is true. Then there is a chance that in the sample, a mean difference of 0.12 or more extreme is observed (> 0.12 or < -0.12). This chance is the p-value. For a given observed difference, this chance is larger when the standard error is larger (more fluctuations), so the p-value also depends on the SE. If the p-value is lower than 0.05, the null hypothesis is rejected. With this threshold, the probability of falsely rejecting a true H0 (type I error) is 0.05. An independent samples t-test or a paired

Seller: kimberleyvet, Vrije Universiteit Amsterdam