Summary

Summary Data Science Research Methods (JBM025)

Rating: -
Sold: 2
Pages: 26
Uploaded on: 26-06-2022
Written in: 2021/2022

Summary of the course Data Science Research Methods (JBM025) from the Data Science major in Eindhoven and Tilburg. The course has two parts. The first part focuses on the scientific method and design of experiments (DOE). The second part focuses on econometrics and builds upon what is discussed in the course Business Analytics.












Preview of the content

DATA SCIENCE RESEARCH METHODS
CONTENTS

The scientific method
  Six Sigma

Sample size determination
  Minimal sample sizes
  Normal distribution
  Binomial distribution
  When σ or p is not known
  Power analysis
  Normal distribution
  Binomial distribution

Analysis of variance (ANOVA)
  ANOVA table

ANOVA – power and multiple comparisons
  ANOVA power
  Multiple comparisons
  Fisher Least Significance Difference (LSD)
  Tukey’s Honest Significant Difference (HSD)

Two-factor designs and blocking

Full factorial designs
  DOE: how to determine whether an individual factor is of importance
  Blocking with 2 factors

Fractional Factorial designs
  Fractional experiments
  Fractional factorials

Response Surface Optimisation
  Improvement Efficiently: finding near-optimal factor settings
  Box/Simplex method
  Steepest ascent/descent method
  Quadratic models
  Response surface designs
  Central Composite Design (CCD)
  Box-Behnken Design
  Deriving optimal settings
  Optimums
  Optimisation scheme

Econometrics for data scientists
  Random variables
  Regressions
  Bivariate and multivariate regressions
  Ordinary least squares (OLS)
  Instrumental variable estimation

Causality and selection
  Causality
  Selection and selection bias
  Regression and randomized experiments
  Potential problems with experiments

Selection on observables and matching
  Matching
  3 methods of matching
  Exact matching
  Matching based on closeness of observables
  Propensity score matching
  OLS estimator as matching estimator
  Flexible OLS as matching estimator

Differences-in-differences estimation
  Some important details
  Generalization

Regression Discontinuity design (RDD)
  Sharp regression discontinuity design
  Main idea and interpretation
  Estimation of the treatment effect in Sharp RDD
  Approach 2
  Approach 1
  Fuzzy regression discontinuity design
  Estimation of the fuzzy RD
  Alternative to this estimation
  Specification testing

THE SCIENTIFIC METHOD

Key concepts
• Scientific method
• Experiment
• Factor
• Independent variable
• Six Sigma

What should you be able to do?
• Link elements of Six Sigma to the scientific method
• Translate a case study in terms of independent variables (factors) and dependent variables
• Be able to distinguish, in a specific data science context, which of the three basic goals is relevant


Key insights
• It is important to identify which of the three data science goals is relevant in a given context
• The scientific method is an iterative process
• If you do not plan an experiment well in advance, no statistical analysis may yield the hoped-for results
• Experiments may involve several factors, each of which may have more than 2 levels
• The scientific method is also very useful in industry
• The Six Sigma approach in industry has incorporated several aspects of the scientific method


Data science has three goals:
1. Description
2. Prediction
3. Explanation

Business has similar distinctions regarding analytics:
1. Descriptive analytics provide insight into the past
2. Predictive analytics provide understanding of the future
3. Prescriptive analytics advise on the possible outcomes

Basic elements of the (iterative) scientific method
1. Formulate a question
2. Perform background research
3. Formulate the hypothesis (answer)
4. Determine the logical consequences of the hypothesis
5. Collect observations (experiment)
6. Test the truth of the hypothesis by analysing observations (statistics)
7. Report the results
8. If the hypothesis is not confirmed, go back to 2

Steps in experimentation
1. Plan the experiment
2. Design the experiment
3. Perform the experiment
4. Analyse the resulting data
5. Confirm the results
6. Evaluate the conclusion


There are a number of valid reasons for the iterative approach:
1. New insights were obtained after analysing the experiment
2. New questions arose from the experiment
3. The hypotheses were built upon wrong assumptions
The iterative nature means that, if a hypothesis is refuted by the experiment, you should start over, form a new
hypothesis, and test that one in turn. This cycle is repeated until no further iteration is needed.

SIX SIGMA

Six Sigma: A disciplined, data-driven methodology for process improvement.
It is a combination of quality management tools and the statistical method.
DMAIC: The circular problem-solving approach of Six Sigma.
Its steps correspond to the steps in experimentation of the scientific method:
Define (1, 2) – Measure (3) – Analyse (4) – Improve ( ) – Control ( )


DMAIC also uses the principles of the scientific method:
1. The DMAIC cycle uses the same iterative discovery cycle
2. It puts emphasis on doing well-defined experiments to discover new insights
3. It is data driven and puts emphasis on quantification
4. It looks for causal relationships
5. It puts emphasis on proper verification and validation of results

SAMPLE SIZE DETERMINATION
How much data do I need to collect?

Key concepts
• p-value
• hypothesis tests
• width of the confidence interval
• power
• minimal sample size

What should you be able to do?
• Compute the minimal sample size in terms of CI width when you are given the formula (normal, binomial)
• Compute the minimal sample size in terms of power when you are given the formula (normal, binomial)
• Compute minimal sample sizes when given a simple confidence or power formula for a distribution

Key insights
• The absolute error parameter is the half-width of the CI in the case of symmetric CIs
• The CI width in binomial and normal distributions leads to the minimal sample size
• Minimal sample size determination in binomial cases requires extra information on the success probability $p$


There are three basic ways of hypothesis testing:
1. Is the test statistic in the critical region (yes/no)? This does not provide a lot of information.
2. P-values: these allow people to choose their own $\alpha$ value.
3. Confidence intervals: these give insight into how uncertain we are about the prediction.
$(\hat{\theta} - c,\ \hat{\theta} + c)$ is a $100(1 - \alpha)\%$ CI when $P(\hat{\theta} - c < \theta < \hat{\theta} + c) = 1 - \alpha$


Type I error (false positive)
$\alpha$: the probability of rejecting $H_0$ when $H_0$ is true. $1 - \alpha$ is the true-negative probability (not rejecting $H_0$ when it is true).
Type II error (false negative)
$\beta$: the probability of not rejecting $H_0$ when $H_0$ is false.
Power (true positive)
$1 - \beta$: the probability of rejecting $H_0$ when $H_0$ is false.
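
The three quantities above can be made concrete with a small numeric sketch. It assumes a two-sided z-test for a normal mean with known standard deviation; the function name `two_sided_z_power` and the example numbers are illustrative and not taken from the course material.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu_true.

    Assumes a two-sided z-test with known sigma (an assumption of this sketch).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standardised distance between the true mean and mu0.
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    # The test statistic is N(shift, 1); reject when it falls outside +/- z_crit.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# When H0 is true, the rejection probability is the type I error rate alpha.
type_one_rate = two_sided_z_power(mu0=10, mu_true=10, sigma=2, n=25)

# When H0 is false, the rejection probability is the power 1 - beta.
power = two_sided_z_power(mu0=10, mu_true=11, sigma=2, n=25)
beta = 1 - power

print(f"alpha = {type_one_rate:.3f}, power = {power:.3f}, beta = {beta:.3f}")
```

With the true mean equal to the hypothesised mean, the function returns the significance level itself, which is one way to see that $\alpha$ is the false-positive rate.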

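Z-tests (Normal distribution)
Assumptions: $X_i \sim N(\mu, \sigma^2)$ + independence
$H_0: \mu = \mu_0$
$H_a: \mu \neq \mu_0$
Significance level $\alpha$
Test statistic: $T = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
Decision rule: reject if $|T| > z_{\alpha/2}$. Under $H_0$, $\bar{X} - \mu_0 \sim N\left(0, \frac{\sigma^2}{n}\right)$, so $T \sim N(0, 1)$.
$100(1 - \alpha)\%$ CI for $\mu$: $\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$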
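
As a minimal sketch of the z-test and its confidence interval, the following uses only the Python standard library. The function name `z_test` and the example values (sample mean 10.5, $\mu_0 = 10$, $\sigma = 2$, $n = 100$) are hypothetical and chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known."""
    std_error = sigma / sqrt(n)
    t_stat = (sample_mean - mu0) / std_error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * std_error
    return {
        "T": t_stat,
        "z_crit": z_crit,
        "reject": abs(t_stat) > z_crit,
        # 100(1 - alpha)% confidence interval for mu.
        "ci": (sample_mean - half_width, sample_mean + half_width),
        # Two-sided p-value under the standard normal distribution.
        "p_value": 2 * (1 - NormalDist().cdf(abs(t_stat))),
    }


result = z_test(sample_mean=10.5, mu0=10.0, sigma=2.0, n=100)
print(result)
```

In this example the hypothesised mean lies outside the confidence interval, which matches the decision to reject: the three ways of testing (critical region, p-value, confidence interval) lead to the same conclusion.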




MINIMAL SAMPLE SIZES

The formula for the minimal sample size can be derived from the confidence interval.
The half-width of the CI gives the error ($E$); solving that expression for $n$ gives the minimal sample size.
The formulas to calculate the sample size have a similar form: $n \geq \left\lceil \left(\frac{z_{\alpha/2}}{E}\right)^2 \sigma^2 \right\rceil$

If the deviation is not absolute but relative to the expected value 𝜎 (e.g. p of the response time), then 𝐸 = 𝑝 × 𝜎
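
The general form of the sample size formula can be sketched in a few lines of Python. The function name `min_sample_size` and the example values ($\sigma = 2$, $E = 0.5$, $\alpha = 0.05$) are assumptions made for illustration only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size(sigma, error, alpha=0.05):
    """Smallest n with z_{alpha/2} * sigma / sqrt(n) <= error (sigma known)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / error) ** 2 * sigma ** 2)


n = min_sample_size(sigma=2.0, error=0.5)

# Check: the CI half-width at n meets the target, but at n - 1 it does not.
z = NormalDist().inv_cdf(0.975)
half_width_at_n = z * 2.0 / sqrt(n)
half_width_below = z * 2.0 / sqrt(n - 1)
print(n, round(half_width_at_n, 4), round(half_width_below, 4))
```

Rounding up with `ceil` matters here: the unrounded value is a lower bound on $n$, so rounding down would give a half-width slightly larger than the allowed error.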



NORMAL DISTRIBUTION

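One-sample (if $\sigma$ is known)
CI: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Error: $E \geq z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
Sample size: $n \geq \left(\frac{z_{\alpha/2}}{E}\right)^2 \sigma^2$

Two-sample (if the $\sigma$s are known, and $n_1 = n_2 = n$)
CI: $\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}}$
Sample size: $E \geq z_{\alpha/2} \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{n}} \Rightarrow n \geq \left(\frac{z_{\alpha/2}}{E}\right)^2 (\sigma_1^2 + \sigma_2^2)$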
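
The two-sample case differs from the one-sample case only in that the two variances are summed. A minimal sketch, assuming equal group sizes and known standard deviations; the function name `min_sample_size_two_sample` and the example values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size_two_sample(sigma1, sigma2, error, alpha=0.05):
    """Smallest per-group n so the CI half-width for mu1 - mu2 is <= error.

    Assumes both sigmas are known and both groups have the same size n.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / error) ** 2 * (sigma1 ** 2 + sigma2 ** 2))


n_per_group = min_sample_size_two_sample(sigma1=2.0, sigma2=3.0, error=1.0)

# Half-width of the CI for the difference in means at this group size.
z = NormalDist().inv_cdf(0.975)
half_width = z * sqrt((2.0 ** 2 + 3.0 ** 2) / n_per_group)
print(n_per_group, round(half_width, 4))
```

Note that the result is the size of each group, so the total number of observations is twice this value.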

Seller: NienkeUr, Technische Universiteit Eindhoven