Summary: Statistics and Methodology

Pages: 108
Uploaded on: 27-03-2021
Written in: 2020/2021

Summary including all worked R solutions, with accompanying images of the examples.

Summary Statistics and Methodology.

Week 1.

Video 1. Basics 1.

Statistical reasoning is thinking carefully about conclusions and precise measurements in our tests.

Data scientists must scrutinize large amounts of data and extract useful knowledge. Data contains
raw information. To convert this information into actionable knowledge, data scientists apply various
data analytic techniques. When presenting the results of such analyses, data scientists must be careful
not to overstate their findings. Too much confidence in an uncertain finding could lead your employer to
waste large amounts of resources chasing data anomalies. Statistics offers us a way to protect ourselves
from ourselves.

Probability distributions quantify how likely it is to observe each possible value of some probabilistic
entity. Probability distributions are re-scaled frequency distributions. We can build up the intuition of
a probability density by beginning with a histogram (density = proportion). With an infinite number of
bins, a histogram smooths into a continuous curve.

• In a loose sense, each point on the curve gives the probability of observing the corresponding
X value in any given sample.
• The area under the curve (AUC) must equal 1.0.
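The link between a histogram and a density can be sketched in a few lines of Python. This is an illustration only: the sample is simulated, and the function name `histogram_density` is hypothetical. Each bar height is the proportion of observations in the bin divided by the bin width, so the bars always enclose a total area of 1, however many bins are used.

```python
import random

random.seed(1)
# Hypothetical sample: 10,000 simulated draws from a standard normal distribution.
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def histogram_density(values, n_bins):
    """Return (bin_width, densities), where density = proportion / bin width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    densities = [c / (len(values) * width) for c in counts]
    return width, densities

areas = {}
for n_bins in (10, 100, 1000):
    width, densities = histogram_density(sample, n_bins)
    # Total area of the bars: sum of height * width.
    areas[n_bins] = sum(d * width for d in densities)
    print(n_bins, round(areas[n_bins], 6))
```

As the number of bins grows, the bars become narrower and the outline approaches a smooth curve, while the enclosed area stays at 1.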




Video 2. Basics 2.

Statistical testing: in practice, we may want to distill the information in the preceding plot into a
simple statistic so we can make a judgement. One way to distill this information and control for
uncertainty when generating knowledge is through statistical testing. When we conduct statistical
tests, we weight the estimated effect by the precision of the estimate. A common type of statistical
test, the Wald test (of which the t-test is an instance), follows this pattern:
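In its general form, the Wald statistic divides the distance between the estimate and its null-hypothesized value by a measure of the estimate's variability, usually its standard error. The wording of the terms below is an assumption based on that standard definition:

```latex
T = \frac{\text{Estimate} - \text{Null-hypothesized value}}{\text{Variability}}
```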




If we want to test the null hypothesis of a zero mean difference, applying Wald test logic to control
for the uncertainty in our estimate results in the familiar t-test:
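Written out for two group means, a standard form of this statistic is as follows. The symbols are assumptions chosen for illustration: $\bar{X}_1$ and $\bar{X}_2$ are the two sample means, 0 is the null-hypothesized difference, and $SE$ is the standard error of the difference.

```latex
t = \frac{(\bar{X}_1 - \bar{X}_2) - 0}{SE(\bar{X}_1 - \bar{X}_2)}
```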

(Don't memorize formulas!)

A larger test statistic (in absolute value) means more certainty that the estimated effect is not just noise.
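A minimal Python sketch of this computation follows. The lap times are made-up numbers, the function name `welch_t` is hypothetical, and the unpooled (Welch) standard error is one common choice rather than necessarily the one used in the course.

```python
import math
import statistics

# Hypothetical lap times in seconds for two groups (illustrative numbers only).
laps_a = [92.1, 91.8, 93.0, 92.4, 91.5, 92.9, 92.2, 91.9]
laps_b = [91.2, 91.9, 90.8, 91.5, 92.0, 91.1, 90.9, 91.6]

def welch_t(x, y, null_value=0.0):
    """Wald-type statistic: (estimate - null value) / standard error."""
    estimate = statistics.mean(x) - statistics.mean(y)
    # Unpooled standard error of the mean difference (an assumed choice).
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (estimate - null_value) / se

t_stat = welch_t(laps_a, laps_b)
print(round(t_stat, 2))
```

The same mean difference yields a larger statistic when the standard error is smaller, which is how the test weights the effect by the precision of the estimate.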

How do we use a test statistic to compare for example lap times?

• A test statistic, by itself, is just an arbitrary number.
• To conduct the test, we need to compare the test statistic to some objective reference.
• This objective reference needs to tell us something about how exceptional our test statistic
is.
• The specific reference we will be employing is known as a sampling distribution of the test
statistic.

A sampling distribution is simply the probability distribution of a statistic (for example, a parameter estimate).

• The population is defined by an infinite sequence of repeated tests. The sampling distribution
quantifies the possible values of the test statistic over infinite repeated sampling.
• The area of a region under the curve represents the probability of observing a test statistic
within the corresponding interval.

Note that a sampling distribution is a slightly different concept than the distribution of a random
variable:

• The sampling distribution quantifies the possible values of a statistic (mean, t-statistic,
correlation coefficient, etc.).
• The distribution of a random variable quantifies the possible values of a variable (age,
gender, income, movie preference, etc.).
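The difference between the two kinds of distribution can be made concrete with a small simulation. Everything here is an assumption for illustration: the variable is a simulated "age" with mean 40 and standard deviation 10, and the helper `draw_sample` is hypothetical.

```python
import random
import statistics

random.seed(42)

def draw_sample(n):
    """Draw n values of a hypothetical variable (e.g. age): mean 40, SD 10."""
    return [random.gauss(40.0, 10.0) for _ in range(n)]

# Distribution of the variable itself: spread of individual values.
one_sample = draw_sample(1000)
sd_variable = statistics.stdev(one_sample)

# Sampling distribution of the mean: re-draw many samples of size n
# and keep the mean of each one.
n = 25
sample_means = [statistics.mean(draw_sample(n)) for _ in range(2000)]
sd_of_means = statistics.stdev(sample_means)

print(round(sd_variable, 2), round(sd_of_means, 2))
```

The individual values spread out with a standard deviation near 10, while the sample means cluster much more tightly, near 10 / sqrt(25) = 2.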

The t-test we've been considering is a way to summarize the comparison of two variables'
distributions.

• The t-statistic also has a sampling distribution that quantifies the possible t-values we could get
if we repeatedly drew samples from the variables' distributions and re-computed the t-statistic
each time.

To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics assuming no effect. This distribution quantifies H0. The special case of an
H0 of no effect is called the nil-null. If our estimated statistic would be very unusual in a population
where H0 is true, we reject the null and claim a 'statistically significant' effect.

We can find the probability associated with a range of values by computing the area of the
corresponding slice from the distribution.




By calculating the area in the null distribution that exceeds our estimated test statistic, we can
compute the probability of observing the given test statistic, or one more extreme, if H0 were
true. In other words, we can compute the probability of having sampled the data we observed, or
more unusual data, from a population wherein there is no true mean difference in lap times. This
value is the infamous p-value.
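The "area beyond the observed statistic" can be approximated by simulation. The sketch below is an illustration under stated assumptions: the observed value 1.86 is the statistic quoted in these notes, the group size of 30 and the normal populations are made up, and the helper names `t_stat` and `null_t_values` are hypothetical.

```python
import math
import random
import statistics

random.seed(7)

def t_stat(x, y):
    """Mean difference divided by its (unpooled) standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

def null_t_values(n_per_group, n_reps):
    """Simulate t-statistics when both groups come from the same population (H0 true)."""
    values = []
    for _ in range(n_reps):
        x = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        values.append(t_stat(x, y))
    return values

observed_t = 1.86  # estimated test statistic (value quoted in these notes)
null_ts = null_t_values(n_per_group=30, n_reps=4000)

# One-tailed p-value: share of the null distribution at or beyond the observed value.
p_one_tailed = sum(t >= observed_t for t in null_ts) / len(null_ts)
print(p_one_tailed)
```

The printed proportion is a Monte Carlo estimate, so it varies slightly with the seed and the number of replications.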




The preceding test is one-tailed. We use a one-tailed test when we have directional hypotheses. Since
we had no prior expectation that setup B would outperform setup A, we need to use a two-tailed test.
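The practical difference between the two tests can be sketched as follows. This is an approximation under a stated assumption: the null distribution is taken to be standard normal, which is close to the t-distribution only for large samples, and the function name `upper_tail_normal` is hypothetical. The statistic 1.86 is the value quoted in these notes.

```python
import math

def upper_tail_normal(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

t_observed = 1.86  # test statistic quoted in these notes

# One-tailed: only large positive values count as extreme.
p_one_tailed = upper_tail_normal(t_observed)
# Two-tailed: values at least as extreme in either direction count.
p_two_tailed = 2.0 * upper_tail_normal(abs(t_observed))

print(round(p_one_tailed, 3), round(p_two_tailed, 3))
```

Under this approximation the one-tailed p-value is a little above 0.03 and the two-tailed p-value is twice that, so the same statistic falls below a 0.05 threshold in the one-tailed test but not in the two-tailed test.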

Consider the one-tailed test for our estimated test statistic of t = 1.86 that produces a p-value of p =
0.032:

• We cannot say that there is a 0.032 probability that the true mean difference is greater than
zero.
• We cannot say that there is a 0.032 probability that Ha is true.
• We cannot say that there is a 0.032 probability that the null hypothesis is false.
• We cannot say that there is a 0.032 probability of replicating the observed effect in future
studies.

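How do we actually interpret p-values? The p-value tells us the probability of a test statistic at least
this large given that H0 is true, P(t ≥ T | H0). But what we really want to know is the probability that
the hypothesis holds given the data we observed. All that we can say is that there is a 0.032 probability
of observing a test statistic at least as large as T, if H0 is true. Our test uses the same logic as proof
by contradiction.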

The probability of observing any individual point on a continuous distribution is exactly zero.
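This follows from treating probability as area under the density curve. In the notation below, the symbol $f$ for the density is an assumption: an interval has positive width and therefore positive area, while a single point has zero width.

```latex
P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad P(X = a) = \int_a^a f(x)\,dx = 0
```

This is why p-values are defined over a range ("at least as large as") rather than for the exact observed value.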

Video 3. Basics 3.

Statistical testing is a very useful tool, but it quickly reaches a limit. In experimental contexts,
real-world messiness is controlled through random assignment, and statistical testing is a sufficient
method of knowledge generation. Data scientists rarely have the luxury of being able to conduct
experiments. Data scientists work with messy observational data and usually don't have questions
that lend themselves to rigorous testing. Data scientists need statistical modeling.

The idea of statistical modeling: modelers attempt to build a mathematical representation of the
(interesting aspects of the) data distribution. The model succinctly describes whatever system is being
analyzed. Beginning with a model ensures that we are learning the important features of a
distribution. The modeling approach is especially important in messy data science applications
where clear a priori hypotheses are rare.

To apply a modeling approach to our example problem, we consider the combined distribution of lap
times. The model we construct will explain variation in lap times based on interesting features. In this
simple case, the only feature we consider is the type of setup.
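One way such a model could look is a simple linear regression of lap time on a 0/1 indicator for the setup. This is an assumption for illustration, since the notes do not specify the model form here. The data are made up and the function name `fit_simple_regression` is hypothetical.

```python
import statistics

# Hypothetical lap times (seconds); setup is coded 0 for A and 1 for B.
lap_times = [92.1, 91.8, 93.0, 92.4, 91.2, 91.9, 90.8, 91.5]
setup_b = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_simple_regression(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x + error."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_simple_regression(setup_b, lap_times)

# Group means, for comparison with the fitted coefficients.
mean_a = statistics.mean([t for t, s in zip(lap_times, setup_b) if s == 0])
mean_b = statistics.mean([t for t, s in zip(lap_times, setup_b) if s == 1])

print(round(intercept, 3), round(slope, 3), round(mean_b - mean_a, 3))
```

With a single 0/1 feature, the fitted intercept equals the mean lap time for setup A and the slope equals the mean difference between the setups, so the model contains the same comparison as the t-test while leaving room for more features.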

Seller: robinvanheesch1, Tilburg University