
Summary of Statistics and Methodology

Pages: 108
Uploaded on: 27-03-2021
Written in: 2020/2021

Summary including all the effects in R and images of examples.



Week 1.

Video 1. Basics 1.

Statistical reasoning means thinking carefully about the conclusions we draw and about the precision
of the measurements in our tests.

Data scientists must scrutinize large amounts of data and extract useful knowledge. Data contain
raw information. To convert this information into actionable knowledge, data scientists apply various
data analytic techniques. When presenting the results of such analyses, data scientists must be careful
not to overstate their findings. Too much confidence in an uncertain finding could lead your employer to
waste large amounts of resources chasing data anomalies. Statistics offers us a way to protect ourselves
from ourselves.

Probability distributions quantify how likely it is to observe each possible value of some probabilistic
entity. Probability distributions are re-scaled frequency distributions. We can build up the intuition of
a probability density by beginning with a histogram (density = proportion). With an infinite number of
bins, a histogram smooths into a continuous curve.

• In a loose sense, each point on the curve gives the probability of observing the corresponding
X value in any given sample.
• The area under the curve (AUC) must equal 1.0.
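The step from histogram to density can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the sample is simulated, and `histogram_density` is a hypothetical helper, not something from the course material. Each bar's density is the proportion of observations in the bin divided by the bin width, so the total area is 1 no matter how many bins are used.

```python
import random

def histogram_density(values, n_bins):
    """Return (bin_width, densities) such that sum(density * bin_width) is 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    # Density = proportion of observations in the bin, divided by the bin width.
    densities = [c / (n * width) for c in counts]
    return width, densities

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(10_000)]  # hypothetical data

for n_bins in (10, 50, 200):
    bin_width, dens = histogram_density(sample, n_bins)
    total_area = sum(d * bin_width for d in dens)
    print(n_bins, round(total_area, 6))
```

Increasing `n_bins` makes the bars narrower and the outline smoother, while the total area stays fixed at 1.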




Video 2. Basics 2.

Statistical testing: in practice, we may want to distill the information in the preceding plot into a
simple statistic so that we can make a judgement. One way to distill this information and control for
uncertainty when generating knowledge is through statistical testing. When we conduct statistical
tests, we weight the estimated effect by the precision of the estimate. A common type of statistical
test, the Wald test (of which the t-test is an example), follows this pattern:
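In generic notation, a Wald-type statistic divides the distance between an estimate and its null-hypothesized value by the variability of the estimate, usually its standard error. The symbols below are assumptions chosen for illustration, not necessarily the notation used in the course:

```latex
T = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})}
```

Here $\hat{\theta}$ is the estimate, $\theta_0$ is the value claimed by the null hypothesis, and $\mathrm{SE}(\hat{\theta})$ measures how precisely $\hat{\theta}$ is estimated.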




If we want to test the null hypothesis of a zero mean difference, applying Wald test logic to control
for the uncertainty in our estimate results in the familiar t-test:
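One common way to write this statistic for two groups A and B is the unpooled (Welch) form shown here. This is an assumed variant for illustration; the course may use a pooled standard error instead:

```latex
t = \frac{(\bar{X}_A - \bar{X}_B) - 0}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}
```

The numerator is the estimated mean difference minus its null value of zero, and the denominator is the standard error of that difference.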

(Don't memorize the formulas!)

A larger test statistic (in absolute value) means the estimated effect is large relative to its
uncertainty, so we can be more certain about it.
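As a rough sketch of the calculation, the following Python computes a t-statistic for two sets of lap times. The lap times are made up, `welch_t` is a hypothetical helper name, and the unpooled standard error is an assumption rather than the course's stated formula.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """t-statistic for H0: mean(x) - mean(y) = 0, using an unpooled standard error."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Hypothetical lap times in seconds for two car setups.
setup_a = [91.2, 90.8, 92.1, 91.5, 90.9, 91.7]
setup_b = [90.1, 90.6, 89.9, 90.4, 90.8, 90.2]

t_stat = welch_t(setup_a, setup_b)
print(f"t = {t_stat:.2f}")
```

The same mean difference produces a larger t when the lap times within each setup vary less or when more laps are recorded, because both shrink the standard error.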

How do we use a test statistic to compare, for example, lap times?

• A test statistic by itself is just an arbitrary number.
• To conduct the test, we need to compare the test statistic to some objective reference.
• This objective reference needs to tell us something about how exceptional our test statistic
is.
• The specific reference we will be employing is known as a sampling distribution of the test
statistic.

A sampling distribution is simply the probability distribution of a statistic.

• The population is defined by an infinite sequence of repeated tests. The sampling distribution
quantifies the possible values of the test statistic over infinite repeated sampling.
• The area of a region under the curve represents the probability of observing a test statistic
within the corresponding interval.

Note that a sampling distribution is a slightly different concept than the distribution of a random
variable:

• The sampling distribution quantifies the possible values of a statistic (mean, t-statistic,
correlation coefficient, etc.).
• The distribution of a random variable quantifies the possible values of a variable (age,
gender, income, movie preference, etc.).

The t-test we’ve been considering is a way to summarize the comparison of two variable
distributions.

• The t-statistic also has a sampling distribution that quantifies the possible t-values we could get
if we repeatedly drew samples from the variables' distributions and re-computed the t-statistic
each time.

To quantify how exceptional our estimated t-statistic is, we compare the estimated value to a sampling
distribution of t-statistics assuming no effect. This distribution quantifies the null hypothesis, H0.
The special case of an H0 of no effect is called the nil-null. If our estimated statistic would be very
unusual in a population where H0 is true, we reject the null and claim a ‘statistically significant’ effect.
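The idea of a null sampling distribution can be approximated by simulation. In the sketch below, both groups are drawn from the same population, so H0 is true by construction, and the t-statistic is re-computed for every simulated pair of samples. The population values, sample size, number of repetitions, and the observed t of 2.0 are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_stat(x, y):
    """Mean difference divided by its (unpooled) standard error."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def null_t_values(n_per_group, n_reps, seed=0):
    """Draw both groups from the SAME population, so H0 (no difference) holds."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):
        x = [rng.gauss(90, 1) for _ in range(n_per_group)]
        y = [rng.gauss(90, 1) for _ in range(n_per_group)]
        values.append(t_stat(x, y))
    return values

null_ts = null_t_values(n_per_group=30, n_reps=5000)
observed_t = 2.0  # hypothetical estimated t-statistic

# How unusual is the observed value among t-values generated under H0?
share_as_large = sum(t >= observed_t for t in null_ts) / len(null_ts)
print(f"Share of null t-values >= {observed_t}: {share_as_large:.3f}")
```

If only a small share of the simulated null t-values is at least as large as the observed one, the observed statistic is unusual under H0.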

We can find the probability associated with a range of values by computing the area of the
corresponding slice from the distribution.




By calculating the area in the null distribution that exceeds our estimated test statistic, we can
compute the probability of observing the given test statistic, or one more extreme, if the H0 were
true. In other words, we can compute the probability of having sampled the data we observed, or
more unusual data, from a population wherein there is no true mean difference in lap times. This
value is the infamous p-value.




The preceding test is one-tailed. We use a one-tailed test when we have directional hypotheses. Since
we didn’t expect setup B to outperform setup A (we had no directional hypothesis), we need to use a
two-tailed test.

Consider the one-tailed test for our estimated test statistic of t = 1.86 that produces a p-value of p =
0.032:

• We cannot say that there is a 0.032 probability that the true mean difference is greater than
zero.
• We cannot say that there is a 0.032 probability that the Ha is true.
• We cannot say that there is a 0.032 probability that the null hypothesis is false.
• We cannot say that there is a 0.032 probability of replicating the observed effect in future
studies.

How do we actually interpret p-values? The p-value tells us P(t ≥ T | H0). But what we really want to
know is P(H0 | t ≥ T). All that we can say is that there is a 0.032 probability of observing a test
statistic at least as large as T, if H0 is true. Our test uses the same logic as proof by contradiction.
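The tail-area reading of the p-value can be sketched with Python's standard library. As an assumption, the null distribution is approximated here by a standard normal, which is reasonable only for large samples. The notes do not state the degrees of freedom, so the exact t distribution behind the reported 0.032 cannot be reproduced.

```python
from statistics import NormalDist

t_observed = 1.86  # test statistic from the example above

# Large-sample approximation: treat the null distribution of t as standard normal.
null_dist = NormalDist(mu=0, sigma=1)

# One-tailed: area under the null curve to the right of the observed statistic.
p_one_tailed = 1 - null_dist.cdf(t_observed)

# Two-tailed: area in both tails beyond |t|, used without a directional hypothesis.
p_two_tailed = 2 * (1 - null_dist.cdf(abs(t_observed)))

print(f"one-tailed p ~ {p_one_tailed:.3f}")
print(f"two-tailed p ~ {p_two_tailed:.3f}")
```

Under this approximation the one-tailed area is about 0.031, close to the reported 0.032, and the two-tailed area is twice as large. Both numbers are probabilities of the data given H0, not probabilities that H0 is true.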

The probability of observing any individual point on a continuous distribution is exactly zero.
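This follows from reading probabilities as areas under a density curve. In assumed notation, with $f$ the density of a continuous variable $X$:

```latex
P(a \le X \le b) = \int_a^b f(x)\,dx,
\qquad
P(X = a) = \int_a^a f(x)\,dx = 0
```

A single point is an interval of zero width, so the area above it is zero. This is why p-values are defined over a range ("at least as large as") rather than for one exact value.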

Video 3. Basics 3.

Statistical testing is a very useful tool, but it quickly reaches a limit. In experimental contexts,
real-world messiness is controlled through random assignment, and statistical testing is a sufficient
method of knowledge generation. Data scientists rarely have the luxury of being able to conduct
experiments. Data scientists work with messy observational data and usually don’t have questions
that lend themselves to rigorous testing. Data scientists need statistical modeling.

The idea of statistical modeling: modelers attempt to build a mathematical representation of the
interesting aspects of a data distribution. The model succinctly describes whatever system is being
analyzed. Beginning with a model ensures that we are learning the important features of a
distribution. The modeling approach is especially important in messy data science applications
where clear a priori hypotheses are rare.

To apply a modeling approach to our example problem, we consider the combined distribution of lap
times. The model we construct will explain variation in lap times based on interesting features. In this
simple case, the only feature we consider is the type of setup.
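A minimal sketch of such a model, assuming a simple linear regression of lap time on a dummy-coded setup indicator. The lap times and variable names are hypothetical, and the course may specify the model differently.

```python
from statistics import mean

# Hypothetical lap times (seconds), each labelled with the setup that produced it.
laps = [("A", 91.2), ("A", 90.8), ("A", 92.1),
        ("B", 90.1), ("B", 90.6), ("B", 89.9)]

# Dummy-code the single feature: 0 = setup A (reference group), 1 = setup B.
x = [0 if setup == "A" else 1 for setup, _ in laps]
y = [time for _, time in laps]

# Ordinary least squares for: lap_time = intercept + slope * setup_is_b
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print(f"intercept (mean lap time, setup A): {intercept:.2f}")
print(f"slope (mean difference, B minus A): {slope:.2f}")
```

With one binary feature, the fitted intercept equals the mean lap time for setup A and the slope equals the mean difference between setups, so the model carries the same information as the two-group comparison while leaving room to add further features.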

robinvanheesch1 Tilburg University