Summary

Summary 2IAB0 + questions and correct answers

Pages: 51
Uploaded on: 02-11-2021
Written in: 2020/2021

Full summary + all questions including the correct answers

Lectures
EDA (Exploratory Data Analysis)
Data types and representations
What is data?
- Cambridge dictionary: “information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an electronic form that can be
stored and used by a computer”
- In this course, we will say “data” when referring to raw, unorganized numbers, facts, etc., and
use the word “information” for structured, meaningful and useful numbers and facts.

Numerical data
- Continuous data: data that can attain any value on a given measurement scale
o Interval data: continuous data for which only differences have meaning; there is no “zero
point”, so ratios are not meaningful
o Ratio data: continuous data for which ratios make sense; there is a fixed “zero point”
- Discrete data: data that can only attain certain values (e.g. integers)
Categorical data
- Nominal data: two or more outcomes that have no natural order
- Ordinal data: two or more outcomes that have a natural order

More on data
- The difference between continuous and discrete data is that discrete data have “gaps”
- Temperature in degrees Celsius is not ratio data, since 20 °C is not twice as hot as 10 °C (this is
because 0 °C is not a fixed minimum)
- Lengths are ratio data, since 0 m is an absolute minimum and 20 m is twice as long as 10 m.
- Categorical data are sometimes labelled with numbers (e.g., bad = 1, neutral = 2, good = 3). Such
labels do not turn the data into numerical data, since the numbers are only labels and carry no numerical meaning.
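The Celsius remark above can be checked with a little arithmetic. The sketch below is only an illustration: it brings in the kelvin scale (not mentioned in these notes) as a temperature scale that does have an absolute zero, and the function name is hypothetical.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to kelvin, a scale with an absolute zero."""
    return temp_c + 273.15


# Ratios taken on the Celsius scale are not physically meaningful ...
ratio_celsius = 20.0 / 10.0
# ... because the same two temperatures give a very different ratio in kelvin.
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences, however, are the same on both scales (interval data).
diff_celsius = 20.0 - 10.0
diff_kelvin = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

print(ratio_celsius, round(ratio_kelvin, 3))
print(diff_celsius, round(diff_kelvin, 3))
```

The ratio changes with the choice of zero point, while the difference does not, which is what separates interval data from ratio data.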


Tables are good for two reasons
- For reading off values
- To draw attention to actual values

Two kinds of tables
- Reference table: store “all” data in a table so that it can be looked up easily
- Demonstration table: table to illustrate a point (so present just enough data)

Questions you should ask when investigating tables
- What kind of data types and units of measurement?
- Do the values make sense when you compare columns or rows?
- Which column/row has largest values?
- Which column/row has smallest values?





Exploratory Data Analysis
Numerical quantities focus on expected values, graphical summaries (graphs/plots) on unexpected
values (quote by John Tukey).
Tukey promoted using graphs to explore data before applying more advanced methods (he also invented
some new types of graphs).

Key features of EDA:
- Getting to know the data before doing further analysis
- Extensively using graphs
- Generating questions
- Detecting errors in data
o Important ways to spot errors:
▪ Asking yourself: what are reasonable values? (for example: lengths and ages of
humans)
▪ Given one value, what could the others be? (for example: given the time
walked, what distance can possibly be covered?)
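The two error checks above could be sketched as simple plausibility filters. Everything specific here is an assumption made for illustration: the function names, the age bounds of 0 to 120 years, and the maximum walking speed of 7 km/h are not taken from the course.

```python
def implausible_ages(ages, low=0, high=120):
    """Return the ages that fall outside an assumed reasonable range."""
    return [age for age in ages if not (low <= age <= high)]


def implausible_walks(walks, max_speed_kmh=7.0):
    """Return (hours, km) pairs whose implied speed exceeds an assumed limit.

    A non-positive duration is also flagged, since no distance can be
    covered in zero or negative time.
    """
    return [
        (hours, km)
        for hours, km in walks
        if hours <= 0 or km / hours > max_speed_kmh
    ]


print(implausible_ages([23, -4, 41, 230]))
print(implausible_walks([(1.0, 5.0), (0.5, 12.0)]))
```

Flagged values are candidates for a closer look, not proof of an error; the thresholds should come from knowledge of the data at hand.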

Elementary statistical plots
Dot plots / strip plots
- Good for showing actual values and structure of numerical variables
- Not suitable for large datasets
- The jitter option (i.e. slight changes in horizontal placement) may help to avoid overlapping dots
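As a rough sketch of what the jitter option does, the helper below adds a small random horizontal offset to each dot. The function name, the offset size of 0.1, and the fixed seed are assumptions for illustration; plotting libraries implement jitter in their own ways.

```python
import random


def jitter(positions, amount=0.1, seed=0):
    """Shift each horizontal position by a random offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return [p + rng.uniform(-amount, amount) for p in positions]


# Five dots that would overlap at horizontal position 1.0 in a strip plot.
original = [1.0, 1.0, 1.0, 1.0, 1.0]
shifted = jitter(original)
print(shifted)
```

Only the placement changes; the plotted data values stay the same.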

Histogram
- The range of data values is split in bins (intervals of values)
o You can choose the number of bins, or
o Choose the bin width you would like to have
- The histogram shows the number of observations in the data set for every bin (there are
versions that show percentages)
- Histograms are sensitive to bin width
o Bin width too small → too wiggly
o Bin width too large → too few details
- Rule of thumb: a sensible number of bins is approximately sqrt(n), where n is the number of observations
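A minimal sketch of the binning described above, assuming equal-width bins between the minimum and maximum and the sqrt(n) rule of thumb when no bin count is given. The function name and the choice to put the maximum value into the last bin are assumptions, not the course's prescription.

```python
import math


def histogram_counts(values, num_bins=None):
    """Count observations per equal-width bin; default bin count is about sqrt(n)."""
    n = len(values)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are identical (hi == lo).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return lo, width, counts


lo, width, counts = histogram_counts([1, 2, 2, 3, 3, 3, 4, 4, 5])
print(lo, width, counts)
```

Calling the function with a different `num_bins` on the same data shows how sensitive the picture is to bin width.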

Cumulative histogram
- A cumulative histogram shows counts or percentages of the current bin together with the
counts or percentages of all bins to the left of that bin.
- Useful to illustrate thresholds
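Given per-bin counts, the cumulative version is a running total. The sketch below uses `itertools.accumulate` from the standard library; the function names and the example counts are made up for illustration.

```python
from itertools import accumulate


def cumulative_counts(counts):
    """Running total: each entry is its bin's count plus all bins to the left."""
    return list(accumulate(counts))


def cumulative_percentages(counts):
    """Same running total, expressed as a percentage of all observations."""
    total = sum(counts)
    return [100.0 * c / total for c in accumulate(counts)]


bin_counts = [2, 5, 3]
print(cumulative_counts(bin_counts))
print(cumulative_percentages(bin_counts))
```

Reading off the value at a bin's right edge tells you how many (or what share of) observations lie at or below that threshold.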

Bar charts versus histograms
Bar charts are for categorical data, histograms for numerical data.

Scatter plot
- Scatter plots allow you to investigate relations between two numerical variables





Location summary statistics
Going beyond plots
- Plots help us to explore and give clues
- Numerical summaries like averages help us to document essential features of data sets
- One should use both plots and numerical summaries. They complement each other
- Numerical summaries are often called statistics (note the double meaning of the word: both a
scientific field and computed numbers)

Summary statistics
There are different types of summary statistics
- Level: location summary statistics → what are “typical” values?
- Spread: scale summary statistics → how much do values vary?
- Relation: association summary statistics → how do values of different quantities vary
simultaneously?

Location summary statistics (level)
- Mean (average)
- Median:
o Odd # of observations: middle value when ordered from small to large
o Even # of observations: average of the two middle values when ordered from small to large
- Mode: most frequently occurring value, may be non-unique

The mean is sensitive to outliers; the median is not. The mean can be misleading or difficult to
interpret for non-symmetric distributions.
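The sensitivity to outliers can be seen on a small made-up dataset, using Python's standard `statistics` module. The data values are invented for illustration.

```python
import statistics

data = [2, 3, 3, 4, 5]
with_outlier = data + [100]  # add one extreme value

mean_before = statistics.mean(data)
median_before = statistics.median(data)
modes = statistics.multimode(data)  # returns a list, since the mode may be non-unique

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # even count: average of two middle values

print(mean_before, median_before, modes)
print(mean_after, median_after)
```

The single extreme value moves the mean from 3.4 to 19.5, while the median only moves from 3 to 3.5.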

Quartiles
Re-order the data from small to large:
- 1st quartile = cut-off point for 25% of the data
- 2nd quartile = cut-off point for 50% of the data (=median)
- 3rd quartile = cut-off point for 75% of the data

Location statistics: percentiles
- Pth percentile – a cut-off point for P% of data
- We define the 0th percentile to be the minimal element of the dataset and the 100th percentile to
be the maximal element.
- For a dataset with n observations, the 2nd smallest observation is at the $\frac{100}{n-1}$th percentile
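One way to make the definitions above concrete: if the smallest observation sits at the 0th percentile and the largest at the 100th, with equal steps in between, then the k-th smallest of n observations sits at the 100(k−1)/(n−1)th percentile. The sketch below follows that reading and, as an added assumption, uses linear interpolation for percentiles that fall between two observations. The function names are hypothetical and the course's own computation rule may differ.

```python
def percentile_rank(k, n):
    """Percentile of the k-th smallest observation (k = 1..n) among n > 1 observations."""
    return 100.0 * (k - 1) / (n - 1)


def percentile(values, p):
    """P-th percentile with linear interpolation between neighbouring observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return float(ordered[0])
    pos = p / 100.0 * (n - 1)      # fractional index into the ordered data
    lower = int(pos)
    upper = min(lower + 1, n - 1)  # stay inside the list at p = 100
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


values = [10, 20, 30, 40, 50]
print(percentile_rank(2, len(values)))
print(percentile(values, 25), percentile(values, 50), percentile(values, 75))
```

Under this convention the quartiles are simply the 25th, 50th and 75th percentiles.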





Computing percentiles




Scale and Association Statistics
Scale statistics
Range: max – min
Interquartile range (IQR): 3rd quartile – 1st quartile
Sample variance: $s^2 = \frac{\sum_{x \in \text{dataset}} (x - \bar{x})^2}{N - 1}$, where $\bar{x}$ is the sample mean and $N$ the number of observations
Sample standard deviation: $s = \sqrt{\frac{\sum_{x \in \text{dataset}} (x - \bar{x})^2}{N - 1}}$
Median absolute deviation (MAD): median of the absolute deviations from the median
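The scale statistics listed above can be computed with Python's standard `statistics` module, as in the sketch below. Two things are assumptions: the function name, and the quartile convention (`method="inclusive"`), since quartiles can be computed in several slightly different ways. MAD is computed here as the median of the absolute deviations from the median.

```python
import statistics


def scale_statistics(values):
    """Return range, IQR, sample variance, sample standard deviation and MAD."""
    ordered = sorted(values)
    # Quartile convention is an assumption; other methods give slightly different cut points.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    med = statistics.median(ordered)
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(ordered),  # divides by N - 1
        "stdev": statistics.stdev(ordered),        # square root of the variance
        "mad": statistics.median([abs(x - med) for x in ordered]),
    }


stats = scale_statistics([2, 4, 4, 5, 7, 9, 11])
for name, value in stats.items():
    print(name, value)
```

Note that `statistics.variance` and `statistics.stdev` use the N − 1 denominator, matching the sample formulas.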




The higher these statistics, the more spread/variability in the data

Remarks about scale summary statistics
- The standard deviation has the same (physical) unit as the data
- The variance is more convenient mathematically
- The range, variance and standard deviation are sensitive to outliers; the IQR and MAD are not
