
Complete summary theory data analytics

Pages: 22
Uploaded: 3 April 2022
Written in: 2021/2022

Summary of the theory from all the lectures for data analytics for engineers.

2IAB0 - Data Analytics for Engineers
Week 1: EDA
EDA = Exploratory Data Analysis

Data types:
1. Categorical data - data that has no intrinsic numerical value
• Nominal: two or more outcomes that have no natural order
• Ordinal: two or more outcomes that have a natural order

2. Numerical data - data that has an intrinsic numerical value
• Continuous data: data that can attain any value on a given measurement scale
- Interval data: equal intervals represent equal differences
- Ratio data: both differences and ratios make sense; it has a fixed ‘zero point’
• Discrete data: data that can only attain certain values


Tables:
Reference table: store ‘all’ data in a table so that it can be looked up easily
Demonstration table: table to illustrate a point (so present just enough data, or a specific
summary)


Plots:
• Dot plots
- Good for showing actual values and structure of numerical values
- Not suitable for large data sets
- Jitter option may help to avoid overlapping dots
• Bar chart
- For comparing some numerical characteristics of groups defined by categories of
categorical data
- Levels of categorical variable are on the x-axis, numerical values on the y-axis
• Histogram
- Not convenient for large data sets
- Range of data is split in bins (= intervals of values)
- Histogram shows the number of observations in the data set for every bin
- Rule of thumb for choosing a sensible number of bins: ≈ $\sqrt{n}$, where n is the number of
data points
• Cumulative histogram
- Shows counts or percentages of the current bin together with the counts or percentages
of all bins to the left of that bin
• Scatter plot
- Allows you to investigate the relation between two numerical variables
! Bar charts are for categorical data, histograms for numerical data
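The binning steps for a histogram and a cumulative histogram can be sketched in a few lines of Python. This is only an illustration: the function names `histogram` and `cumulative` and the data values are made up, equal-width bins are assumed, and the bin count follows the square-root rule of thumb.

```python
import math

def histogram(data, n_bins=None):
    """Count observations per equal-width bin; default bin count is about sqrt(n)."""
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(len(data))))
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * n_bins
    for x in data:
        # The maximum value goes into the last bin instead of opening a new one.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def cumulative(counts):
    """Running total: each bin together with all bins to its left."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 12, 15, 15, 16, 20]  # made-up values
counts = histogram(data)      # 16 points, so 4 bins
cum = cumulative(counts)      # last entry equals len(data)
```

The last entry of the cumulative counts always equals the number of data points, which is a quick sanity check on any binning.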


Types of summary statistics:
• Level: location summary statistics
• Spread: scale summary statistics
• Relation: association summary statistics



Location summary statistics:
1. Mean (average): $\frac{1}{n}\sum_{i=1}^{n} x_i$
2. Median:
- Odd number of observations: middle value when ordered from small to large
- Even number of observations: average of two middle values when ordered from small to
large
3. Mode: most frequently occurring value, may be non-unique

! The mean is sensitive to ‘outliers’ => the mean can be misleading / difficult to interpret for non-symmetric data sets
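A small sketch with Python's standard `statistics` module shows the three location statistics and how a single extreme value moves the mean but hardly the median. The data values are made up for illustration.

```python
import statistics

values = [2, 3, 3, 4, 5]        # made-up, roughly symmetric data
with_outlier = values + [100]   # same data plus one extreme value

mean_clean = statistics.mean(values)          # 3.4
median_clean = statistics.median(values)      # 3
modes = statistics.multimode(values)          # [3]; a list, since the mode may be non-unique

mean_out = statistics.mean(with_outlier)      # 19.5, pulled towards the outlier
median_out = statistics.median(with_outlier)  # 3.5, average of the two middle values
```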


Quartiles:
- 1st quartile = cut-off point for 25% of the data
- 2nd quartile = cut-off point for 50% of the data (= median)
- 3rd quartile = cut-off point for 75% of the data

Percentiles:
- Pth percentile: a cut-off point for P% of the data
- We define the 0th percentile to be the smallest element of the data set and the 100th percentile
to be the largest element of it
- For a data set with n observations, the 2nd smallest observation is at the 100/(n − 1)th
percentile
- For percentile P we compute its location in a data set of n
observations: L_P = 1 + (P/100)*(n − 1)
- Computing Pth percentile value by linear interpolation:
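The interpolation formula itself is missing from this preview. The sketch below assumes the usual reading: take the location L_P = 1 + (P/100)(n − 1) in the sorted data, and when L_P is not a whole number, interpolate linearly between the two neighbouring observations. The function name `percentile` and the data are made up.

```python
import math

def percentile(data, p):
    """P-th percentile via the location L_P = 1 + (P/100)(n - 1) and linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)   # 1-based location in the sorted data
    lower = math.floor(loc)         # index of the observation just below the location
    frac = loc - lower              # how far the location lies towards the next observation
    if lower >= n:                  # P = 100: the largest observation
        return xs[-1]
    # Assumed interpolation step: move a fraction `frac` of the gap to the next value.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [10, 20, 30, 40, 50]     # made-up values
p0 = percentile(data, 0)        # smallest element
p25 = percentile(data, 25)      # location 2, the 2nd smallest observation
p40 = percentile(data, 40)      # location 2.6, between 20 and 30
p100 = percentile(data, 100)    # largest element
```

With n = 5 the 2nd smallest observation sits at the 100/(n − 1) = 25th percentile, which agrees with the rule stated above.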




Scale statistics:
• Range = max - min
• Interquartile range (IQR) = 3rd quartile - 1st quartile
• Sample variance: $S^2$ or $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}$
• Sample standard deviation: $S$ or $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}$
• Median absolute deviation (MAD): median of the absolute deviations from the median

The higher these statistics, the more the spread/variability in the data.
! The range, variance and standard deviation are sensitive to ‘outliers’; IQR and MAD are not.
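The scale statistics can be compared on two made-up data sets that differ only in their largest value. This is a sketch, not the course's reference implementation: `spread_summary` is a hypothetical helper, and the quartiles use `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between the smallest and largest observation.

```python
import statistics

def spread_summary(data):
    """Range, IQR, sample variance, sample standard deviation and MAD of a data set."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    med = statistics.median(data)
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": statistics.variance(data),  # divides by n - 1
        "std": statistics.stdev(data),          # square root of the sample variance
        "mad": statistics.median(abs(x - med) for x in data),
    }

clean = spread_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
dirty = spread_summary([1, 2, 3, 4, 5, 6, 7, 8, 90])  # one extreme value
```

In this example the range, variance and standard deviation grow sharply once the extreme value is present, while the IQR and MAD are identical for both data sets.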


Standardization:
The z-score converts a value from its original units into a universal statistical unit: the number
of standard deviations it lies from the mean, $z_i = (x_i - \mu)/\sigma$. The z-scores of a data
set have mean 0 and standard deviation 1.




Negative z-score: value is below mean
Positive z-score: value is above mean

Rule of thumb: observations whose z-score is larger than 2.5 in absolute value are considered to be ‘outliers’.
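A minimal sketch of standardization and the outlier rule of thumb, using the sample standard deviation (n − 1). The helper `z_scores` and the data are made up, and comparing the absolute z-score with 2.5 (so that values far below the mean are also flagged) is an assumption of this sketch.

```python
import statistics

def z_scores(data):
    """Express every value as its number of standard deviations from the mean."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation (n - 1)
    return [(x - mu) / sigma for x in data]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 40]  # made-up values with one extreme point
z = z_scores(data)

# Assumption: flag values on both sides of the mean via the absolute z-score.
outliers = [x for x, zi in zip(data, z) if abs(zi) > 2.5]
```

Here only the value 40 is flagged; the z-scores themselves have mean 0 and standard deviation 1, up to rounding.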


Association statistics:
Association statistics try to capture in a number how strong the relation between two quantities is.
The sign of an association statistic indicates whether it is:
- A positive association
- A negative association
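The notes do not name a particular association statistic in this preview. As an illustration only, the sketch below uses the Pearson correlation coefficient, a common choice whose sign shows the direction of the association. The function `pearson` and all data values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
r_up = pearson(xs, [2, 4, 5, 4, 6])     # y tends to rise with x: positive sign
r_down = pearson(xs, [10, 8, 7, 5, 1])  # y tends to fall as x rises: negative sign
```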

Box and whisker plot:
• Median
• 1st and 3rd quartiles
• Min and max values
• Endpoints of the whiskers show the minimum/maximum if these lie within 1.5 IQR of the nearest
quartile (1st/3rd)
• Points further away than 1.5 IQR from the nearest quartile are outliers
• Yields a quick indication of symmetry
• Indicates whether there are outliers
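The numbers behind a box-and-whisker plot can be computed directly. In this sketch `box_plot_summary` is a hypothetical helper and the data are made up. It assumes the common convention that a whisker ends at the most extreme observation still within 1.5 IQR of the nearest quartile, and it computes quartiles with `statistics.quantiles(..., method="inclusive")`.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker endpoints and outliers according to the 1.5 IQR rule."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0],    # equals the minimum if it lies within the fence
        "whisker_high": inside[-1],  # equals the maximum if it lies within the fence
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up values
```

For this data set the upper whisker stops at 8 and the value 30 is reported as an outlier, because it lies more than 1.5 IQR above the 3rd quartile.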


Kernel density plots (improved histograms):
• Choose a bandwidth to be taken around each data point
• Generate a kernel with the chosen bandwidth for every data point
• Count the data points weighted by the kernel
• There is no direct interpretation of the scale of the y-axis!
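The steps above can be sketched as follows. The notes do not say which kernel is used, so a Gaussian kernel is assumed here. The function name `gaussian_kde`, the bandwidth and the data are made up for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Every data point contributes a kernel centred on it; nearby points weigh more.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

data = [1.0, 1.2, 1.5, 4.0, 4.1]   # made-up values forming two clusters
density = gaussian_kde(data, bandwidth=0.5)

near_cluster = density(1.2)   # high: three data points are close by
between = density(2.8)        # low: far from both clusters
```

A smaller bandwidth gives a more wiggly curve and a larger one a smoother curve. The y-values are densities rather than counts: only the area under the curve, which equals 1, has a direct meaning.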


Violin plot:
• Combination of box-and-whisker plot and kernel density plot:
- Global shape of box-and-whisker plot
- Local details of kernel density plot


Typical distribution shapes:
- Unimodal distribution: 1 peak
- Bimodal distribution: 2 peaks
- Symmetric distribution
- Right-skewed distribution: long tail on the right, asymmetry may indicate ‘extreme values’




Seller: jbtue, Technische Universiteit Eindhoven