Summary Data Analytics for engineers (2IAB0)

Uploaded on: 08-11-2021

Written in: 2020/2021

The document is a summary written about the course data analytics for engineers. In the document, there is an explanation about every subject from the lectures and the assignments. The explanation is mostly with written text and pictures.


School, study and subject

Institution: Technische Universiteit Eindhoven (TUE)
Study: Psychology And Technology
Course: 2IAB0 Data Analytics (2IAB0)

Document information

Uploaded on: November 8, 2021
Number of pages: 32
Written in: 2020/2021
Type: Summary

Topics

data analytics
review
samenvatting
data analytics for engineers
summary

Content preview

Summary data analytics for engineers

EDA: exploratory data analysis
What is data?
- We use the word data for raw, unorganized numbers, facts etc. and the word
information for structured, meaningful and useful numbers and facts

Data forms / types
- Numerical data
o continuous data – data that can attain any value on a given measurement
scale
▪ interval data – continuous data for which only differences have
meaning; there is no fixed “zero point” (temperature / pH)
▪ ratio data – continuous data with a fixed “zero point”, so ratios
also make sense (budget for a movie)
o discrete data – data that can only attain certain values (integers)
- Categorical data
o data that has no intrinsic numerical value
▪ nominal: two or more outcomes that have no natural order (movie
genre, hair color)
▪ ordinal: two or more outcomes that have a natural order (movie rating)

Tables
- Tables are good
o for reading off values
o to draw attention to actual values
- Reference table: stores “all” data in a table so that it can be
looked up easily
- Demonstration table: a table to illustrate a point (so present just
enough data)

Tukey promoted the use of graphs to explore data before using more advanced methods
Key features of EDA:
- getting to know the data before doing further analysis
- extensively using graphs
- generating questions
- detecting errors in data

What do we expect?
- asking what to expect is also an important way to spot errors
- what are reasonable values?
- Given one value, what could be the others?

Dot plots/strip plots
- Good for showing actual values and structure of
numerical variables
- Not suitable for large data sets
- The jitter option may help avoid overlapping dots

Histogram: distribution of numerical data
- The range of data values is split into bins (intervals of values)
o You can choose the number of bins
o Or choose the bin width you would like to have
- The histogram shows the number of observations in the data
set for every bin
- Histograms are sensitive to bin width
o Bin width too small → too wiggly
o Bin width too large → too few details
- Rule of thumb for a sensible number of bins: √n, with n the number of observations
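
A minimal sketch of equal-width binning with the √n rule of thumb, in plain Python. The function name `histogram_counts` and the numbers are made up for illustration; rounding √n to the nearest integer is an assumption, since the rule itself does not say how to round.

```python
import math

def histogram_counts(values, num_bins=None):
    """Count observations per equal-width bin.

    If num_bins is not given, use the rule of thumb round(sqrt(n)).
    """
    n = len(values)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are equal (hi == lo).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts, width

budgets = [1, 2, 2, 3, 5, 8, 13, 21, 34]  # made-up values, n = 9 so 3 bins
counts, width = histogram_counts(budgets)
print(counts, width)
```

Passing a larger or smaller `num_bins` by hand shows the sensitivity to bin width: many bins give a wiggly picture, few bins hide detail.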

Cumulative histogram
- A cumulative histogram shows the counts or percentages of the current
bin together with the counts or percentages of all bins to the left
of that bin
- In the example we read off that approximately 97% of the movies have a
budget not exceeding 100 million dollars
- Useful to illustrate thresholds
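
As a small illustration, cumulative percentages can be obtained by accumulating the bin counts from left to right. This is a sketch with made-up counts; the function name `cumulative_percentages` is hypothetical.

```python
from itertools import accumulate

def cumulative_percentages(counts):
    """Percentage of observations in each bin plus all bins to its left."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]

bin_counts = [6, 2, 1, 1]  # made-up counts per bin, ordered left to right
cumulative = cumulative_percentages(bin_counts)
print(cumulative)
```

Reading a threshold then amounts to looking up the cumulative percentage of the bin whose right edge equals the threshold.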

Bar charts and histograms
- Bar charts are for categorical data, histograms are for numerical data

Scatter plot
- Scatter plots allow us to investigate relations between two numerical variables
- Here we can see that a higher budget typically means a
higher profit
- For movies with a smaller budget, there is a lot of uncertainty

Location summary statistics
- Plots help us to explore and give clues
- Numerical summaries like average help us to document essential features of data
sets
- One should use both plots and numerical summaries, they complement each other
- Numerical summaries are often called statistics

Summary statistics
- There are different types of summary statistics
o Level: location summary statistics → what are “typical” values
o Spread: scale summary statistics → how much do values vary?
o Relation: association summary statistics → how do values of different
quantities vary simultaneously

Location summary statistics
- Mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Median: middle number
o Odd number of observations: middle value when ordered from small to large
o Even number of observations: average of the two middle values when ordered from small to
large
- Mode: most frequently occurring value, may be non-unique
- The mean is sensitive to outliers, the median is not
- The mean can be misleading / difficult to interpret for non-symmetric distributions
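
The difference in outlier sensitivity can be checked with Python's standard `statistics` module. The numbers below are made up; this is only a sketch of the idea, not an example from the course.

```python
import statistics

values = [30, 32, 35, 35, 40]   # made-up observations
with_outlier = values + [400]   # the same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)  # even n: average of the two middle values
modes = statistics.multimode(values)            # a list, because the mode may be non-unique

print(mean_before, mean_after)      # the mean moves a lot
print(median_before, median_after)  # the median stays where it was
print(modes)
```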

Quartiles
- Re-order the data from small to large
- 1st quartile = cut off point for 25% of the data
- 2nd quartile = cut off point for 50% of the data = median
- 3rd quartile = cut off point for 75% of the data

Location statistics: percentiles
- P-th percentile – a cut-off point for P% of the data
- We define the 0th percentile to be the minimal element of the data set
- And the 100th percentile to be the maximal element of it
- For a data set with n observations, the 2nd smallest observation will be at the 100 / (n – 1)
percentile

Computing percentiles
- For a percentile P we compute its location in a data set of n observations:
o $L_P = 1 + \frac{P}{100}(n - 1)$
- The value of the P-th percentile is computed by linear interpolation between the two
observations around the computed location

- Example:
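
A minimal sketch of the location rule and the linear interpolation in Python. The function name `percentile` and the data are made up for illustration, and the code assumes the location is 1-based (location 1 is the smallest observation).

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) by linear interpolation."""
    data = sorted(values)
    n = len(data)
    location = 1 + (p / 100) * (n - 1)  # 1-based location in the ordered data
    lower = int(location)               # position of the observation just below
    fraction = location - lower         # how far to move towards the next one
    if lower >= n:                      # p = 100: the maximal element
        return data[-1]
    below, above = data[lower - 1], data[lower]
    return below + fraction * (above - below)

data = [10, 20, 30, 40, 50]   # made-up observations, n = 5
p0 = percentile(data, 0)      # minimal element
p25 = percentile(data, 25)    # 100 / (n - 1) = 25, so this is the 2nd smallest value
p40 = percentile(data, 40)    # location 2.6, between the 2nd and 3rd value
p100 = percentile(data, 100)  # maximal element
print(p0, p25, p40, p100)
```

With these numbers the 40th percentile lies 60% of the way from 20 to 30, so approximately 26.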

Scale statistics
- Range = max – min
- Interquartile range (IQR) = 3rd quartile – 1st quartile
- Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, with $\bar{x}$ the sample mean
- Sample standard deviation: $s = \sqrt{s^2}$
- Median absolute deviation (MAD) = median of the absolute deviations from the
median
- The higher these statistics, the more spread / variability in the data
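
The scale statistics listed here can be sketched with the standard `statistics` module. The function name `scale_summary` and the data are made up; `method="inclusive"` is chosen on the assumption that the quartiles should treat the minimum and maximum as the 0th and 100th percentile.

```python
import statistics

def scale_summary(values):
    """Return range, IQR, sample variance, sample standard deviation and MAD."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # divides by n - 1
        "std": statistics.stdev(values),          # square root of the variance
        "mad": statistics.median([abs(v - median) for v in values]),
    }

clean = scale_summary([10, 20, 30, 40, 50])    # made-up observations
spoiled = scale_summary([10, 20, 30, 40, 500]) # same data with one extreme value
print(clean)
print(spoiled)
```

Comparing the two dictionaries shows which statistics react to the extreme value: range, variance and standard deviation grow strongly, while IQR and MAD do not change.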

Remarks about scale summary statistics
- The standard deviation has the same unit as the data
- The variance is more convenient mathematically
- The range, variance and standard deviation are sensitive to “outliers”, IQR and MAD
are not
- The standard deviation can be used as a general unit to describe variability

Standardization (z-score normalization)
- The z-score transforms data from their original units into the universal statistical
unit of standard deviations from the mean
o $z_i = (x_i - \bar{x}) / s$, with $\bar{x}$ the sample mean and $s$ the sample standard deviation
- The mean value of the transformed data set is 0 and the standard deviation is 1
- Negative z-score → the value is below the mean
- Positive z-score → the value is above the mean
- Rule of thumb: observations with a z-score larger
than 2.5 are considered to be extreme (“outliers”)
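
A short sketch of standardization in Python. The function names and the data are made up, and applying the 2.5 threshold to the absolute z-score (so that very low values are flagged too) is an assumption rather than something stated in the notes.

```python
import statistics

def z_scores(values):
    """Transform values into numbers of standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / std for v in values]

def extreme_values(values, threshold=2.5):
    """Values whose absolute z-score exceeds the threshold (assumed two-sided)."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

data = [48, 50, 51, 49, 50, 52, 50, 49, 51, 50, 120]  # made-up observations
scores = z_scores(data)
flagged = extreme_values(data)
print(flagged)
```

Note that in a small data set a single observation can never reach a very large z-score, because the extreme value itself inflates the standard deviation.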

Association statistics
- Association statistics try to capture in a number how strong the relation between two
quantities is
- The sign of an association statistic indicates whether it is
o A positive association (higher → higher)
o A negative association (higher → lower)

Sample correlation
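- Sample covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- Sample correlation: $R_{xy} = \frac{s_{xy}}{s_x s_y}$, with $s_x$ and $s_y$ the sample standard deviations of x and y
- “No” relation: Rxy close to 0
- “Perfect” relation: Rxy close to -1 (negative correlation) or 1 (positive correlation)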
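
A minimal sketch of sample covariance and sample correlation in plain Python, using the usual n − 1 definitions. The function names and the numbers are made up for illustration.

```python
import math

def sample_covariance(xs, ys):
    """Average product of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

budget = [1, 2, 3, 4, 5]          # made-up values
profit_up = [2, 4, 6, 8, 10]      # rises with budget
profit_down = [10, 8, 6, 4, 2]    # falls with budget

cov_up = sample_covariance(budget, profit_up)
r_up = sample_correlation(budget, profit_up)
r_down = sample_correlation(budget, profit_down)
print(cov_up, r_up, r_down)
```

Unlike the covariance, the correlation has no unit and always lies between -1 and 1, which makes it easier to compare across data sets.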

Summary statistics and data types (nominal, ordinal, interval, ratio)

Advanced statistical plots

Typical distribution shapes
- unimodal distribution (1 peak)
- bimodal distribution (2 peaks, not necessarily of the same height),
possibly due to 2 different groups that, depending on the
context, should not be combined
- symmetric distribution: there is no precise definition of
symmetry
- right-skewed distribution (also known as positively skewed,
because of the long tail on the right); asymmetry may indicate
“extreme” values
o Mean > median and median closer to first quartile

Assessing the shape
- The fixed bins and the choice of bin locations make it difficult to
accurately assess the shape of a data set
- This can be overcome by letting the bin move along with the
data (gliding histogram)
- A more advanced way is to use a kernel function. The
gliding histogram corresponds to the uniform case, giving
equal weight to all the data points within the bin
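
The two ideas can be sketched side by side in plain Python. The function names and the data are made up, and the Gaussian kernel in the second function is only one possible choice of kernel, picked here as an assumption for illustration.

```python
import math

def gliding_histogram(values, x, width):
    """Uniform kernel: every point within width / 2 of x gets equal weight."""
    inside = sum(1 for v in values if abs(v - x) <= width / 2)
    return inside / (len(values) * width)

def gaussian_kernel_estimate(values, x, bandwidth):
    """Gaussian kernel (an assumed choice): nearby points weigh more than distant ones."""
    weights = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return weights / (len(values) * bandwidth * math.sqrt(2 * math.pi))

data = [1.0, 1.2, 1.4, 5.0]  # made-up observations

dense = gliding_histogram(data, 1.2, 1.0)  # bin centred where three points cluster
empty = gliding_histogram(data, 3.0, 1.0)  # bin centred where there are no points
smooth_dense = gaussian_kernel_estimate(data, 1.2, 0.5)
smooth_empty = gaussian_kernel_estimate(data, 3.0, 0.5)
print(dense, empty, smooth_dense, smooth_empty)
```

Evaluating either function on a fine grid of x values gives a curve for the shape of the data that does not depend on where fixed bins happen to start.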

Seller: maritvanderlit, Technische Universiteit Eindhoven
