Summary Data Analytics (INFODA)

Uploaded on: 27-05-2021
Written in: 2020/2021

All material for the Data Analytics (DA) exam, including figures etc.

SV Data Analytics

Lecture 1 – Introduction
Knowledge Discovery in Databases (KDD)
➢ The process of (semi-)automatic extraction of knowledge from databases; the process of
discovering useful knowledge from a collection of data, where the knowledge is
o Valid
o Previously unknown
o Potentially useful
➢ Interdisciplinary field:
o Database systems
▪ Scalability for large datasets – integration from different sources – novel data
types (text)
o Statistics
▪ Probabilistic knowledge – model-based inferences – evaluation of knowledge
o Machine learning
▪ Different paradigms of learning – supervised learning – hypothesis spaces
and search strategies
➢ KDD Process Model




Visual Analytics
➢ Data → visualization → gain insights
➢ Importance of visualization: make both calculations and
graphs. Both sorts of output should be studied; each will
contribute to understanding.
➢ Goals of visualization:
o Presentation
▪ Starting point: facts to be presented are fixed a priori
▪ Process: choice of appropriate presentation techniques
▪ Result: high-quality visualization of the data to present facts
o Confirmatory analysis
▪ Starting point: hypotheses about the data
▪ Process: goal-oriented examination of the hypotheses
▪ Result: visualization of data to confirm or reject the hypotheses
o Exploratory analysis
▪ Starting point: no hypotheses about the data
▪ Process: interactive, usually undirected search for structures, trends
▪ Result: visualization of data to lead to hypotheses about the data
➢ Visualization: the process of presenting data in a
form that allows rapid understanding of
relationships and findings that are not readily
evident from raw data
➢ Two ways of going through the conceptual pipeline (data
→ visualization OR data → models)
➢ Sense making loop → not a one-way street, but a loop => knowledge generation loop


Lecture 2 – Data Foundations 1
Types of data
➢ Data can be gathered/ generated from many sources. Independent of the source, each data
point has a data type
o Nominal & ordinal => categorical or discrete values
o Numeric => continuous scale
➢ Nominal
o Discrete; values differ, but there is no specific ranking → classification without order
(ID, gender)
o No quantitative relationship between categories
➢ Ordinal
o Values can be compared and rank-ordered (e.g. noise comparison: one sound is louder
than another)
o Distances between values can be arbitrary and do not have any meaning (e.g. smoking
habits)
➢ Numeric
o Attributes can be rank-ordered (e.g. difference in height)
o Distances between values have a meaning
o Calculations with the data are possible (height of X = height of Y+5/2): mathematical
operations on the values are meaningful (age, time)
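
The three data types differ in which operations are meaningful. A minimal sketch of that idea follows; the smoking scale and the height values are hypothetical and chosen only for illustration.

```python
# Nominal: only equality is meaningful, there is no order.
gender_a, gender_b = "female", "male"
same_category = gender_a == gender_b

# Ordinal: values can be ranked, but the distance between ranks has no meaning.
SMOKING_ORDER = ["never", "occasionally", "daily"]  # hypothetical scale

def smokes_more(a, b):
    """Compare two ordinal values by their position in the scale."""
    return SMOKING_ORDER.index(a) > SMOKING_ORDER.index(b)

# Numeric: rank order, distances, and arithmetic are all meaningful.
height_x, height_y = 180.0, 170.0  # hypothetical heights in cm
difference = height_x - height_y
average = (height_x + height_y) / 2
```

Subtracting two positions in `SMOKING_ORDER` would run without error, but the result would not mean anything, which is the point of treating the attribute as ordinal.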

Typical data classes
➢ Scalar: an individual number in a data record
➢ Multivariate and multidimensional data: multiple variables within a single record can
represent a composite data item; not always easy to calculate a difference (e.g. gender and
weight comparison)
➢ Vector: it is common to treat the vector as a whole, e.g. a telephone number that can be
divided into country/region code
➢ Network data: vertices on a surface are connected to their neighbors via edges
➢ Hierarchical data: relationships between nodes in a hierarchy can be specified by links
➢ Time-series data: a complex way of looking into data; time has the widest range of possible
values
➢ Example: ducks
o Nominal: gender
o Numeric: amount
o Vector: location
o Network: parent/child
o Hierarchical: leader/follower
o Time series: movement

Data preprocessing: Data cleaning
➢ Rubbish in, rubbish out: always clean the data (e.g. treat missing
values) → low-quality data will lead to low-quality mining results
➢ Data cleaning → missing values:
o ignore the tuple (remove the entire row)
▪ + easily done, no computational effort
▪ - loss of information, unnecessary if the attribute is not needed
o fill in the missing value manually
▪ + effective for small datasets
▪ - need to know the value, time consuming, not feasible with large datasets
o use a global constant (e.g. -1; do not use this value in calculations for the algorithm)
▪ + can be easily done; it may be interesting to know that the value was missing
o use the attribute mean
▪ + simple to implement
▪ - not the most accurate approximation of the value
o use the most probable value
▪ + most accurate approximation of the value
▪ - most computational effort
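
Three of the missing-value strategies above can be sketched in a few lines. The records, the attribute name `age`, and the helper names are hypothetical; `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},
    {"id": 3, "age": 30},
    {"id": 4, "age": 40},
]

def ignore_tuples(rows, attr):
    """Drop every row whose value for attr is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant=-1):
    """Replace missing values with a global constant (a sentinel, not a real value)."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]

dropped = ignore_tuples(rows, "age")      # 3 rows remain
with_constant = fill_constant(rows, "age")
with_mean = fill_mean(rows, "age")        # missing age becomes 30
```

Each helper returns new records and leaves `rows` unchanged, so the strategies can be compared side by side on the same data.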
➢ Data cleaning → noisy data: a random error or variance in a measured variable
o Smooth out the noise!
o Systematic error: sensor always senses a little bit higher – same frequency curve but
shifts to a direction. Average is the same




o Handling noisy data:
▪ Binning: sort data and partition into (equi-depth) bins and then
smooth by bin means, bin median, bin boundaries, etc.
• Equal-width binning:
o Divides the range into N intervals of equal size
o Width of the intervals: width = (max-min)/ N
o Simple
o Outliers may dominate result
• Equal-depth binning:
o Divides the range into N intervals
o Each interval contains approximately the same
number of records
o Skewed data is also handled well
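
Both binning schemes, followed by smoothing by bin means, can be sketched as below. The `prices` list and the function names are hypothetical; the equal-width variant uses the formula width = (max-min)/N from above.

```python
from statistics import mean

def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum would land at index n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of approximately equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer boundaries spread any remainder across the bins.
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the mean of its bin; empty bins are dropped."""
    return [[mean(b)] * len(b) for b in bins if b]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical data
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
smoothed = smooth_by_means(depth_bins)
```

On this data the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 values each, which illustrates why equal-depth binning copes better with unevenly spread data.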




▪ Regression: smooth out noise by fitting a regression function
o Assume our data can be modelled ‘easily’
o Global linear regression models may not be adequate for
“nonlinear” data
o The regression model can be static or dynamic
▪ Static: using only the historical data to calculate the
function
▪ Dynamic: also use new data to adapt the model
• Linear regression
o Tries to discover the parameters of the straight-line equation
that best fits the data points → the line that minimizes the squared
error over all data points




➢ B = 0.6857 is a very slight slope.
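
A least-squares line can be computed in closed form. The sketch below assumes the line is written as y = a + b*x and uses hypothetical data points; it does not reproduce the example behind the slope value quoted above.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def smooth(xs, ys):
    """Replace each noisy y by the value the fitted line predicts for its x."""
    intercept, slope = fit_line(xs, ys)
    return [intercept + slope * x for x in xs]

xs = [1, 2, 3, 4, 5]                  # hypothetical measurements
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
intercept, slope = fit_line(xs, ys)   # roughly 0.09 and 1.97
smoothed_ys = smooth(xs, ys)
```

This corresponds to the static case: the parameters are computed once from the historical data. A dynamic variant would refit as new points arrive.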


• Non-linear regression (slides)
▪ Clustering: cluster data and remove outliers
(automatically or via human inspection)


Lecture 3 – Data Foundations 2
The best regression line lies closest to the points. It sometimes almost touches
points, and the distance to a point is never huge (unless that point is an outlier).
Continuing with data preprocessing: data cleaning has been discussed, now normalization.

Data Preprocessing: Normalization
➢ Linear normalization
➢ Square root normalization (take the square root of every value)
➢ Logarithmic normalization (take ln() of every value)
➢ Possible solutions for data streams (problem when adding data
to the table, e.g. new min/ max values)
o Rerun the normalization
+ overall correct data representation
- computationally expensive
- perception of previous results distorted
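
The three normalizations and the data-stream problem can be sketched as follows. This assumes that linear normalization means min-max scaling to [0, 1] and that the square root and logarithmic variants apply the transform first and then rescale the same way; the notes do not spell this out, and the values are hypothetical.

```python
import math

def normalize(values, transform=lambda v: v):
    """Apply transform to every value, then rescale linearly to [0, 1]."""
    transformed = [transform(v) for v in values]
    lo, hi = min(transformed), max(transformed)
    if hi == lo:
        # All values identical: no spread to scale.
        return [0.0 for _ in transformed]
    return [(t - lo) / (hi - lo) for t in transformed]

values = [1, 4, 9, 16, 100]                    # hypothetical data
linear = normalize(values)
square_root = normalize(values, math.sqrt)     # requires values >= 0
logarithmic = normalize(values, math.log)      # requires values > 0

# Data-stream problem: a new maximum changes earlier normalized values.
before = normalize([1, 4, 9, 16])              # 16 maps to 1.0
after = normalize([1, 4, 9, 16, 100])          # 16 now maps to 15/99
```

Rerunning the normalization after the value 100 arrives keeps the representation correct overall, but the value 16 moves from 1.0 to about 0.15, which is the distortion of previous results noted above.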
