Summary: Data Analytics (INFODA)

Uploaded on: 27-05-2021
Written in: 2020/2021

All material for the DA exam. Including images etc.

SV Data Analytics

Lecture 1 – Introduction
Knowledge Discovery in Databases (KDD)
➢ The process of (semi-)automatic extraction of knowledge from databases, i.e. the process of
discovering useful knowledge from a collection of data, which is
o Valid
o Previously unknown
o Potentially useful
➢ Interdisciplinary field:
o Database systems
▪ Scalability for large datasets – integration from different sources – novel data
types (text)
o Statistics
▪ Probabilistic knowledge – model-based inferences – evaluation of knowledge
o Machine learning
▪ Different paradigms of learning – supervised learning – hypothesis spaces
and search strategies
➢ KDD Process Model

Visual Analytics
➢ Data → visualization → gain insights
➢ Importance of visualization: make both calculations and
graphs. Both sorts of output should be studied; each will
contribute to understanding.
➢ Goals of visualization:
o Presentation
▪ Starting point: facts to be presented are fixed a priori
▪ Process: choice of appropriate presentation techniques
▪ Result: high-quality visualization of the data to present facts
o Confirmatory analysis
▪ Starting point: hypotheses about the data
▪ Process: goal-oriented examination of the hypotheses
▪ Result: visualization of data to confirm or reject the hypotheses
o Exploratory analysis
▪ Starting point: no hypotheses about the data
▪ Process: interactive, usually undirected search for structures, trends
▪ Result: visualization of data to lead to hypotheses about the data
➢ Visualization: the process of presenting data in a
form that allows rapid understanding of
relationships and findings that are not readily
evident from raw data
➢ Two ways of going through the conceptual pipeline (data
→ visualization OR data → models)
➢ Sense-making loop → not a one-way street, but a loop => knowledge generation loop

Lecture 2 – Data Foundations 1
Types of data
➢ Data can be gathered/generated from many sources. Independent of the source, each data
point has a data type
o Nominal & ordinal => categorical or discrete values
o Numeric => continuous scale
➢ Nominal
o Discrete; values can differ, but there is no specific ranking → classification without order
(ID, gender)
o No quantitative relationship between categories
➢ Ordinal
o Example: noise comparison; values differ (one is louder) and the attributes can be
rank-ordered
o The distance can be arbitrary (smoking habits) → distances between values do not have
any meaning
➢ Numeric
o Example: difference in height; attributes can be rank-ordered
o Distances between values have a meaning
o Calculations with the data are possible (height of X = height of Y+5/2): there is a meaningful
distance between values, so mathematical operations are possible (age, time)
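
The difference between the three data types can be sketched in a few lines of Python. This is only an illustration: the function names, the smoking scale and the example values are assumptions made for this sketch and do not come from the lecture.

```python
# Hypothetical ordinal scale, ordered from low to high (illustrative only).
SMOKING_LEVELS = ["never", "sometimes", "daily"]

def same_category(a, b):
    """Nominal data: the only meaningful comparison is equal / not equal."""
    return a == b

def rank(level, scale=SMOKING_LEVELS):
    """Ordinal data: values can be ranked, but the gap between ranks has no meaning."""
    return scale.index(level)

def difference(x, y):
    """Numeric data: distances are meaningful, so arithmetic is allowed."""
    return x - y

print(same_category("female", "male"))     # nominal: only equality can be tested
print(rank("daily") > rank("sometimes"))   # ordinal: only the order can be compared
print(difference(182.0, 177.0))            # numeric: an actual distance (here in cm)
```

Subtracting two ranks would run without error, but the result would not mean anything, which is exactly the point of the ordinal category.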

Typical data classes
➢ Scalar: an individual number in a data record
➢ Multivariate and multidimensional data: multiple variables within a single record can
represent a composite data item; it is not always easy to calculate a difference (e.g. comparing
gender and weight)
➢ Vector: it is common to treat the vector as a whole, e.g. a telephone number that can be divided into
country/region code
➢ Network data: vertices on a surface are connected to their neighbors via edges
➢ Hierarchical data: relationships between nodes in a hierarchy can be specified by links
➢ Time-series data: a complex way of looking into data; time has the widest range of possible
values
➢ Example: ducks
o Nominal: gender
o Numeric: amount
o Vector: location
o Network: parent/child
o Hierarchical: leader/follower
o Time series: movement

Data preprocessing: Data cleaning
➢ Rubbish in, rubbish out. Always clean the data first (e.g. treat missing
values) → low-quality data will lead to low-quality mining results
➢ Data cleaning → missing values:
o Ignore the tuple (delete the entire row)
▪ + easily done, no computational effort
▪ - loss of information; unnecessary if the attribute is not needed
o Fill in the missing value manually
▪ + effective for small datasets
▪ - need to know the value, time consuming, not feasible with large datasets
o Use a global constant (e.g. -1; the algorithm must not use this value in calculations)
▪ + can be easily done; it may be interesting to know that a value is missing
o Use the attribute mean
▪ + simple to implement
▪ - not the most accurate approximation of the value
o Use the most probable value
▪ + most accurate approximation of the value
▪ - most computational effort
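
Three of the strategies above (ignore the tuple, global constant, attribute mean) can be sketched as follows. This is a minimal illustration, assuming rows are stored as dictionaries and a missing value is represented by `None`; the function names and the sample rows are made up for this sketch. The "most probable value" strategy is left out because it needs a predictive model.

```python
from statistics import mean

MISSING = None  # assumption: a missing value is stored as None

def drop_incomplete(rows):
    """Ignore the tuple: remove every row that contains a missing value."""
    return [row for row in rows if MISSING not in row.values()]

def fill_constant(rows, attribute, constant=-1):
    """Use a global constant; later calculations must skip this value."""
    return [
        {**row, attribute: constant if row[attribute] is MISSING else row[attribute]}
        for row in rows
    ]

def fill_mean(rows, attribute):
    """Use the attribute mean, computed over the known values only."""
    known = [row[attribute] for row in rows if row[attribute] is not MISSING]
    attribute_mean = mean(known)
    return [
        {**row, attribute: attribute_mean if row[attribute] is MISSING else row[attribute]}
        for row in rows
    ]

# Hypothetical sample data.
rows = [
    {"name": "A", "age": 20},
    {"name": "B", "age": None},
    {"name": "C", "age": 30},
]
print(drop_incomplete(rows))
print(fill_constant(rows, "age"))
print(fill_mean(rows, "age"))
```

The trade-offs from the list are visible directly: dropping rows loses the record for "B" entirely, while the mean keeps the record at the cost of an invented value.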
➢ Data cleaning → noisy data: a random error or variance in a measured variable
o Smooth out the noise!
o Systematic error: the sensor always reads a little bit too high → the frequency curve keeps the same
shape but is shifted in one direction. With purely random noise the average stays the same

o Handling noisy data:
▪ Binning: sort data and partition into (equi-depth) bins and then
smooth by bin means, bin median, bin boundaries, etc.
• Equal-width binning:
o Divides the range into N intervals of equal size
o Width of the intervals: width = (max-min)/ N
o Simple
o Outliers may dominate result
• Equal-depth binning:
o Divides the range into N intervals
o Each interval contains approximately the same
number of records
o Skewed data is also handled well
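
The two binning schemes and smoothing by bin means can be sketched as below. This is a minimal sketch, not the lecture's implementation: the function names and the sample values are assumptions, and `equal_width_bins` assumes that max > min.

```python
def equal_width_bins(values, n):
    """Split the value range into n intervals of width (max - min) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n  # assumes hi > lo
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would fall just outside the last interval, so clamp it.
        index = min(int((v - lo) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins with roughly the same number of records."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value by the mean of its bin (empty bins are skipped)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]

# Illustrative values only.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))
print(equal_depth_bins(data, 3))
print(smooth_by_means(equal_depth_bins(data, 3)))
```

With equal-width bins a single extreme value stretches the range and can leave most records in one bin, which is why outliers may dominate the result; equal-depth bins avoid this by construction.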

▪ Regression: smooth out noise by fitting a regression function
• Assume our data can be modelled ‘easily’
• Global linear regression models may not be adequate for
“nonlinear” data
• The regression model can be static or dynamic
o Static: using only the historical data to calculate the
function
o Dynamic: also use new data to adapt the model
• Linear regression
o Tries to discover the parameters of the straight-line equation
that best fits the data points → the line that minimizes the squared
error over all data points

➢ B = 0.6857 is a very slight slope.

• Non-linear regression (slides)
▪ Clustering: cluster data and remove outliers
(automatically or via human inspection)
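
The linear regression idea described above (find the straight line that minimizes the squared error, then use it to smooth the data) can be sketched with the closed-form least-squares solution. This is a minimal sketch under the assumption of a single input variable; the function names and sample points are made up, and `b` plays the role of the slope B mentioned in the notes.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx  # assumes the x values are not all identical
    a = mean_y - b * mean_x
    return a, b

def smooth(xs, ys):
    """Replace every noisy y by the value on the fitted line."""
    a, b = fit_line(xs, ys)
    return [a + b * x for x in xs]

# Illustrative noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_line(xs, ys))
print(smooth(xs, ys))
```

This corresponds to the static variant: the line is computed once from the historical data. A dynamic variant would refit (or incrementally update) `a` and `b` whenever new data arrives.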

Lecture 3 – Data Foundations 2
The best regression line is the one closest to the points. It sometimes almost touches points,
and the distance to a point is never huge (unless that point is an outlier).
Continuing with data preprocessing. Data cleaning has been discussed; next:

Data Preprocessing: Normalization
➢ Linear normalization
➢ Square root normalization (take the square root of every value)
➢ Logarithmic normalization (take ln() of every value)
➢ Possible solutions for data streams (problem when adding data
to the table, e.g. new min/max values)
o Rerun the normalization
▪ + overall correct data representation
▪ - computationally expensive
▪ - perception of previous results distorted
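
The three normalizations can be sketched as follows. The notes do not give formulas, so this sketch assumes the common min-max form (x - min) / (max - min) for linear normalization; the function names and sample values are illustrative only.

```python
import math

def linear_normalize(values):
    """Min-max scaling to [0, 1] (assumed form; requires max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def sqrt_normalize(values):
    """Take the square root of every value (values must be >= 0)."""
    return [math.sqrt(v) for v in values]

def log_normalize(values):
    """Take the natural logarithm of every value (values must be > 0)."""
    return [math.log(v) for v in values]

# Illustrative stream: a new maximum arrives after the first normalization.
before = [0, 5, 10]
after = before + [20]
print(linear_normalize(before))
print(linear_normalize(after))
```

Comparing the two printed lists shows the data-stream problem: once a new maximum arrives, rerunning the normalization changes the values that earlier records were mapped to, which is why previously seen results look distorted.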


School, study and subject

Institution: Universiteit Utrecht (UU)
Study: Informatiekunde
Course: Data Analytics (INFODA)

Document information

Number of pages: 40
Type: Summary
