SV Data Analytics
Lecture 1 – Introduction
Knowledge Discovery in Databases (KDD)
➢ The process of (semi-)automatic extraction of knowledge from databases: discovering useful
knowledge from a collection of data, which is
o Valid
o Previously unknown
o Potentially useful
➢ Interdisciplinary field:
o Database systems
▪ Scalability for large datasets – integration from different sources – novel data
types (text)
o Statistics
▪ Probabilistic knowledge – model-based inferences – evaluation of knowledge
o Machine learning
▪ Different paradigms of learning – supervised learning – hypothesis spaces
and search strategies
➢ KDD Process Model
Visual Analytics
➢ Data → visualization → gain insights
➢ Importance of visualization: make both calculations and
graphs. Both sorts of output should be studied; each will
contribute to understanding.
➢ Goals of visualization:
o Presentation
▪ Starting point: facts to be presented are fixed a priori
▪ Process: choice of appropriate presentation techniques
▪ Result: high-quality visualization of the data to present facts
o Confirmatory analysis
▪ Starting point: hypotheses about the data
▪ Process: goal-oriented examination of the hypotheses
▪ Result: visualization of data to confirm or reject the hypotheses
o Exploratory analysis
▪ Starting point: no hypotheses about the data
▪ Process: interactive, usually undirected search for structures, trends
▪ Result: visualization of data to lead to hypotheses about the data
➢ Visualization: the process of presenting data in a
form that allows rapid understanding of
relationships and findings that are not readily
evident from raw data
➢ Two ways of going through the conceptual pipeline (data
→ visualization OR data → models)
➢ Sense-making loop → not a one-way street, but a loop => knowledge generation loop
Lecture 2 – Data Foundations 1
Types of data
➢ Data can be gathered/ generated from many sources. Independent of the source, each data
point has a data type
o Nominal & ordinal => categorical or discrete values
o Numeric => continuous scale
➢ Nominal
o Discrete; distinct values, but no specific ranking → classification without order
(ID, gender)
o No quantitative relationship between categories
➢ Ordinal
o example: comparing noise levels – one is louder than the other; rank order → attributes
can be rank-ordered
o distances can be arbitrary (e.g. smoking habits) → distances between values do not have
any meaning
➢ Numeric
o example: difference in height; attributes can be rank-ordered
o distances between values have a meaning
o calculations with the data are possible (e.g. height of X = height of Y + 5/2); meaningful
distances between values where mathematical operations are possible (age, time); a small
sketch of the three scales follows below
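A minimal sketch (not from the lecture) of how the three measurement scales could be represented
in pandas; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "gender": ["m", "f", "f", "m"],                        # nominal
    "smoking": ["never", "sometimes", "often", "never"],   # ordinal
    "height_cm": [181.0, 167.5, 172.0, 190.2],             # numeric
})

# Nominal: unordered categories – only equality comparisons make sense.
df["gender"] = df["gender"].astype("category")

# Ordinal: ordered categories – ranking is meaningful, distances are not.
df["smoking"] = pd.Categorical(
    df["smoking"], categories=["never", "sometimes", "often"], ordered=True
)

# Numeric: distances and arithmetic are meaningful.
print(df["height_cm"].mean())   # e.g. the average height
print(df["smoking"].min())      # ranking works on the ordered categorical
```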
Typical data classes
➢ Scalar: an individual number in a data record
➢ Multivariate and multidimensional data: multiple variables within a single record can
represent a composite data item; not always easy to calculate a difference (e.g. gender and
weight comparison)
➢ Vector: it is common to treat the vector as a whole; e.g. a telephone number that can be divided
into a country/region code
➢ Network data: vertices on a surface are connected to their neighbors via edges
➢ Hierarchical data: relationships between nodes in a hierarchy can be specified by links
➢ Time-series data: a complex way of looking into data; time has the widest range of possible
values
o Example – ducks (see the sketch after this list):
▪ Nominal: gender
▪ Numeric: amount
▪ Vector: location
▪ Network: parent/child
▪ Hierarchical: leader/follower
▪ Time series: movement
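As a loose illustration of the duck example above, a hypothetical sketch of a single composite
record that combines several of these data classes; the class and field names are invented for
illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Duck:
    """Hypothetical composite record combining several data classes."""
    duck_id: int                        # scalar identifier
    gender: str                         # nominal attribute
    weight_kg: float                    # numeric attribute
    location: Tuple[float, float]       # vector: (latitude, longitude)
    parent: Optional["Duck"] = None     # network / hierarchical link (parent/child)
    # time series: (timestamp, location) samples describing movement
    movement: List[Tuple[float, Tuple[float, float]]] = field(default_factory=list)

mother = Duck(duck_id=1, gender="f", weight_kg=1.2, location=(52.1, 5.6))
child = Duck(duck_id=2, gender="m", weight_kg=0.4, location=(52.1, 5.6), parent=mother)
child.movement.append((0.0, (52.1, 5.6)))
```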
Data preprocessing: Data cleaning
➢ Rubbish in – rubbish out. You have to be certain that you do data cleaning (e.g. treat missing
values) → low-quality data will lead to low-quality mining results
➢ Data cleaning → missing values (a small imputation sketch follows after this list):
o ignore the tuple (delete the whole row)
▪ + easily done, no computational effort
▪ - loss of information, unnecessary if the attribute is not needed
o fill in the missing value manually
▪ + effective for small datasets
▪ - need to know the value, time consuming, not feasible with large datasets
o use a global constant (e.g. -1; the algorithm should not use this value in calculations)
▪ + can be easily done, perhaps interesting to know the missing value
o use the attribute mean
▪ + simple to implement
▪ - not the most accurate approximation of the value
o use the most probable value
▪ + most accurate approximation of the value
▪ - most computational effort
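A minimal pandas sketch of the strategies above; the column names and values are hypothetical,
and the "most probable value" step is only stood in for by a simple median.

```python
import pandas as pd

# Hypothetical toy data with missing values; column names are illustrative.
df = pd.DataFrame({"age": [23, 45, None, 31], "income": [2800, 5200, 4100, None]})

# 1. Ignore the tuple: drop every row that contains a missing value.
dropped = df.dropna()

# 2. Use a global constant (e.g. -1) that the algorithm treats as "missing".
constant_filled = df.fillna(-1)

# 3. Use the attribute mean as a cheap approximation.
mean_filled = df.fillna(df.mean(numeric_only=True))

# 4. Use the most probable value – here only stood in for by the median;
#    a real setup might predict it from the other attributes.
probable_filled = df.fillna(df.median(numeric_only=True))
```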
➢ Data cleaning → noisy data: a random error or variance in a measured variable
o Smooth out the noise!
o Systematic error: the sensor always senses a little bit higher – the frequency curve keeps
the same shape but is shifted in one direction (with purely random noise, the average stays
the same)
o Handling noisy data:
▪ Binning: sort the data and partition it into (equi-depth) bins, then
smooth by bin means, bin medians, bin boundaries, etc. (see the smoothing
sketch after this list)
• Equal-width binning:
o Divides the range into N intervals of equal size
o Width of the intervals: width = (max-min)/ N
o Simple
o Outliers may dominate result
• Equal-depth binning:
o Divides the range into N intervals
o Each interval contains approximately the same
number of records
o Skewed data is also handled well
▪ Regression: smooth out noise by fitting a regression function
o Assume our data can be modelled ‘easily’
o Global linear regression models may not be adequate for
“nonlinear” data
o The regression model can be static or dynamic
▪ Static: using only the historical data to calculate the
function
▪ Dynamic: also use new data to adapt the model
• Linear regression
o Tries to discover the parameters a and b of the straight-line equation
y = a + b·x that best fits the data points → the line that minimises the
squared error over all data points (e.g. a fitted slope of b = 0.6857 is
a very slight slope; a small smoothing sketch follows after this list)
• Non-linear regression (slides)
▪ Clustering: cluster data and remove outliers
(automatically or via human inspection)
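A minimal sketch (assuming pandas and numpy) of the smoothing ideas above: equal-width and
equal-depth binning, smoothing by bin means, and smoothing with a least-squares line; the data
values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the values are made up for illustration.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Equal-width binning: N intervals of equal size, width = (max - min) / N.
equal_width = pd.cut(values, bins=3)

# Equal-depth binning: each bin holds roughly the same number of records.
equal_depth = pd.qcut(values, q=3)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed_by_bins = values.groupby(equal_depth, observed=True).transform("mean")

# Regression smoothing: fit y ≈ a + b*x by least squares and use the fitted line.
x = np.arange(len(values), dtype=float)
b, a = np.polyfit(x, values.to_numpy(), deg=1)   # returns [slope, intercept]
smoothed_by_line = a + b * x
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```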
Lecture 3 – Data Foundations 2
The best regression line lies closest to the points: it almost touches some of them, and the
distance is never huge except for outliers.
Continuing with data preprocessing: data cleaning has been discussed, now normalisation.
Data Preprocessing: Normalisation
➢ Linear normalization
➢ Square root normalization (take the square root of every value)
➢ Logarithmic normalization (take ln() of every value); a small normalization sketch follows
after this list
➢ Possible solutions for data streams (problem when adding data
to the table, e.g. new min/ max values)
o Rerun the normalization
+ overall correct data representation
- computationally expensive
- perception of previous results distorted
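A minimal numpy sketch of the three normalization variants, assuming each transformed attribute
is mapped onto [0, 1] with its min and max; the values are made up for illustration.

```python
import numpy as np

# Hypothetical positive attribute values; made up for illustration.
x = np.array([3.0, 10.0, 25.0, 40.0, 100.0])

def min_max(v):
    """Map values linearly onto [0, 1] using the current min and max."""
    return (v - v.min()) / (v.max() - v.min())

linear = min_max(x)              # linear normalization
sqrt_norm = min_max(np.sqrt(x))  # square root normalization: sqrt of every value first
log_norm = min_max(np.log(x))    # logarithmic normalization: ln() of every value first

# For data streams, a new value outside the old min/max would require
# rerunning the normalization over all data.
print(linear, sqrt_norm, log_norm, sep="\n")
```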