Summary

Data Mining 2017/2018 - Short Summary

Pages
4
Uploaded on
10-01-2018
Written in
2017/2018

Short summary of Data Mining (Data Science): regression, classification, clustering, dimensionality reduction.



Content preview

Data Mining Essentials
Supervised vs Unsupervised Learning
- Supervised learning
o Classification (cat | dog | mouse)
o Regression (24 | 3 | 32 | 10)
- Unsupervised ‘learning’
o Clustering ( a b c | k l m | x y z)
o Dimensionality reduction (X1, X2, X3, X4, X5 » –X3, –X5)

Overall goal of both methods: extract information from a dataset in order to generalise.

Supervised Learning
- Training set of feature vectors, each labelled with a category (shown as colours)
- Flowchart: raw data collection » pre-processing » sampling » re-processing » learning algorithm training » hyperparameter optimisation » post-processing » final classification / regression model

Pre-processing
Feature transformation:

- Categorical variables
o Nominal (green » [0,1,0])
o Ordinal (XL » 3)
- Normalisation and outlier removal
o Z-score ((x − mean) / SD)
o Remove outliers (depends on your goal)
- Vector normalisation
o L2-norm (√∑x²), unit ball is a circle ○
o L1-norm (∑|x|), unit ball is a diamond ◊
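
The feature transformations above can be sketched in plain Python. This is only an illustration, not the course's implementation: the function names, the category list (red, green, blue) and the size levels (S, M, L, XL) are assumptions chosen so that the results match the examples in the list (green » [0,1,0], XL » 3).

```python
import math

def one_hot(value, categories):
    """Nominal variable: one slot per category, 1 in the matching slot."""
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered_levels):
    """Ordinal variable: rank of the level in its natural order (0-based)."""
    return ordered_levels.index(value)

def z_scores(xs):
    """Z-score: subtract the mean, divide by the (population) SD."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / sd for x in xs]  # fails if all values are equal (SD = 0)

def normalise(vector, p):
    """Scale a vector to unit L1 norm (p=1) or unit L2 norm (p=2)."""
    norm = sum(abs(x) ** p for x in vector) ** (1 / p)
    return [x / norm for x in vector]

# Assumed example categories and levels (not from the source).
colours = ["red", "green", "blue"]
sizes = ["S", "M", "L", "XL"]

print(one_hot("green", colours))  # [0, 1, 0]
print(ordinal("XL", sizes))       # 3
```

Whether the ordinal code starts at 0 or 1 is a convention; only the order of the levels matters.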

Data Exploration and Visualisation (descriptive analysis)
- Sort or rearrange your data
- Goal of thesis: how well following the guidelines?

Splitting your data
- The fundamental goal is to generalize beyond the data instances used to train models
- Never touch the test data (until the end)
- Test data must belong to the same (statistical) distribution as the training data!
1. Sequential Split: for a time series, for example, you typically train on one period (say 1-6) and test on the next (7-8). A common pitfall is cycles in the data (on different time-scales).
2. Random Split: blindly assign instances to training…….
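
A minimal sketch of the two splits, assuming the instances are held in a list that is already in time order. The function names, the `test_fraction` parameter and the fixed seed are hypothetical choices for illustration.

```python
import random

def sequential_split(instances, n_train):
    """Train on the first n_train instances (in time order), test on the rest."""
    return instances[:n_train], instances[n_train:]

def random_split(instances, test_fraction, seed=0):
    """Shuffle a copy of the data, then cut off the last part as test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

periods = [1, 2, 3, 4, 5, 6, 7, 8]
train, test = sequential_split(periods, 6)  # train on 1-6, test on 7-8
```

In both cases the test part is set aside once and only used at the very end.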

Sampling and splitting your data
- In the case of small data, you want to stratify your data in terms of the target, or at least check that the ratios in each split are representative.
- In the case of unbalanced data you might want to stratify your data.
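
Stratifying can be sketched as follows: split each target class separately, so that the train and test sets keep roughly the same class ratios. This is an illustrative sketch that assumes a categorical target; `stratified_split` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_split(instances, targets, test_fraction, seed=0):
    """Split per target class so both parts keep similar class ratios."""
    rng = random.Random(seed)

    # Group instance indices by their target value.
    by_class = defaultdict(list)
    for i, target in enumerate(targets):
        by_class[target].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(instances[i], targets[i]) for i in train_idx]
    test = [(instances[i], targets[i]) for i in test_idx]
    return train, test
```

With very small classes the rounding can leave a class without any test instances, so it is still worth checking the ratios afterwards.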

JHessels Tilburg University