Summary

Summary: Data Mining and its Applications (EBB056B05)

Pages
96
Uploaded on
24-06-2024
Written in
2023/2024

Summary of the Data Mining and its Applications lectures. All slides from all lectures are included and supplemented with material from the book and explanations from ChatGPT. I scored an 8.5 on the exam using this summary.

Document information

Summarized whole book?
Yes
Type
Summary

Subjects

Content preview

Lecture 1
Lecture 2: Regression
    R-squared vs. RMSE
    Linear regression:
    Polynomial regression:
    Regression tree: the algorithm
    Bootstrap AGGregating (Bagging): for each tree/model a training set is generated by sampling uniformly with replacement from the standard training set
    Generalization
        Advantages of 5-Fold Cross-Validation
Lecture 3: Time series analysis
    Seasonal effect:
    Exponential smoothing
    Stationarity
    A seasonal difference is the difference between an observation and the corresponding observation from the previous (seasonal) cycle
    ARIMA Models:
    Sequence segmentation
    Characteristics of a time series
Lecture 4: clustering
    Hierarchical Clustering (Linkage-Based Clustering)
    K-Means Clustering (Model-Based Clustering)
    Density-Based Clustering (DBScan)
        Example:
        Importance of MinPts:
    Clustering Evaluation
    Attribute Weighting
    Prototype & model-based (k-means,... clustering)
    Partitioning; goal: a (disjoint) partitioning into k clusters with minimal costs
    K-means
    Outliers: k-means vs. k-medoids
    Density-based clustering
    Clustering evaluation
Lecture 5: Classifiers; Decision Trees, Model validation
    Decision Trees
    Evaluation measures - Shannon Entropy
    Gain Ratio
    Gini Index
    χ² measure
    Decision Trees - Missing Values
    Pruning
    Reduced Error Pruning
    Pessimistic Pruning
    Model Validation
Lecture 6: Additional topics on Data Mining
Lecture 7: overview
ChatGPT
    Example Usage
        Row Splitter Node
        Partitioning Node
    Practical Example
    How Gain Ratio is Calculated:
    Example Use:
    How Gini Index is Calculated:
    Purpose of the Gini Index:
    Example Use:
    Characteristics of String Variables
    Use in Data Mining
    Handling String Variables
    Example




Lecture 1
What is data mining?
→ the extraction of interesting information or patterns from large data sets, which may originally have been
developed for other purposes.

Data states:
● Data at rest
● Data on the move
● Data in use

From data to knowledge:




Data mining project understanding
- What is the primary objective?
- What are the criteria for success?
- These are difficult to define
- Stakeholders involved in the data analysis/mining process speak different languages




Data Mining Stakeholders
● Business User: business understanding
○ Has a sound understanding of the business domain targeted by the data mining project. This person can offer insight into the project context and the business value to be extracted via data mining, and can advise on how results can be operationalized.
● Project Sponsor: project driver
○ The initiator or driver of the data mining project. Concerned with the potential ROI and sets priorities and desired outputs. This person champions the project and motivates the engagement of key personnel around the business problem.
● Project Manager: end-to-end project delivery
○ In charge of the data mining project implementation and concerned with meeting quality, time and budget targets.
● Business Intelligence Analyst: data understanding
○ The bridge between the data and the business view of the targeted problem. Maintaining a sound understanding of relevant data, the Business Intelligence Analyst drives activities related to Key Performance Indicators (KPIs) and extracts relevant data for reporting and dashboarding purposes. Understands the sources and ‘consumers’ of data, as well as the need for changes in data management processes.
● Data Administrator & Integrator: data preparation & solution delivery
○ Provides action support for implementing the key data access and processing activities needed by stakeholders of the data mining project. A technical person with sound data management competences, including awareness of security and/or privacy concerns, would be appropriate.
● Data Scientist/Engineer: data modeling & evaluation
○ This person combines data management skills with a sound understanding of data analysis methods and tools and drives the ingestion of data into the overall data analytics process. The data scientist is able to communicate the analytics methods to the other stakeholders.
→ The data engineer and the data administrator & integrator work closely together on the technical side of data mining and share relevant code and documentation.

Data Mining Project Workflow
1. Inception and discovery
a. Tool to sketch beliefs, experiences, known factors
b. How often will a certain product be found in a basket?
2. Data preparation
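
The question in step 1b ("How often will a certain product be found in a basket?") can be made concrete with a small Python sketch. It assumes each basket is simply a list of product names and reports, per product, the fraction of baskets that contain it. The function name `product_frequency` and the toy baskets are hypothetical and only serve as illustration; they do not come from the lecture.

```python
from collections import Counter


def product_frequency(baskets):
    """Return, for each product, the fraction of baskets that contain it."""
    if not baskets:
        return {}
    counts = Counter()
    for basket in baskets:
        # Use a set so a product is counted at most once per basket.
        counts.update(set(basket))
    total = len(baskets)
    return {product: n / total for product, n in counts.items()}


# Hypothetical toy data, not taken from the course material.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk"],
    ["bread", "bread", "eggs"],
]

freq = product_frequency(baskets)
print(freq["bread"])  # bread appears in 3 of the 4 baskets
```

Counting a product once per basket answers "in how many baskets does it occur", which is a different question from "how many units were sold".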




donnakartoidjojo Rijksuniversiteit Groningen