Summary Data Mining | Midterm week 1-3

This summary includes all material of weeks 1-3 and serves for the first midterm of this course (lecture notes 1-3).

Week 1
Slides
Data Mining for Business & Governance

Lecture 1: What is Data Mining?
Data mining is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

It is about extracting novel, interesting and potentially useful knowledge.

(Main) relations to:
• Knowledge discovery in databases
• Machine learning → branch of computer science studying learning from data
• Statistics → branch of mathematics focused on data
• Artificial intelligence → interdisciplinary field aiming to develop intelligent machines

Key aspects
• Computation vs. large data sets: there is a trade-off to be made between processing time and memory.
• Computation enables analysis of large data sets: computers are the tool; with growing data, we need to design efficient computational methods that work on the data to extract knowledge and give it meaning.
• Data mining often implies knowledge discovery from databases: going from unstructured data to structured knowledge.
  - Unstructured data: text
  - Semi-structured data: an HTML page, because the tags give us some extra information
  - Structured data: tables

What are large amounts, or big data? (The definition is always changing.)
→ Current opinion: we should have smaller datasets, so that we can enrich them and give them higher quality.
Volume
• Too big for manual analysis
• Too big to fit in RAM
• Too big to store on disk

Variety
• Range of values: variance
• Outliers, confounders and noise
• Different data types

Velocity
• Data changes quickly: require results before the data changes
• Streaming data (no storage)

Application of data mining

Companies: business intelligence → market analysis and management
• Target marketing, CRM
• Risk analysis and management
• Forecasting, customer retention, quality control, competitive analysis
• Fraud detection and management
• AH bonus card, Amazon, Mastercard, Booking.com

Science: knowledge discovery → scientific discovery in large data
• DNA: sequence data
• SETI program, time series
• Electronic Health Records
• Social Network Analysis
• Text Mining (natural language processing): going from unstructured text → structured knowledge


What makes prediction possible?
Make sure there is some structure in the data!
• Associations between features/target
• Association between numerical variables: correlation coefficient
• Categorical: mutual information → the value of X1 contains information about the value of X2 (see the sketch below)
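
A minimal sketch of both association measures (not from the slides; the data and variable names are invented), in Python with NumPy:

import numpy as np

def pearson_r(x, y):
    # Covariance of x and y divided by the product of their standard deviations
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def mutual_information(x, y):
    # I(X1; X2) = sum over value pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            p_xy = np.mean((x == vx) & (y == vy))          # joint probability
            p_x, p_y = np.mean(x == vx), np.mean(y == vy)  # marginal probabilities
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# Numerical features -> correlation coefficient
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 75, 80]
print(pearson_r(hours, score))              # near +1: strong linear association

# Categorical features -> mutual information
smoker  = ["yes", "yes", "no", "no", "no"]
disease = ["yes", "yes", "no", "no", "yes"]
print(mutual_information(smoker, disease))  # > 0: X1 tells us about X2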


Different types of learning
A program is said to learn from experience (E) on a task (T) with a performance measure (P) if its performance at tasks in T, as measured by P, improves with E.
• Supervised learning – label
= You train the machine using data which is well 'labeled' → you are learning a mapping from the input to the desired output.

  - Classification: because we have a label, we can try to get a model to classify different classes of diseases.
  - Regression: when we have numerical data, e.g. specifying the risk of getting a disease. (A sketch of both tasks follows below.)
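
A tiny sketch of the contrast between the two tasks, assuming scikit-learn is available; the patient data is invented for illustration:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy patient data (made up): [age, blood pressure]
X = np.array([[25, 120], [40, 135], [55, 150], [62, 160], [33, 125]])

# Classification: the label is a class ("healthy" / "sick")
y_class = np.array(["healthy", "healthy", "sick", "sick", "healthy"])
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[50, 145]]))   # -> a predicted class label

# Regression: the target is a number (e.g. risk of disease)
y_risk = np.array([0.1, 0.3, 0.7, 0.8, 0.2])
reg = DecisionTreeRegressor().fit(X, y_risk)
print(reg.predict([[50, 145]]))   # -> a numerical risk estimate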





• Unsupervised learning – no labels
= We don't know anything about the data; you are not aiming to produce an output in response to the input. Instead, you want to discover patterns in the data.

  - Dimensionality reduction: with a large number of attributes, we can try to reduce them to the most relevant/interesting ones.
  - Clustering: you will investigate similar groups of patients. (See the sketch below.)
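
A minimal sketch of both unsupervised tasks, assuming scikit-learn; the data here is random and purely illustrative:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 unlabeled "patients", 10 attributes

# Dimensionality reduction: keep the 2 most informative directions (PCA)
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                # (100, 2)

# Clustering: discover 3 groups of similar patients
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)
print(np.bincount(labels))       # sizes of the discovered groups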

Inductive learning for algorithms: the algorithm learns from samples / training data / trial and error.

Supervised learning workflow for algorithms (a code sketch of steps 4 and 5 follows after this list)

1. Collect data
• How do you select your sample?
• Reliability of measurement
• Privacy and other regulations

2. Label examples
• Annotation guidelines
• Measure inter-annotator agreement
• Crowdsourcing

3. Choose representation
• Features: attributes describing examples
  - Numerical or categorical (binary)
• Possibly convert to a feature vector
  - A vector is a fixed-size list of numbers
  - A feature vector describes the object that you want to use
  - Some learning algorithms require examples represented as vectors → spectra representation
• Decision tree models, neural networks etc.

4. Train model(s)
• Keep some examples for the final evaluation: test set
• Use the rest for:
  - Learning: training set
  - Tuning: validation set

Parameter or model tuning
• Learning algorithms typically have settings (aka hyperparameters)
• For each value of the hyperparameters:
  - Apply the algorithm to the training set to learn
  - Check performance on the validation set
  - Find/choose the best-performing setting

5. Evaluate
• Check the performance of the tuned model on the test set
• Goal: estimate how well your model will do in the real world
• Keep evaluation realistic
• You want your data to be balanced; it's bad if one group is overrepresented or underrepresented → learn to create a representative sample, e.g. downsample the data
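
A hand-rolled sketch of steps 4 and 5 (train/validation/test split plus hyperparameter tuning), assuming scikit-learn; the dataset and the tuned hyperparameter (tree depth) are chosen only for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep some examples for the final evaluation (test set) ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ... and split the rest into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Parameter tuning: for each hyperparameter value, learn on train, check on validation
best_depth, best_acc = None, -1.0
for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Evaluate: check the tuned model once on the held-out test set
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("best depth:", best_depth)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))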





Correlation Coefficient
Pearson's r measures the strength of a linear relationship (dependency):

r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

Pearson's correlation coefficient:
• Numerator: covariance → to what extent do the features change together?
• Denominator: product of standard deviations → makes the correlation independent of units




Covariance and correlation
Covariance indicates the relationship between two variables when one variable changes: if an increase in one variable results in an increase in the other variable, the two variables are said to have a positive covariance.

→ The sign of the covariance corresponds to the direction of the linear relationship.

The magnitude of the covariance is not easy to interpret.

The correlation coefficient is normalized and corresponds to the strength of the linear relation: divide the covariance by the product of the variables' standard deviations. (See the sketch below.)
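
A numerical sketch of how covariance relates to the correlation coefficient, using NumPy built-ins; the data is invented for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. hours studied
y = np.array([52.0, 60.0, 61.0, 75.0, 80.0])   # e.g. exam score

# Covariance: to what extent do the two variables change together?
cov = np.cov(x, y)[0, 1]

# Normalize by the product of the standard deviations -> unit-free
r = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov)                      # positive, but its magnitude depends on the units
print(r)                        # close to +1: strong positive linear relation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in r gives the same value

# Rescaling x changes the covariance but not the correlation
print(np.cov(100 * x, y)[0, 1])       # covariance is 100x larger
print(np.corrcoef(100 * x, y)[0, 1])  # correlation is unchanged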




