What is Data Mining?
“Data mining is the computational process of discovering patterns in large data sets involving methods at the intersection of:
Statistics (branch of mathematics focused on data);
Machine Learning (branch of Computer Science studying learning from data);
Artificial Intelligence (interdisciplinary field aiming to develop intelligent machines);
Database systems.”
Key aspects
Computation vs Large data sets (trade-off between processing time and memory)
Computation enables analysis of large data sets (computers as a tool, increasingly important as data grows)
Data Mining often implies knowledge discovery from databases (from unstructured data to structured knowledge)
Text Mining (natural language processing): going from unstructured text to structured knowledge
What are large amounts of data, or big data?
Volume (too big: for manual analysis, to fit in RAM, to store on disk)
Variety (range of values: variance | outliers, confounders and noise | interactions: data is co-dependent)
Velocity (data changes quickly: results needed before the data changes | streaming data, no storage)
Applications of data mining
Companies: Business Intelligence (Amazon, Booking, AH)
o Market analysis and management
Science: Knowledge Discovery (University, Laboratories)
o Scientific discovery in large data
What makes prediction possible?
Associations between features/target (Amazon)
Numerical: correlation coefficient
Categorical: mutual information (the value of x1 carries information about the value of x2; see the sketch below)
Fitting data is easy, but predictions are hard!
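A minimal sketch of the categorical case, assuming scikit-learn is available; the two variables and their values are hypothetical:

# Mutual information between two categorical variables (sketch;
# assumes scikit-learn; the variables below are hypothetical).
from sklearn.metrics import mutual_info_score

x1 = ["books", "books", "games", "games", "books"]      # e.g. category browsed
x2 = ["novel", "novel", "console", "console", "novel"]  # e.g. item bought

# High mutual information: the value of x1 carries information about x2.
print(mutual_info_score(x1, x2))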
Iris dataset
Pearson’s r (correlation coefficient)
Numerator: covariance (to what extent the features change together)
Denominator: product of standard deviations (makes correlations independent of units)
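Written out, the formula these two lines describe is the standard definition of Pearson's r:

r_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}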
Pearson’s coefficient of Petal Length by Petal Width: computed in the sketch below.
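A minimal sketch of the computation, assuming scikit-learn's bundled Iris data (petal length and petal width are columns 2 and 3 of iris.data):

# Pearson's r for petal length vs. petal width on Iris (sketch;
# assumes scikit-learn and NumPy).
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r (about 0.96 here: a strong linear association).
print(np.corrcoef(petal_length, petal_width)[0, 1])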
Caveats
Pearson’s r only measures linear dependency (see the sketch after this list)
Other types of dependency can also be used for prediction!
Correlation does not imply causation, but it may still enable prediction.
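A minimal sketch of the first caveat, using a constructed example (not from the notes): y is fully determined by x, yet Pearson's r is near zero because the dependency is not linear.

# A perfect but nonlinear dependency that Pearson's r misses
# (constructed illustration; assumes NumPy).
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2  # y is a deterministic function of x, but not a linear one

# r is ~0: the positive- and negative-x halves cancel each other out.
print(np.corrcoef(x, y)[0, 1])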
What is machine learning?
“A program is said to learn from experience (E) on a task (T) with a performance measure (P) if its performance at tasks in T, as measured by P, improves with E.”
Supervised Learning
Learn a mapping from input to output.
Classification: output → class labels
Regression: output → continuous values
(figure: example plots of classification vs. regression)
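A minimal sketch contrasting the two output types on the Iris data, assuming scikit-learn; the two models are illustrative choices, not prescribed by the notes:

# Classification vs. regression (sketch; assumes scikit-learn;
# the model choices are illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

iris = load_iris()

# Classification: the output is a class label (the iris species).
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)
print(clf.predict(iris.data[:2]))  # -> class labels

# Regression: the output is a continuous value (petal width here).
X, y = iris.data[:, :3], iris.data[:, 3]
reg = LinearRegression().fit(X, y)
print(reg.predict(X[:2]))          # -> continuous values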
Supervised learning workflow
1. Collect data (How do you select your sample? Reliability, privacy and other regulations.)
2. Label examples (annotation guidelines, measure inter-annotator agreement, crowdsourcing.)
3. Choose example representation
Features: attributes describing examples
o Numerical
o Categorical
Possibly convert to feature vectors (see the sketch after this step)
o A vector is a fixed-size list of numbers
o Some learning algorithms require examples represented as vectors
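A minimal sketch of step 3 for a categorical feature, assuming scikit-learn; one-hot encoding is one common way to get fixed-size numeric vectors (the feature values are made up):

# One-hot encoding: categorical values -> fixed-size numeric vectors
# (sketch; assumes scikit-learn; the values are hypothetical).
from sklearn.preprocessing import OneHotEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]

enc = OneHotEncoder()
vectors = enc.fit_transform(colors).toarray()
print(enc.categories_)  # the learned vocabulary per feature
print(vectors)          # one fixed-size 0/1 vector per example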
4. Train model(s)
Keep some examples for final evaluation: test set
Use the rest for
o Learning: training set
o Tuning: validation set
5. Evaluate
Check performance of tuned model on test set
Goal: estimate how well your model will do in the real world
Keep evaluation realistic!
Parameter or model tuning
Learning algorithms typically have settings (aka hyperparameters)
For each value of the hyperparameters:
o Apply the algorithm to the training set to learn a model
o Check performance on the validation set
Choose the best-performing setting (sketched below)
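A minimal sketch of the whole loop, assuming scikit-learn; k-nearest neighbours and the grid of k values are illustrative choices:

# Hyperparameter tuning with a train/validation split (sketch;
# assumes scikit-learn; k-NN and the grid of k values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:                      # candidate hyperparameter values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)          # check on the validation set
    if score > best_score:
        best_k, best_score = k, score

# Final, realistic evaluation of the chosen setting on the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))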
Unsupervised learning
Only the inputs are given; there are no output labels.
Clustering: group similar objects
Dimensionality reduction: reduce the number of random variables (features)
(figure: example plots of clustering vs. dimensionality reduction)
Clustering
Task of grouping a set of objects in such a way that objects in the same group (called a cluster) are
more similar (in some sense or another) to each other than to those in other groups (clusters).
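A minimal sketch of clustering, assuming scikit-learn; k-means is one algorithm among many, the notes do not prescribe it:

# Clustering with k-means (sketch; assumes scikit-learn; k-means is
# an illustrative algorithm choice).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are ignored: unsupervised

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])         # a cluster assignment per example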
Dimensionality reduction
Feature selection: reduce the number of features by keeping an informative subset (see the sketch after this list)
o Reduce complexity and easier interpretation
o Reduce demand on resources (computation / memory)
o Reduce the ‘curse of dimensionality’
o Reduce chance of over-fitting
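A minimal sketch of feature selection, assuming scikit-learn; dropping low-variance features is one simple, unsupervised criterion (the threshold is an illustrative choice):

# Feature selection: drop low-variance features (sketch; assumes
# scikit-learn; the threshold value is illustrative).
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X, _ = load_iris(return_X_y=True)

selector = VarianceThreshold(threshold=0.2)  # keep features with variance > 0.2
X_selected = selector.fit_transform(X)
print(X.shape, "->", X_selected.shape)       # low-variance features are dropped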
Feature extraction: often domain specific
o Image Processing: edge detection
o From pixels to reduced set of features
o Often part of pre-processing, but might contain the hard problems
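A minimal sketch of feature extraction by projection, assuming scikit-learn; PCA is a standard dimensionality-reduction technique, though the notes do not name it:

# Dimensionality reduction with PCA (sketch; assumes scikit-learn;
# PCA itself is not named in the notes).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)     # 150 examples, 4 features

pca = PCA(n_components=2)             # project 4 features down to 2
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance kept per component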