Lesson 1
What is machine learning?
We want machines that can learn by themselves and produce solutions
automatically. ML is a branch of AI that enables computers to learn from data
and improve their performance on tasks over time without being explicitly
programmed for each task.
How can we automate problem solving?
E.g. flag spam in your inbox. Trial and error. Our algorithm should learn that
certain words indicate spam.
Rules
If (A or B or C) and not D, then SPAM. If w1x1 + w2x2 + w3x3 >= 0 then SPAM.
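The two rule styles above can be sketched as follows. This is only an illustration: it assumes that A to D stand for arbitrary yes/no checks on an email, and the weights, the chosen keywords and the trick of fixing x3 to 1 (so that w3 acts as a threshold) are hypothetical choices, not values from the lecture.

```python
def rule_based_is_spam(a: bool, b: bool, c: bool, d: bool) -> bool:
    """Hand-written boolean rule: (A or B or C) and not D."""
    return (a or b or c) and not d


def linear_is_spam(weights: list[float], features: list[float]) -> bool:
    """Weighted-sum rule: w1*x1 + w2*x2 + w3*x3 >= 0 means SPAM."""
    score = sum(w * x for w, x in zip(weights, features))
    return score >= 0


# Hypothetical weights, picked by hand for illustration only.
# x1 = 1 if "free" occurs, x2 = 1 if "winner" occurs, x3 is fixed to 1
# so that w3 acts as a threshold.
WEIGHTS = [2.0, 1.0, -1.5]

print(rule_based_is_spam(True, False, False, False))  # True
print(linear_is_spam(WEIGHTS, [1, 0, 1]))  # True: 2.0 - 1.5 = 0.5 >= 0
print(linear_is_spam(WEIGHTS, [0, 1, 1]))  # False: 1.0 - 1.5 = -0.5 < 0
```

In the first style a person writes the rule. In the second style only the form of the rule is fixed, and the weights are what a learning algorithm would adjust.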
Learning from examples
Data: collect examples of spam and non-spam emails.
Learning algorithm: learns patterns and rules from the data.
Evaluation: tests the model on new emails and compares the results with the
known labels.
Deployment: deploy the trained system for real-world use.
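A minimal sketch of these steps, assuming a perceptron as the learning algorithm (one of the linear classifiers named in these notes). The emails, the vocabulary and the number of epochs are made up for illustration, and deployment is only indicated in a comment.

```python
def featurize(text: str, vocabulary: list[str]) -> list[float]:
    """Bag-of-words features plus a constant 1.0 that acts as a bias."""
    words = text.lower().split()
    return [float(word in words) for word in vocabulary] + [1.0]


def predict(weights, features) -> int:
    """Return 1 (spam) if the weighted sum is >= 0, else 0."""
    return int(sum(w * x for w, x in zip(weights, features)) >= 0)


def train_perceptron(examples, vocabulary, epochs=10):
    """Perceptron rule: nudge the weights after every mistake."""
    weights = [0.0] * (len(vocabulary) + 1)
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text, vocabulary)
            error = label - predict(weights, x)  # -1, 0 or +1
            weights = [w + error * xi for w, xi in zip(weights, x)]
    return weights


# 1. Data: made-up labelled emails (1 = spam, 0 = not spam).
train = [
    ("win free money now", 1),
    ("free prize winner", 1),
    ("meeting at noon", 0),
    ("lunch tomorrow", 0),
]
vocab = ["free", "win", "prize", "meeting", "lunch"]

# 2. Learning algorithm: fit the weights on the training emails.
weights = train_perceptron(train, vocab)

# 3. Evaluation: compare predictions on new emails with their known labels.
test = [("free money", 1), ("meeting tomorrow", 0)]
correct = sum(predict(weights, featurize(t, vocab)) == y for t, y in test)
print(f"accuracy on new emails: {correct / len(test):.2f}")

# 4. Deployment would call predict() on incoming mail; not shown here.
```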
Learning algorithms
Tree-based: decision trees, random forests, gradient boosted trees
Linear classifiers and regressors: perceptron, linear and logistic regression
Neural networks: multi-layer perceptrons, deep learning
Typical ML applications
Recommender systems; based on earlier purchases or similar baskets to other
customers
Sales forecasts
Flag suspicious credit-card transactions
Guess a person's age based on a sample of their writing
Recognize handwritten numbers and letters
Determine whether a text expresses positive, negative or no opinion.
Recognize faces in photos.
Classify (medical) images.
Recommend books and movies to users based on their own and others'
purchase history.
Machine learning
Supervised learning
Learns from pre-labeled examples
Unsupervised learning
No labels; discovers natural groupings and relationships.
Reinforcement learning
An agent learns by taking actions in an environment, through trial and error.
Models are trained without really explicit feedback; the agent learns by
adapting some of its parameters.
Regression
The target is a real number, e.g. housing price prediction. There is a relation
between size and price, but it does not always hold. We can decide which model
to use: linear, polynomial…
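As a sketch of the linear option, the following fits a straight line to house sizes and prices with ordinary least squares. The data points are invented for illustration and `fit_line` is a hypothetical helper name; the points are deliberately not on a perfect line, since size explains price only partly.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: house size in m^2 and price in thousands of euros.
sizes = [50, 70, 90, 110]
prices = [150, 210, 250, 330]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 100 + intercept  # price estimate for a 100 m^2 house
print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"estimate for 100 m^2: {predicted:.1f}")
```

A polynomial model would add terms such as size squared as extra inputs; the target stays a real number either way.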
Binary classification
One of two options.
E.g. detect spam, predict sex (male/female)
Multiclass classification
One of a finite set of options
Multilabel classification
Multiple labels from a set of options
Ranking
An ordering of objects
Sequence labelling
A label for each element in the input sequence. The output sequence has the
same length as the input sequence.
Differences with sequence labelling
The input and output sequences do not have to match element by element.
The input and output can have a different number of words.
An example is Google Translate: the input is in one language and the output is
in another. “I need to work on this” is 6 words, while the Dutch translation
“ik moet hier aan werken” is 5 words.
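The difference between the two sequence tasks can be made concrete with the example sentence. The part-of-speech tags below are assigned by hand purely for illustration; they are not the output of any tagger.

```python
# Sequence labelling: exactly one label per input token.
tokens = ["I", "need", "to", "work", "on", "this"]
tags = ["PRON", "VERB", "PART", "VERB", "ADP", "PRON"]  # illustrative tags
assert len(tokens) == len(tags)

# Translation: input and output lengths may differ.
source = "I need to work on this".split()
target = "ik moet hier aan werken".split()
print(len(source), len(target))  # 6 5
```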
Model
Can be seen as a mathematical representation of our learning algorithm.
Variable x: input, feature, independent variable.
Variable y: output, target, dependent variable.
Model: linear regression, logistic regression, perceptron, decision trees,
random forests, gradient boosted trees, artificial neural networks (ANN).
Data
Numbers
Text
Images
Any form of input that can be processed by an algorithm
Data splitting
We need to split the data. Make sure the class proportions are the same in each
split and that the samples are split randomly. When doing this we have to make
sure we don’t have data leakage, meaning that no sample from the test data also
appears in the training data.
What we do with these sets: we train the model on the training data, and it
learns from the outputs it is shown. After learning is finished, we apply the
trained model to the test set. The outputs are predictions, and the evaluation
compares them with the ground truth.
We need an additional split called the validation set: a part of the data that
is split off randomly for hyperparameter tuning. It is used as an intermediate
testing step, applied to the trained model.
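A minimal sketch of such a random three-way split, using only the standard library. The 60/20/20 proportions, the fixed seed and the helper name `split_data` are assumptions for illustration, not values from the lecture.

```python
import random


def split_data(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


samples = list(range(100))  # stand-in for 100 labelled examples
train, val, test = split_data(samples)

# No leakage: the three parts share no sample.
assert not set(train) & set(val)
assert not set(train) & set(test)
assert not set(val) & set(test)
print(len(train), len(val), len(test))  # 60 20 20
```

Because every sample lands in exactly one slice of the shuffled list, a test sample cannot also appear in the training data, as long as the data itself contains no duplicates.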
Training, Tuning, and Testing: The role of data splits
Splitting data for evaluation
Training set: learn, infer rules.
Validation set: monitor performance, choose the best learning options.
Test set: evaluate generalization to a real-world setting. Not accessible in
advance.
Alternatives
Cross validation
To make sure we train our model on all our samples and are not biased by one
particular split, divide the data into K folds. Each time, keep one fold for
validation and train on the others. This way we have a robust evaluation of our
model. Afterwards we can test it on the test set.
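The fold rotation can be sketched as below. `k_fold_indices` and `cross_validate` are hypothetical helper names, the data is assumed to be shuffled already, and `train_and_score` stands for any user-supplied function that trains on the first index list and returns a score on the second.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the validation score returned for every fold."""
    scores = [train_and_score(train, val)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train, val in k_fold_indices(6, 3):
    print(train, val)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Every sample is used for validation exactly once and for training in the other K - 1 rounds.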
Stratification
Use when the data set is imbalanced. The proportion of classes is kept the same
in all data splits.
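One way to sketch stratification is to split each class separately and then merge the parts. The helper name, the 25% test fraction, the seed and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with the same class mix in both."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Imbalanced toy labels: 80% "ham", 20% "spam".
labels = ["ham"] * 16 + ["spam"] * 4
train_idx, test_idx = stratified_split(labels)

spam_share_train = sum(labels[i] == "spam" for i in train_idx) / len(train_idx)
spam_share_test = sum(labels[i] == "spam" for i in test_idx) / len(test_idx)
print(spam_share_train, spam_share_test)  # 0.2 0.2
```

A purely random split of such a small, imbalanced set could easily leave the test part with no spam at all; splitting per class avoids that.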
Time-series split
We need to make sure that the data points in the evaluation set lie in the
future relative to the training data.
Time-series split with cross validation
Gradually extend the training period, evaluate each time on the period that
follows it, and then average the evaluation over all of these divisions.
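A sketch of this expanding-window scheme, assuming the samples are already sorted by time so that a larger index means a later point. The helper name is hypothetical, and samples left over when the length is not divisible by `n_splits + 1` are simply dropped from the end.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx); validation always follows training."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * fold))
        val = list(range(i * fold, (i + 1) * fold))
        yield train, val


for train, val in expanding_window_splits(8, 3):
    print(train, val)
# [0, 1] [2, 3]
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
```

The final score would be the average of the scores obtained on the three validation windows. No shuffling is done here, because shuffling would let future points leak into the training data.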
Evaluation metrics
What is important to consider?
Make sure that the metric you use for the development set and the test set is
the same.
Regression evaluation metrics
Mean absolute error
The average absolute difference between the true value and the predicted value.
Mean squared error
The average squared difference between the true value and the predicted value.
Coefficient of determination (R²)
The proportion of the variation in the dependent variable that is predictable
from the independent variable(s).
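The three regression metrics can be written out directly from their definitions. The function names below mirror common library naming but are defined locally, and the example values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(mean_squared_error(y_true, y_pred))   # 0.5
print(r2_score(y_true, y_pred))             # 0.9
```

Here two of the four predictions are off by 1, so both error averages are 0.5. R² is 1 - 2/20 = 0.9, because the true values vary by a total of 20 squared units around their mean of 6.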
Metrics for classification
Error rate