Data Science Summary
1. Chapter 1
1.1 Introduction
Nowadays, AI is used more and more.
Before: Innovations were spaced out -> society had time to become aware of them and
integrate them before the next innovation arrived.
Now: Innovation cycles are accelerating -> awareness and implementation
happen almost simultaneously, leaving little time for adaptation.
Implications:
Society is part of a live experiment: uncertain future developments &
limited understanding of the impact on society.
Societal bounding of the experiment: Legal and ethical frameworks lag
behind technology
Example:
Economic standpoint: An AI model is cheaper than a doctor ->
fire doctors.
Rational standpoint: AI is not the best option for very rare cases,
patient acceptance, …
Use AI to support doctors, but we still need doctors for
patient acceptance, etc.
Don’t take this data as a given; be critical of how the
research was done. (Part of the data may have
been used to train or benchmark the AI systems.)
In some respects, AI is not yet at human level. It is vital to understand how
these systems operate. Ex: a picture of a panda overlaid with a very small amount
of noise made the model give the wrong answer.
Terminology:
Classic business intelligence: You know what you are looking for (predetermined)
-> no modelling or pattern finding.
Querying
o You know exactly what you are looking for
o SQL
o SELECT * FROM CUSTOMERS WHERE AGE > 45
OLAP: Online Analytical Processing
o GUI to query large data collections in real-time
o Pre-programmed dimensions of analysis
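The query above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the in-memory database, the columns (NAME, AGE) and the three rows are made up for illustration; only the SELECT statement comes from the notes.

```python
import sqlite3

# Hypothetical in-memory table; columns and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (NAME TEXT, AGE INTEGER)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?)",
    [("Ann", 52), ("Bob", 31), ("Cleo", 47)],
)

# The query from the notes: we know exactly what we are looking for.
rows = conn.execute("SELECT * FROM CUSTOMERS WHERE AGE > 45").fetchall()
names = sorted(name for name, age in rows)
print(names)
conn.close()
```

The query only retrieves rows that match a predetermined condition; no model is learned and no new pattern is discovered.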
Data science: A set of fundamental principles that guide the extraction of
knowledge from data
Data mining: The extraction of knowledge from data, via technologies that
incorporate these principles.
Big Data: Data that is so large that traditional data storage and processing
systems are unable to deal with it (velocity, volume, variety).
Data mining: You don’t know in advance what you are looking for; you want to find
new, intricate patterns in the (big) data.
Concerns:
• Modern ML techniques are very good at learning complex patterns in data
to solve certain types of predefined tasks
• Data science harnesses these techniques to solve commercial and
business issues to create value
1.2 Data
Data is the basis. It is a “raw stream of facts” and can be structured (ex: Excel
table) or unstructured (ex: text). It is information/knowledge.
It can lead to better decision-making through data science.
Data is a valuable asset because of the potential to make better decisions
based on the data itself.
Important technology: machine learning -> Learning a model from data
What is a model?
Model = An (abstract) representation of (a part of) reality.
In ML: a model is learned/trained by a Machine Learning Algorithm,
based on data.
For different parameters, we end up with different models
The estimation of an unknown value = prediction
Best linear fit = the linear model that fits the training data best
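As a small illustration of a "best linear fit", the sketch below fits a line y = w*x + b to a handful of points. It assumes that "fits best" means the smallest sum of squared errors (least squares), which the notes do not state explicitly; the data values are invented.

```python
import numpy as np

# Invented training data (assumption): roughly y = 2x + 1 with small deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def squared_error(w, b):
    """Sum of squared differences between the line w*x + b and the data."""
    return float(np.sum((w * x + b - y) ** 2))

# Different parameter values give different models with different errors.
print(squared_error(1.0, 0.0))   # a poor line
print(squared_error(2.0, 1.0))   # a much better line

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept].
w_best, b_best = np.polyfit(x, y, deg=1)
print(w_best, b_best, squared_error(w_best, b_best))

# Prediction = estimating an unknown value with the learned model.
y_new = w_best * 5.0 + b_best
print(y_new)
```

No other line has a lower squared error on these five points than the one returned by np.polyfit.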
Model learning:
Finding a mapping
Goal: learn a mapping (based on data).
This mapping is determined by the values of the model parameters.
o Here: Parameters determine the specific function within the chosen
model class
Extra: The chosen model class (ex: linear vs. neural network) defines the form of
the mapping; the parameter values specify which exact mapping we get.
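The difference between a model class and its parameters can be sketched in a few lines. The function names linear_model and quadratic_model and all parameter values are hypothetical, chosen only for illustration.

```python
def linear_model(w, b):
    """Model class: all functions of the form f(x) = w*x + b."""
    def f(x):
        return w * x + b
    return f

def quadratic_model(a, b, c):
    """A different model class: f(x) = a*x^2 + b*x + c."""
    def f(x):
        return a * x ** 2 + b * x + c
    return f

# Same model class, different parameter values -> different mappings.
f1 = linear_model(w=2.0, b=1.0)
f2 = linear_model(w=-0.5, b=3.0)

# Different model class -> a different functional form altogether.
g = quadratic_model(a=1.0, b=0.0, c=0.0)

print(f1(4.0), f2(4.0), g(4.0))
```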
Approximating complex functions with neural networks
Neural networks extend this idea: they model more complex, nonlinear
functions.
Compared to simple models (linear regression), neural networks involve
many more parameters, but the underlying principle is the same:
o Choose a model class (which defines the functional form), then find
the optimal parameters that give the best fit to the data.
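To make the comparison concrete, the sketch below writes a linear model and a small neural network as functions of their parameters. The network size (one hidden layer with 8 tanh units) and the random parameter values are arbitrary assumptions; no training is done here.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    """Linear model class: 2 parameters for a single input."""
    return w * x + b

def tiny_network(x, W1, b1, W2, b2):
    """One-hidden-layer network: a nonlinear function of x."""
    hidden = np.tanh(np.outer(x, W1) + b1)   # shape (n_samples, n_hidden)
    return hidden @ W2 + b2                  # shape (n_samples,)

n_hidden = 8  # arbitrary choice for this sketch
W1 = rng.normal(size=n_hidden)
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=n_hidden)
b2 = rng.normal()

x = np.linspace(-2.0, 2.0, 5)
y_lin = linear(x, w=2.0, b=1.0)
y_net = tiny_network(x, W1, b1, W2, b2)

n_params_linear = 2
n_params_network = W1.size + b1.size + W2.size + 1
print(n_params_linear, n_params_network)
```

Both are just functions whose shape is fixed by the model class and whose exact behaviour is fixed by the parameter values; the network simply has more of them.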
General procedure
1. Start with data.
2. Choose a model class (defines the functional form).
3. Learn a mapping by adjusting the parameters.
4. Optimize the parameters to achieve the best fit.
o This process of parameter optimization = machine learning.
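The four steps can be sketched as a short training loop. This sketch assumes a linear model class, mean squared error as the measure of fit, and plain gradient descent as the optimizer; the notes do not prescribe any of these, and the data, learning rate and number of steps are invented.

```python
import numpy as np

# 1. Start with data (invented, generated from y = 3x - 2 without noise).
x = np.linspace(-1.0, 1.0, 20)
y = 3.0 * x - 2.0

# 2. Choose a model class: f(x) = w*x + b, starting from arbitrary parameters.
w, b = 0.0, 0.0

# 3 + 4. Learn the mapping by repeatedly adjusting the parameters
# in the direction that reduces the mean squared error.
learning_rate = 0.1
for step in range(500):
    error = w * x + b - y
    grad_w = 2.0 * np.mean(error * x)   # derivative of the MSE w.r.t. w
    grad_b = 2.0 * np.mean(error)       # derivative of the MSE w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

After the loop, w and b are close to the values 3 and -2 that were used to generate the data.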
In ML we learn functions based on data, and then we embed these in
decision-making (sometimes a decision is supported by multiple
distributions/functions, and therefore also by multiple models).
See business example slides!
1.3 Types of machine learning
There are 3 types of machine learning: supervised, unsupervised, and
reinforcement learning.
Supervised learning: learning a mapping x -> y or f(x) = y
y is the outcome/target/label
Depending on the type of y (target variable) we have:
o Classification -> if the target variable is discrete/categorical
o Regression -> if the target variable is continuous (ex: linear
regression!)
Prediction: estimation of an unknown value (doesn't have to be in the
future, ex: show a picture -> predict if it is a cat or dog)
o You can only make a prediction after you have learned the mapping
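A minimal sketch of supervised classification: the mapping x -> y is first learned from labelled examples and only then used for prediction. The nearest-centroid rule and the toy data below are assumptions chosen for brevity, not a method named in the notes.

```python
import numpy as np

# Labelled training data (invented): two features per example, label 0 or 1.
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def fit_centroids(X, y):
    """Learn the mapping: one mean point (centroid) per class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x_new):
    """Predict the class whose centroid is closest to x_new."""
    return min(centroids,
               key=lambda label: np.linalg.norm(x_new - centroids[label]))

centroids = fit_centroids(X_train, y_train)         # learning step
label = predict(centroids, np.array([2.8, 3.1]))    # prediction step
print(label)
```

Because the target here is a discrete label, this is classification; with a continuous target the same learn-then-predict pattern would be regression.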
Extra info on classification:
Binary categorical target variable
o Binary classification
o Binary outcome (only 2 outcomes, e.g. fraud or no fraud)
Categorical target variable: multiclass classification (more than 2
outcomes)
o Ordinal: e.g. predicting credit scores -> natural ordering between
the categories (e.g. low, medium, high, …)
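The three kinds of categorical target can be told apart by how their labels are encoded. The label lists and the integer codes below are illustrative assumptions, not part of the notes.

```python
# Binary target: exactly two outcomes.
binary_labels = ["fraud", "no fraud", "no fraud", "fraud"]
binary_codes = [1 if label == "fraud" else 0 for label in binary_labels]

# Multiclass target: more than two outcomes, no order implied by the codes.
animals = ["cat", "dog", "bird", "dog"]
classes = sorted(set(animals))
multiclass_codes = [classes.index(a) for a in animals]

# Ordinal target: the categories have a natural order, so the codes keep it.
order = {"low": 0, "medium": 1, "high": 2}
credit_scores = ["medium", "low", "high"]
ordinal_codes = [order[s] for s in credit_scores]

print(binary_codes, multiclass_codes, ordinal_codes)
```

For the ordinal case, comparing codes is meaningful (low < medium < high); for the multiclass case the numbers are only identifiers.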