Fundamentals of Data Science
Introduction
Fundamental concepts of data science
AI models can be useful in various situations
° complex pattern recognition
° can make very detailed videos
° AI can help with decision making
° good at coding BUT you need to understand how it works before you rely on it
° use of data to predict the future -> always need to stay critical
°… -> the use of filters that we as humans cannot see can have an
impact on the output of the model
Terminology
• Classic business intelligence
= you know what you are looking for
>> querying
you write a question and get an answer
>> OLAP (Online Analytical Processing)
precomputes interesting dimensions of large raw databases
• Data science
= extracts knowledge from data
• Data mining
= activity of extracting knowledge from data / focus on finding hidden patterns or relationships
• Big data
= data that is so large that a standard program cannot handle it, because of a lack of
storage or too much variety in data types
• Artificial Intelligence (AI) ! has a very vague definition !
= group of techniques that machines use to achieve intelligent behaviours
• Machine learning
= subset of AI in which techniques improve with data
• Deep learning
= subset of machine learning which uses neural network technology
-> useful for very complex computation
Applications
- demand forecasting systems used by Colruyt
- accurate risk assessment, e.g. used by banks when granting a loan
-…
BUT privacy may be a concern
-> in general, the more data is put into the model, the more accurate it tends to be BUT the less privacy you have
? What is data ?
the basis of all systems is DATA
Everything that can be stored in bits and bytes
-> can be structured ex. table
-> can be unstructured ex. text
Data is a valuable asset since it can help with better decision making
! important to know !
data itself is not valuable BUT the real value lies in extracting meaningful information from it
? What is a model ?
model = abstract representation of (a part of) reality
BUT the usefulness of a model depends on what you want to communicate
-> need to choose well which parts of reality need to be captured in the model
example
linear regression is a model
-> is all about finding the parameters of a function for given data
BUT we will have an error
= indicates how well your model fits reality
(difference between model and reality)
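A minimal sketch of the linear regression example above, in plain Python. The data points are made up for illustration and `fit_line` and `mean_squared_error` are hypothetical helper names. The sketch fits y = a*x + b by ordinary least squares and then measures the error as the average squared difference between model and reality.

```python
def fit_line(xs, ys):
    """Fit y = a * x + b by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(xs, ys, a, b):
    """Average squared difference between the model's prediction and reality."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up observations (assumption, not from the course).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, a, b)
print(f"model: y = {a:.2f} * x + {b:.2f}")
print(f"mean squared error: {mse:.4f}")
```

The error is never exactly zero here because the points do not lie on a perfect line: the model only captures part of reality.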
Types of machine learning
There are 3 types of machine learning
! on the exam you need to be able to select the right model type for a certain case !
• SUPERVISED LEARNING
= main application of machine learning
° there is a target variable ‘y’
° you have historical values of ‘y’
° if y discrete = classification: binary or multi-class (nominal/ordinal)
° if y continuous = regression
° often used for a prediction (estimation of an unknown value)
• UNSUPERVISED LEARNING
° there is no target variable ‘y’
° you don’t have historical values of ‘y’
° multiple ways to extract information from data
- anomaly detection: how different is observation x in comparison to the others
-> looking for outliers
-> can be useful to detect fraudulent actions
- clustering: making groups of similar observations
- generative models (special case): you have an existing set of inputs and want to generate more,
realistic inputs (= from the same distribution)
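A minimal sketch of the anomaly detection idea above (how different is observation x from the others), using only the Python standard library. The transaction amounts, the helper name `find_outliers` and the z-score threshold of 2.0 are assumptions chosen for illustration; a z-score is only one of several ways to measure how unusual an observation is.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all observations are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Made-up transaction amounts (assumption): one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
outliers = find_outliers(amounts)
print(outliers)
```

No label 'y' is used anywhere: the observation is flagged only because it differs from the others, which is why this counts as unsupervised.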
• REINFORCEMENT LEARNING
= has no real dataset, learns through interaction with an environment
! almost never used in a business context !
reasons
° the added value of it in business is small in comparison to the others
° no model but an agent (learns from trial and error)
° trial and error is not suitable in the real business world
° works only if you have a very good simulator (often not the case)
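A minimal sketch of an agent that learns from trial and error, as described above. It uses an epsilon-greedy strategy, which is one common choice and not necessarily the one meant in the course. The "simulator" is a toy assumption: three actions with fixed, made-up rewards that the agent does not know in advance. The names `run_agent` and `simulated_rewards` are hypothetical.

```python
import random


def run_agent(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(true_rewards)
    estimates = [0.0] * n_actions  # the agent's current belief about each action's reward
    counts = [0] * n_actions       # how often each action has been tried

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random action
        else:
            action = max(range(n_actions), key=lambda i: estimates[i])  # exploit
        reward = true_rewards[action]  # the simulator returns the reward
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]

    best = max(range(n_actions), key=lambda i: estimates[i])
    return best, estimates


# Toy simulator (assumption): fixed reward per action, hidden from the agent.
simulated_rewards = [0.2, 0.5, 0.8]
best_action, estimates = run_agent(simulated_rewards)
print("best action found:", best_action)
```

The agent needs many trials, including bad ones, before it settles on the best action. That is cheap in a simulator but costly with real customers, which matches the reasons listed above.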
exercise
detecting fraudulent transactions
-> supervised classification if there is a large number of labelled historical transactions
OR
-> unsupervised (anomaly detection) if there are no labelled historical transactions
predicting future income
-> supervised because you have a target = income
-> regression because ‘y’ is a continuous variable
! cannot be unsupervised because you need historical values of the target !
detecting the degree of a burn
-> supervised learning (multi-class classification) if there is historical data
OR
-> unsupervised based on clusters
composing new music based on existing music
-> generative because you need to learn the distribution of existing music to make new music
Supervised vs. unsupervised learning
° if you have historical data of high quality and clear labels
THEN supervised is better
° if proxies need to be made and you have only raw data (without labels)
THEN unsupervised is better
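The selection rules from these notes can be sketched as a small decision function. This is only a study aid under simplifying assumptions: the function name `choose_learning_type` and its three yes/no parameters are hypothetical, and real cases often need more nuance than three flags.

```python
def choose_learning_type(has_labels, target_is_continuous=False, wants_new_samples=False):
    """Pick a learning type from three yes/no questions about the case."""
    if wants_new_samples:
        # Goal is to generate new inputs from the same distribution.
        return "unsupervised: generative model"
    if not has_labels:
        # No historical values of the target 'y'.
        return "unsupervised: clustering or anomaly detection"
    if target_is_continuous:
        return "supervised: regression"
    return "supervised: classification"


# The exercise cases from these notes:
print(choose_learning_type(has_labels=True))                             # fraud, with labelled history
print(choose_learning_type(has_labels=False))                            # fraud, without history
print(choose_learning_type(has_labels=True, target_is_continuous=True))  # future income
print(choose_learning_type(has_labels=False, wants_new_samples=True))    # composing music
```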
CRISP-DM
= Cross-Industry Standard Process for Data Mining
-> structures the process, starting with a problem and ending with a solution
is an iterative process since it always needs to be adapted to a changing environment
1. Business understanding: what is the actual business problem and which data is needed?
2. Data understanding: need to understand the gathered data
3. Data preparation: preparing the data for the algorithm
! depending on the algorithm, different data types are needed !
4. Modeling: creating the model
5. Evaluation: how trustworthy is the model?
6. Deployment
Exercise – churn prediction (= customers leaving the company)
Attracting new customers is more costly than keeping existing ones
SO we want a model that predicts churn in order to keep those customers
could therefore use historical data, e.g. why a customer left
-> sometimes there is a gap between the data you would like to gather and the data you have
= supervised classification algorithm (churn / no churn)
BUT for that we need a label
-> need to make a choice at what moment someone is seen as ‘churn’
! the chosen data and the chosen label already have an impact on the solution !
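A minimal sketch of how the choice of label changes the outcome. It assumes, purely for illustration, that a customer counts as ‘churn’ after a fixed number of days without a purchase. The customer names, dates, the helper `label_churn` and both thresholds (90 and 120 days) are hypothetical.

```python
from datetime import date


def label_churn(last_purchase, reference_date, inactivity_days=90):
    """Return 1 (churn) if the customer has been inactive longer than the threshold, else 0."""
    days_inactive = (reference_date - last_purchase).days
    return 1 if days_inactive > inactivity_days else 0


# Made-up customers with the date of their last purchase (assumption).
last_purchases = {
    "A": date(2024, 1, 5),
    "B": date(2024, 5, 20),
    "C": date(2024, 3, 1),
}
reference = date(2024, 6, 1)

labels_90 = {c: label_churn(d, reference, 90) for c, d in last_purchases.items()}
labels_120 = {c: label_churn(d, reference, 120) for c, d in last_purchases.items()}
print("90-day definition: ", labels_90)
print("120-day definition:", labels_120)
```

Customer C is labelled as churn under the 90-day definition but not under the 120-day one, so the same raw data gives the classifier a different target depending on this choice.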
! interesting !
churn prediction is one of the most developed applications
-> almost all big companies use it because of its huge value