Data Mining
Organics Assignment
Table of contents
Summary
Business Problem
Data Mining Representation
Methodology - data mining approach
Data Exploration
Data partition creation of model sets
Data Modelling
Regression
Neural Network
  Development of models
  Model performance
  Neural network architecture of best model
Decision Tree
  Development of models
  Performance of models
  Overfitting and limitations
Analysis of the best model
Conclusion
Recommendation
References and Bibliography
References
Other Resources
Appendix
My Reflections on the Patchwork assignment
IMAT Data Mining Organics
Summary
This report presents the results of using the SAS data mining framework to build
models that will help the management of a supermarket concentrate its resources
on targeting the customers who are most likely to purchase organic products. Three
binary classification models were generated using regression analysis, decision
trees and neural networks. Age and AFFL were identified as the most important
predictors. All three models were predictive and performed satisfactorily.
The decision tree was chosen as the champion model on the basis of its
performance and the technique's ability to provide a non-technical explanation
of the model. It has a lift value of 2.97, that is, it identifies organic purchasers
at 2.97 times the rate achieved by selecting customers at random. Several
recommendations are made for improving data quality for the next cycle of data mining.
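The lift figure quoted above can be read as a ratio of response rates. The sketch below is illustrative only: the function name and the counts are hypothetical and are not taken from the organics data. It simply shows how a lift value is computed from a targeted group and the whole customer base.

```python
def lift(targeted_buyers, targeted_size, total_buyers, total_size):
    """Ratio of the response rate in the targeted group to the overall rate."""
    if targeted_size <= 0 or total_size <= 0 or total_buyers <= 0:
        raise ValueError("group sizes and total buyers must be positive")
    targeted_rate = targeted_buyers / targeted_size
    baseline_rate = total_buyers / total_size
    return targeted_rate / baseline_rate


# Hypothetical counts, chosen only to illustrate the arithmetic:
# 1,000 customers overall, 250 of whom buy organics (25% baseline);
# the model's top 100 customers contain 75 buyers (75% response).
example_lift = lift(75, 100, 250, 1000)
```

With these made-up counts the targeted group responds at three times the baseline rate; the report's value of 2.97 is read in the same way.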
Business Problem
The business problem is to identify the customers who are most likely to purchase
organic products in the supermarket. Data models will be built from a data set
(organics.xls) collected during the supermarket's incentive period. By identifying
customers who are likely to purchase organic products, the company will be able to
target its marketing efforts more effectively, which should result in more sales per
unit of marketing spend.
Data Mining Representation
The business problem, the identification of customers who are likely to buy organic
products, is a type of data mining representation known as a classification problem.
The most suitable target variable is ORGYN, which is a binary variable.
The remaining variables are given the role INPUT and are assigned the default
measurement levels, with the exception of AFFL, which has been changed from
interval to ordinal.
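The role and measurement-level assignments described above can be written down as simple metadata. This is a minimal Python sketch, not the SAS configuration itself: the `Variable` class, the `assign_metadata` helper and the rule that numeric columns default to interval and text columns to nominal are assumptions made for illustration. Only ORGYN, AFFL and Age are names taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    role: str   # "TARGET" or "INPUT"
    level: str  # "binary", "ordinal", "interval" or "nominal"


def assign_metadata(columns, target="ORGYN", overrides=None):
    """Build role/level metadata for each column.

    columns   -- mapping of column name to a sample value
    overrides -- mapping of column name to a non-default level
    Assumed default rule: numeric -> interval, anything else -> nominal.
    """
    overrides = overrides or {}
    variables = []
    for name, sample in columns.items():
        if name == target:
            # The target is treated as binary, as stated in the report.
            variables.append(Variable(name, "TARGET", "binary"))
            continue
        default = "interval" if isinstance(sample, (int, float)) else "nominal"
        variables.append(Variable(name, "INPUT", overrides.get(name, default)))
    return variables


# Sample values are hypothetical; AFFL is overridden from interval to ordinal.
metadata = assign_metadata(
    {"Age": 45, "AFFL": 10, "ORGYN": 1},
    overrides={"AFFL": "ordinal"},
)
```

Keeping the override separate from the default rule makes the one deliberate change (AFFL to ordinal) easy to see and to revisit in a later modelling cycle.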
Methodology - data mining approach
The process to be adopted is the first flow in the virtuous cycle of data mining, which
has four distinct steps:
1. Identify the business problem or opportunity.
2. Mine the data to transform it into actionable information.
3. Act on the information (a marketing initiative driven by the model, e.g. a
   targeted marketing offer). This step is outside the remit of the brief.
4. Measure the results of the marketing initiative (measure the profitability of the
   pilot marketing study using the model). This step is outside the remit of the brief.