Data Mining
Table of Contents:
0. General Introduction...............................................................................................................................................7
1. Introduction: Data-Analytic Thinking.....................................................................................................................14
1.1 The Ubiquity of Data Opportunities................................................................................................................14
1.2 Example: Hurricane Frances............................................................................................................................15
1.3 Example: Predicting Customer Churn..............................................................................................................15
1.4 Data Science, Engineering and Data-Driven Decision Making.........................................................................16
1.5 Data Processing and ‘Big Data’.......................................................................................................................17
1.6 From Big Data 1.0 to Big Data 2.0...................................................................................................................17
1.7 Data and Data Science Capability as a Strategic Asset....................................................................................18
1.8 Data-Analytic Thinking....................................................................................................................................19
1.9 This Book........................................................................................................................................................19
1.10 Data Mining and Data Science, Revisited (fundamental concepts)................................................................20
1.11 Chemistry Is Not About Test Tubes: Data Science Versus the Work of the Data Scientist.............................20
1.12 Summary........................................................................................................................................................20
2. Business Problems and Data Science Solutions.....................................................................................................21
2.1 From Business Problems to Data Mining Tasks................................................................................................21
2.2 Supervised Versus Unsupervised Methods......................................................................................................23
2.3 Data Mining and Its Results.............................................................................................................................24
2.4 The Data Mining Process.................................................................................................................................25
2.4.1 Business Understanding...............................................................................................................................25
2.4.2 Data Understanding......................................................................................................................................26
2.4.3 Data Preparation..........................................................................................................................................26
2.4.4 Modeling.......................................................................................................................................................26
2.4.5 Evaluation.....................................................................................................................................................27
2.4.6 Deployment..................................................................................................................................................27
2.5 Implications for Managing the Data Science Team..........................................................................................28
2.6 Other Analytics Techniques and Technologies................................................................................................28
2.6.1 Statistics........................................................................................................................................................28
2.6.2 Database Querying.......................................................................................................................................28
2.6.3 Data Warehousing........................................................................................................................................29
2.6.4 Regression Analysis.......................................................................................................................................29
2.6.5 Machine Learning and Data Mining..............................................................................................................29
2.6.6 Answering Business Questions with These Techniques................................................................................30
2.7 Summary..........................................................................................................................................................30
3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation...........................................31
3.1 Models, Induction, Deduction.........................................................................................................................31
3.2 Supervised Segmentation................................................................................................................................33
3.2.1 Selecting Informative Attributes...................................................................................................................34
3.2.2 Example: Attribute Selection with Information Gain (read)........................................................................37
3.2.3 Supervised Segmentation with Tree-Structured Models..............................................................................38
3.3 Visualizing Segmentations...............................................................................................................................39
3.4 Trees as Sets of Rules.......................................................................................................................................40
3.5 Probability Estimation......................................................................................................................................41
3.6 Example: Addressing the Churn Problem with Tree Induction (read).............................................................41
3.7 Summary..........................................................................................................................................................41
4. Fitting a Model to Data..........................................................................................................................................42
4.1 Classification via Mathematical Functions........................................................................................................43
4.1.1 Linear Discriminant Functions.......................................................................................................................45
4.1.2 Optimizing an Objective Function.................................................................................................................47
4.1.3 An Example of Mining a Linear Discriminant from Data (read)...................................................................47
4.1.4 Linear Discriminant Functions for Scoring and Ranking Instances................................................................48
4.1.5 Support Vector Machines, Briefly.................................................................................................................48
4.2 Regression via Mathematical Functions..........................................................................................................49
4.3 Class Probability Estimation and Logistic “Regression”....................................................................................49
4.3.1 Logistic Regression: Some Technical Details (read).....................................................................................50
4.4 Example: Logistic Regression versus Tree Induction (read)............................................................................50
4.5 Nonlinear Functions, Support Vector Machines, and Neural Networks..........................................................51
4.6 Summary..........................................................................................................................................................52
5. Overfitting and Its Avoidance................................................................................................................................53
5.1 Generalization.................................................................................................................................................53
5.2 Overfitting........................................................................................................................................................53
5.3 Overfitting Examined.......................................................................................................................................54
5.3.1 Holdout Data and Fitting Graphs..................................................................................................................54
5.3.2 Overfitting in Tree Induction.........................................................................................................................56
5.3.3 Overfitting in Mathematical Functions.........................................................................................................57
5.4 Example: Overfitting Linear Functions (read).................................................................................................57
5.5 Example: Why Is Overfitting Bad? (read)........................................................................................................58
5.6 From Holdout Evaluation to Cross-Validation..................................................................................................59
5.7 Example: The Churn Dataset Revisited (read)................................................................................................60
5.8 Learning Curves...............................................................................................................................................61
5.9 Overfitting Avoidance and Complexity Control................................................................................................62
5.9.1 Avoiding Overfitting with Tree Induction......................................................................................................62
5.9.2 A General Method for Avoiding Overfitting..................................................................................................62
5.9.3 Avoiding Overfitting for Parameter Optimization (lezen).............................................................................63
3
, 5.10 Summary........................................................................................................................................................63
6. Similarity, Neighbors, and Clusters........................................................................................................................64
6.1 Similarity and Distance....................................................................................................................................64
6.2 Nearest-Neighbor Reasoning...........................................................................................................................65
6.2.1 Example: Whiskey Analytics (read)..............................................................................................................65
6.3 Nearest Neighbors for Predictive Modeling.....................................................................................................66
6.3.1 How Many Neighbors and How Much Influence?........................................................................................67
6.3.2 Geometric Interpretation, Overfitting, and Complexity Control...................................................................68
6.3.3 Issues with Nearest-Neighbor Methods.......................................................................................................69
6.4 Some Important Technical Details Relating to Similarities and Neighbors......................................................70
6.4.1 Heterogeneous Attributes............................................................................................................................70
6.4.2 Other Distance Functions (read)..................................................................................................................70
6.4.3 Combining Functions: Calculating Scores from Neighbors (read)................................................................70
6.5 Clustering.........................................................................................................................................................71
6.5.1 Example: Whiskey Analytics Revisited (read)..............................................................................................71
6.5.2 Hierarchical Clustering..................................................................................................................................71
6.5.3 Nearest Neighbors Revisited: Clustering Around Centroids.........................................................................73
6.5.4 Example: Clustering Business News Stories (read)......................................................................................75
6.5.5 Understanding the Results of Clustering......................................................................................................75
6.5.6 Using Supervised Learning to Generate Cluster Descriptions (read)...........................................................76
6.6 Stepping Back: Solving a Business Problem Versus Data Exploration..............................................................77
6.7 Summary..........................................................................................................................................................77
7. Decision Analytic Thinking I: What Is a Good Model?............................................................................................78
7.1 Evaluating Classifiers.......................................................................................................................................78
7.1.1 Plain Accuracy and Its Problems...................................................................................................................78
7.1.2 The Confusion Matrix...................................................................................................................................79
7.1.3 Problems with Unbalanced Classes..............................................................................................................80
7.1.4 Problems with Unequal Costs and Benefits..................................................................................................81
7.1.5 Generalizing Beyond Classification...............................................................................................................81
7.2 A Key Analytical Framework: Expected Value..................................................................................................81
7.2.1 Using Expected Value to Frame Classifier Use..............................................................................................82
7.2.2 Using Expected Value to Frame Classifier Evaluation...................................................................................82
7.3 Evaluation, Baseline Performance, and Implications for Investments in Data.................................................84
7.4 Summary..........................................................................................................................................................85
8. Visualizing Model Performance.............................................................................................................................86
8.1 Ranking Instead of Classifying..........................................................................................................................86
8.2 Profit Curves....................................................................................................................................................87
8.3 ROC Graphs and Curves...................................................................................................................................88