DATA MINING FOR BUSINESS AND GOVERNANCE
Chris Emmery, Çiçek Güven & Gonzalo Nápoles
TABLE OF CONTENTS
Introduction to Data Mining ........................................................................................................................... 5
1. What is Data Mining? ................................................................................................................................ 5
1.1. Key aspects: Computation & Large data sets .................................................................................... 5
1.2. Big Data ............................................................................................................................................. 6
1.3. Applications ....................................................................................................................................... 6
2. What makes prediction possible?............................................................................................................... 6
3. Data Mining as Applied Machine Learning ................................................................................................ 7
3.1. Supervised learning ........................................................................................................................... 7
3.2. Unsupervised Learning ...................................................................................................................... 8
Introduction to Data Science ......................................................................................................................... 10
1. What is data science?............................................................................................................................... 10
1.1. Example ........................................................................................................................................... 10
1.2. Terminology..................................................................................................................................... 10
1.3. The algorithm .................................................................................................................................. 12
1.4. Evaluation ........................................................................................................................................ 12
1.5. Computer hardware ........................................................................................................................ 13
2. Representing data .................................................................................................................................... 14
2.1. How do we get data? ....................................................................................................................... 14
2.2. File formats: raw-level representation of files ................................................................................ 15
2.3. Databases: storing the data a bit more cleverly .............................................................................. 16
2.4. Data science in practice: 80% vs. 20% ............................................................................................. 16
2.5. Representation of data .................................................................................................................... 16
Articles week 1 ............................................................................................................................................. 17
Prediction (SL): regression & classification .................................................................................................... 20
1. What makes prediction possible?............................................................................................................. 20
1.1. Correlation Coefficient: Pearson’s r................................................................................................. 20
2. Regression ................................................................................................................................................ 23
3. Classification ............................................................................................................................................ 24
3.1. Decision boundaries to label parts of the data as being a certain category ..................... 26
3.2. ML algorithms for classification using decision boundaries ............................................................ 26
3.3. Multiclass classification (vs. binary classification) ........................................................... 35
4. Fitting and tuning ..................................................................................................................................... 36
4.1. Fitting............................................................................................................................................... 37
4.2. Tuning .............................................................................................................................. 38
5. Evaluation ................................................................................................................................................ 43
5.1. Metrics for evaluating a Regression Task ........................................................................................ 43
5.2. Metrics for evaluating a Classification Task..................................................................................... 43
5.3. Schemes for applying metrics in model selection ........................................................................... 46
5.4. Best practices & common pitfalls .................................................................................................... 49
6. Models ...................................................................................................................................................... 55
6.1. Model selection ............................................................................................................................... 55
6.2. What is ‘learning’? ........................................................................................................................... 55
Working with Text data ................................................................................................................................ 56
1. Representing text as vectors .................................................................................................................... 56
1.1. Converting to numbers .................................................................................................................... 56
2. Binary vectors for Decision Tree classification (ID3) ................................................................................. 58
2.1. Inferring rules (decisions) by information gain (ex. spam detection) .............................. 58
3. Using Vector Spaces and weightings ........................................................................................................ 62
3.1. Binary vs. Frequency........................................................................................................................ 62
3.2. Term frequencies............................................................................................................................. 62
3.3. (Inverse) document frequency ........................................................................................................ 64
3.4. Putting it together: tf * idf weighting............................................................................................... 64
3.5. Normalizing vector representations ................................................................................................ 65
4. Document classification using 𝑘-NN ........................................................................................................ 66
4.1. 𝓵𝟐 normalization ............................................................................................................................. 66
4.2. Cosine similarity .............................................................................................................................. 67
4.3. Using similarity in 𝑘-NN .................................................................................................... 67
5. Practical examples.................................................................................................................................... 70
5.1. Naive text cleaning .......................................................................................................................... 70
6. Document classification ........................................................................................................................... 73
6.1. Sentiment analysis ........................................................................................................................... 73
6.2. Build a model ................................................................................................................................... 75
6.3. Test our model ................................................................................................................................ 82
Dimensionality reduction .............................................................................................................................. 83
1. The importance of dimensions ................................................................................................................. 83
2. Visualization ............................................................................................................................................. 85
2.1. Box plots .......................................................................................................................................... 85
2.2. Histogram ........................................................................................................................................ 85
2.3. Scatter plots..................................................................................................................................... 85
3. Dimensionality reduction ......................................................................................................................... 86
3.1. Feature selection ............................................................................................................................. 86
3.2. Feature extraction ........................................................................................................................... 88
4. Deep neural networks .............................................................................................................................. 90
Unsupervised learning .................................................................................................................................. 91
1. Techniques................................................................................................................................ 92
1.1. Crisp clustering through the k-means algorithm (most important method) ................... 92
1.2. Fuzzy clustering through the fuzzy c-means algorithm .................................................... 93
1.3. Hierarchical clustering ..................................................................................................................... 95
2. Distance function...................................................................................................................................... 96
3. Evaluation method ................................................................................................................................... 97
3.1. The Silhouette coefficient/score ..................................................................................................... 97
3.2. Dunn index ...................................................................................................................................... 97
Association mining........................................................................................................................................ 98
1. Measures: support & confidence .............................................................................................................. 99
1.1. Support ............................................................................................................................................ 99
1.2. Confidence....................................................................................................................................... 99
2. Mining association rules......................................................................................................................... 100
3. Apriori algorithm ..................................................................................................................... 101
3.1. The algorithm ................................................................................................................................ 101
3.2. Considerations ............................................................................................................................... 102
3.3. Setting the support parameter (minsup)....................................................................................... 102
3.4. Pattern evaluation ......................................................................................................................... 103
4. Itemset taxonomy .................................................................................................................................. 104
4.1. Maximal frequent itemset ............................................................................................................. 104
4.2. Closed itemset ............................................................................................................................... 104
4.3. Maximal vs. closed......................................................................................................................... 105
5. Quantitative association rules ................................................................................................................ 105
Mining massive data ................................................................................................................................... 107
1. Parallelization......................................................................................................................................... 107
1.1. Requirements ................................................................................................................................ 108
1.2. How does parallelization work? .................................................................................................... 109
2. Bagging, Boosting, and Batching ........................................................................................................... 111
2.1. Boosting (ex. AdaBoost) ................................................................................................................ 111
2.2. Averaging (ex. Bagging, Random Forests) ..................................................................................... 113
2.3. Batching (online learning) ............................................................................................................. 115
2.4. Drawbacks of ensemble methods ................................................................................................. 116
3. Distributed Computing ........................................................................................................................... 117
3.1. Distributing Machine Learning models .......................................................................................... 117
3.2. Distributed file storage .................................................................................................................. 118
3.3. MapReduce .................................................................................................................... 119
Deep learning ............................................................................................................................................. 121
1. A brief history of AI ................................................................................................................................. 121
1.1. Alan Turing .................................................................................................................................... 121
1.2. Sci-project (1974) .......................................................................................................................... 122
1.3. The Sojourner Rover (1997) .......................................................................................................... 123
1.4. “Sub-symbolic” AI (1988-2016) ..................................................................................................... 123
2. Recognizing patterns .............................................................................................................. 123
2.1. Neural networks ............................................................................................................................ 123
2.2. McCulloch-Pitts Neurons (1947).................................................................................................... 125
2.3. Deep Learning (2015) .................................................................................................................... 126
3. Many successes of DL ............................................................................................................................. 131
4. Conclusion .............................................................................................................................................. 133
2.4. Drawbacks of ensemble methods ................................................................................................. 116
3. Distributed Computing ........................................................................................................................... 117
3.1. Distributing Machine Learning models .......................................................................................... 117
3.2. Distributed file storage .................................................................................................................. 118
3.3. Map reduce ................................................................................................................................... 119
Deep learning ............................................................................................................................................. 121
1. A brief history of AI ................................................................................................................................. 121
1.1. Alan Turing .................................................................................................................................... 121
1.2. Sci-project (1974) .......................................................................................................................... 122
1.3. The Sojourner Rover (1997) .......................................................................................................... 123
1.4. “Sub-symbolic” AI (1988-2016) ..................................................................................................... 123
2. Recognizing patterns .............................................................................................................................. 123
2.1. Neural networks ............................................................................................................................ 123
2.2. McCulloch-Pitts Neurons (1947).................................................................................................... 125
2.3. Deep Learning (2015) .................................................................................................................... 126
3. Many successes of DL ............................................................................................................................. 131
4. Conclusion .............................................................................................................................................. 133