0 Introduction
0.1 Importance of data for businesses
0.2 Importance of data for students
1 Data-analytical thinking
1.1 Why?
1.2 Examples
1.3 Data-analytical thinking
1.4 Data mining & data science
2 Business problems and data science solutions
2.1 Data mining tasks
2.2 Supervised vs unsupervised methods
2.3 Data mining
2.4 Implications for managing the data science team
2.5 Other analysis techniques and technologies
3 Introduction to predictive modelling
3.1 Terminology
3.2 Supervised segmentation
3.3 Selecting informative attributes
3.4 Supervised segmentation with tree-structured models
3.5 Visualizing segmentations
3.6 Probability estimation
3.7 Summary
4 Fitting a model to data
4.1 Linear discriminant functions
4.2 Optimizing an objective function
4.3 Support vector machines
4.4 Regression via mathematical functions
4.5 Classification: scoring and ranking
4.6 Class probability estimation & logistic regression
4.7 Logistic regression vs tree induction
4.8 What if the data is non-linear?
5 Overfitting and its avoidance
5.1 Overfitting & generalization
5.2 Recognizing overfitting
5.3 Why overfitting is bad
5.4 Preventing overfitting
6 Similarity, Neighbors & Clusters
6.1 Similarity & distance
6.2 Nearest neighbors
6.3 Geometric interpretation, overfitting and complexity control
6.4 Problems with the nearest neighbor method
6.5 Technical details explained
6.6 Clustering as similarity-based segmentation
6.7 Clustering results
6.8 What we have seen so far
7 What is a good model?
7.1 Evaluating classifiers
7.2 Generalizing beyond classification
7.3 Expected value framework
7.4 Baseline performance
8 Visualizing model performance
8.1 Ranking instead of classifying
8.2 Profit curves
8.3 ROC graphs & curves
8.4 Cumulative response and lift curve
8.5 Example: churn
9 Evidence and Probabilities
9.1 Combining evidence probabilistically
9.2 Bayes' Rule
9.3 Evidence lift
10 Representing and Mining Text
10.1 Text
10.2 Terminology (borrowed from IR = information retrieval)
10.3 Bag of words
10.4 Beyond bag of words
10.5 Example: Mining news stories for stock price movement
11 Decision Analytic Thinking II: Toward Analytical Engineering
11.1 Case: Fundraising for an association
11.2 Case: Churn
12 Other Data Science Tasks and Techniques
12.1 Co-occurrences & associations
12.2 Profiling
12.3 Link prediction
12.4 Data reduction & latent information
12.5 Bias, variance & ensemble methods
12.6 Causal explanation
13 Data science and Business Strategy
13.1 Competitive advantage
13.2 Data science management
13.3 Attracting & retaining data scientists
13.4 Small businesses
13.5 Data science maturity
13.6 Evaluating data mining proposals
14 Conclusion
14.1 Fundamental concepts of data science
14.2 Fundamental concepts in a case
14.3 A different way of thinking about a business problem
14.4 What data cannot do
14.5 Ethics & privacy
14.6 Cloud sourcing
0 Introduction
Importance of data for businesses
Every year:
• The amount of data doubles
• The cost of storing data falls
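To get a feel for what annual doubling means, here is a small sketch. It assumes a constant doubling rate and a hypothetical starting volume of 1 TB; the function name `data_volume` is chosen for illustration only.

```python
def data_volume(initial_tb: float, years: int) -> float:
    """Volume after `years` of annual doubling (assumed constant rate)."""
    return initial_tb * 2 ** years


# Hypothetical starting point: 1 TB of data today.
for year in (1, 5, 10):
    print(year, data_volume(1.0, year))
```

Under this assumption, ten years of doubling multiplies the volume by 2^10 = 1024.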
Big Data: a broad collection of data from different sources
Maslow's Hierarchy of Big Data:
• Collecting data gives us information, from which we then extract knowledge
o Handle this data with wisdom
Data warehouse vs data lake
• Data warehouse:
o Data is processed into a single schema before it is stored in the warehouse
▪ Note: data is never available in the form you need it
o The analysis is done on the "cleansed" data
o ETL: extract, transform, load
• Data lake:
o Data is stored "raw" and unstructured in the data lake
o Data is only selected and organized when it is needed
Figure 1: Data warehouse vs data lake
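The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names and the helper functions (`transform`, `load_warehouse`, `load_lake`, `query_lake`) are all hypothetical and do not correspond to any particular warehouse or lake product.

```python
import json

# Hypothetical raw events from two sources with inconsistent field names.
RAW_EVENTS = [
    '{"customer": "A", "amount": "12.50", "source": "web"}',
    '{"cust_id": "B", "total": 7, "source": "store"}',
]


def transform(raw: str) -> dict:
    """Map one raw record onto a single fixed schema (the 'T' in ETL)."""
    rec = json.loads(raw)
    return {
        "customer_id": rec.get("customer") or rec.get("cust_id"),
        "amount": float(rec.get("amount", rec.get("total", 0))),
    }


def load_warehouse(raw_events):
    """Warehouse: every record is cleansed before it is stored."""
    return [transform(r) for r in raw_events]


def load_lake(raw_events):
    """Lake: records are stored untouched; no schema is imposed on load."""
    return list(raw_events)


def query_lake(lake, source):
    """Lake: select and organize records only when a question comes up."""
    return [transform(r) for r in lake if json.loads(r)["source"] == source]


warehouse = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)
print(warehouse)
print(query_lake(lake, "store"))
```

In the warehouse the cleaning cost is paid once, up front, for all records; in the lake it is deferred until a specific analysis needs a specific slice of the data.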
Data in businesses: data is available and is turned into insights on which the business then acts