H0 INTRODUCTION ................................................................................................................................................. 1
WHY IS DATA SCIENCE IMPORTANT FOR BUSINESSES? ...................................................................................... 1
Law of mass digital storage ....................................................................................................................... 1
Big data................................................................................................................................................................ 1
Maslow's hierarchy of big data ............................................................................................................................ 1
Data warehouses & data marts ......................................................................................................................... 1
Data lakes ............................................................................................................................................................ 2
Data warehouse VS. data lakes ............................................................................................................................. 2
Data in bedrijven .................................................................................................................................................. 2
Data value trap ..................................................................................................................................................... 2
H1.1 DATA-ANALYTICAL THINKING .................................................................................................................... 3
INTRODUCTIE .......................................................................................................................................................... 3
WAAROM DATA-ANALYTICAL THINKING EN DATA SCIENCE? ....................................................................................... 3
Data opportunities ................................................................................................................................................ 3
Compliance with regulations ..................................................................................... 4
Possible applications ........................................................................................................................................... 5
EXAMPLES ........................................................................................................................................................ 6
Hurricane Frances − WalMart ................................................................................................................................ 6
Pregnancy prediction − Target ............................................................................................................................... 6
Churn prediction − Megatrends ............................................................................................................................. 6
WHAT IS DATA-ANALYTICAL THINKING? ....................................................................................................................... 6
Data science capability as strategic asset .......................................................................................................... 7
Signet Bank vs. Capital One .................................................................................................................................. 7
Amazon ............................................................................................................................................................... 8
Harrah’s Casinos .................................................................................................................................................. 8
Valuation of Facebook and Twitter ..................................................................................................................... 8
WHAT IS DATA SCIENCE? .......................................................................................................... 8
SUMMARY ..................................................................................................................................................... 11
H1.2 BUSINESS PROBLEMS & DATA SCIENCE SOLUTIONS ................................................................................ 12
DIFFERENT DATA MINING TASKS ..................................................................................................................... 12
Classification & class probability estimation ........................................................................................................ 12
Regression ......................................................................................................................................................... 12
Similarity matching ............................................................................................................................................. 12
Clustering .......................................................................................................................................................... 13
Co-occurrence grouping ..................................................................................................................................... 13
Profiling ............................................................................................................................................................. 13
Link prediction ................................................................................................................................................... 13
Data reduction ................................................................................................................................................... 13
Causal modeling ................................................................................................................................................ 13
Conclusion ........................................................................................................................................................ 14
Two high-level primary goals: prediction and description ............................................................................. 14
SUPERVISED VS. UNSUPERVISED METHODS............................................................................................................ 14
Example .......................................................................................................................................................... 14
THE DATA MINING PROCESS ...................................................................................................................................... 15
Important distinction ..................................................................................................................................... 15
Knowledge discovery in databases ................................................................................................................... 15
ANDERE ANALYSETECHNIEKEN EN -TECHNOLOGIEËN............................................................................................. 17
Statistics ............................................................................................................................................................ 17
Database querying ............................................................................................................................................ 17
OLAP-tools......................................................................................................................................................... 17
Data warehousing .............................................................................................................................................. 18
Regression analysis .......................................................................................................................................... 18
Machine learning (AI) and datamining (KDD) ..................................................................................................... 18
H2.1 INTRODUCTION TO PREDICTIVE MODELING ............................................................................................. 19
TERMINOLOGY ..................................................................................................................................................... 19
Model ................................................................................................................................................................ 19
In data science ................................................................................................................................................... 19
Two high-level primary goals: prediction & description ..................................................................................... 19
Instance............................................................................................................................................................. 19
Induction & deduction .......................................................................................................................................... 19
SUPERVISED SEGMENTATION .................................................................................................................................... 19
Complications ..................................................................................................................................................... 20
SELECTING INFORMATIVE ATTRIBUTES ................................................................................................. 21
Entropy ............................................................................................................................................................. 21
Information gain ................................................................................................................................................ 22
Example: calculating IG .................................................................................................................................... 22
Numeric values ................................................................................................................................................. 23
Regression problems .......................................................................................................................................... 23
SUPERVISED SEGMENTATION WITH TREE-STRUCTURED MODELS ............................................................................... 23
Example .......................................................................................................................................................... 24
Body shape ................................................................................................................................................... 24
Summary ..................................................................................................................................................... 25
OTHER REPRESENTATIONS .................................................................................................................................... 26
Visualizing segments .............................................................................................................................. 26
Decision lines & hyperplanes ................................................................ 26
Trees as sets of rules ......................................................................................................................... 27
PROBABILITY ESTIMATION .............................................................................. 27
Example .......................................................................................................................................................... 27
H2.2 FITTING A MODEL TO DATA....................................................................................................................... 28
CONTENTS ............................................................................................................................................................... 28
Decision Trees vS. parametric modeling ......................................................................................................... 28
Drie assumpties ................................................................................................................................................. 28
LINEAR DISCRIMINANT FUNCTIONS ........................................................................................................................ 28
Instance space ................................................................................................................................................... 28
Linear discriminant function .......................................................................................................................... 29
Optimizing the objective function ................................................................................................................... 30
Example: linear discriminant ....................................................................................................................... 30
CLASSIFICATION: SCORING & RANKING .................................................................................................................. 30
LINEAR MODEL FOR CLASSIFICATION................................................................................................................................ 31
SUPPORT VECTOR MACHINES (SVM) ................................................................................................................... 31
Logistic regression ............................................................................................................................................ 32
Linear regression ................................................................................................................................................ 33
WHAT IF THE DATA IS NON-LINEAR? ......................................................................................................................... 34
H3.1 OVERFITTING & ITS AVOIDANCE ................................................................................................................ 35
OVERFITTING ......................................................................................................................................................... 35
Definitie ............................................................................................................................................................. 35
Wat nu? ............................................................................................................................................................. 35
Holdout data & fitting graphs ............................................................................................................................... 35
PREDICTION TECHNIQUES & OVERFITTING ................................................................................................... 36
WHY IS OVERFITTING BAD?...................................................................................................................... 39
AVOIDING OVERFITTING ............................................................................................................................................ 40
CROSS VALIDATION ........................................................................................................................................... 40
LEARNING CURVES ............................................................................................................................................ 42
VERMIJDEN VAN OVERFITTING & COMPLEXITEITSCONTROLE .............................................................................. 42
H3.2 SIMILARITY, NEIGHBORS & CLUSTERS ....................................................................................................... 45
CALCULATE SIMILARITY ............................................................................................................................................. 45
gEBRUIK VAN SIMILARITY .................................................................................................................................... 45
AFSTAND ........................................................................................................................................................... 46
NEAREST-NEIGHBOUR REASONING (NN) ............................................................................................................ 47
Geometric interpretation, overfitting & complexity control .............................................................................. 50
3 problems with k-NN ....................................................................................................................................... 51
Technical details on NN: heterogeneous attributes............................................. 52
Technical details on NN: other distance functions ............................................... 52
CLUSTERING AS SIMILARITY-BASED SEGMENTATION ............................................................................................... 54
Supervised vs. unsupervised ............................................................................................................................... 54
Clustering = unsupervised segmentation ............................................................................................................. 54
2 types of clustering ............................................................................................................................................ 55
Hierarchical clustering vs. centroid clustering (k-means) ................................................................................... 58
Clustering results .......................................................................................................................................... 58
H4.1 DECISION ANALYTICAL THINKING 1: WHAT IS A GOOD MODEL? ....................................................... 59
INTRODUCTION ........................................................................................................................................................ 59
EVALUATING CLASSIFIERS ................................................................................................................................ 59
Plain accuracy ................................................................................................................................................... 59
Problem with unbalanced classes ............................................................................................................. 60
Confusion matrix ................................................................................................................................................ 61
Problems with unequal costs and benefits ............................................................................................................ 63
GENERALIZING BEYOND CLASSIFIERS ..................................................................................................................... 63
General principle .............................................................................................................................................. 63
EXPECTED VALUE FRAMEWORK .............................................................................................................................. 64
Using expected value to frame classifier use ........................................................................................................ 64
Using expected value to frame classifier evaluation ................................................................................... 65
Costs & benefits within the expected value framework ................................................................................................ 66
BASELINE PERFORMANCE (& CONSEQUENCES) ........................................................................................................... 69
Baseline model .................................................................................................................................................. 69
Algemene principes ............................................................................................................................................ 69
Andere ............................................................................................................................................................... 70
H4.2 VISUALISING MODEL PERFORMANCE ....................................................................................................... 71
RANKING IN PLAATS VAN CLASSIFICEREN .......................................................................................................... 71
WINSTCURVES ................................................................................................................................................... 73
ROC curves & AUC (Area under curve) ................................................................................................................. 74
CUMULATIVE RESPONSE & LIFT CURVES ............................................................................................................ 77
EXAMPLE: CHURN PREDICTION ...................................................................................................................... 78
H5.1 EVIDENCE AND PROBABILITIES ................................................................................................................ 82
VOORBEELD ...................................................................................................................................................... 82
COMBINING EVICENCE PROBABILISTICALLY ...................................................................................................... 82
JOINT PROBABILITY & INDEPENDENCE ............................................................................................................... 83
BAYES' RULE ...................................................................................................................................................... 83
Applying Bayes' rule to data science ................................................................................................ 84
Conditional independence & Naive Bayes............................................................................................................. 85
Advantages & disadvantages of Naive Bayes .................................................................................................................. 86
A MODEL OF EVIDENCE "LIFT" ............................................................................................................ 86
Example: evidence lifts from Facebook likes ........................................................................................................... 86
SUMMARY ................................................................................................................................................. 87
H5.2 REPRESENTING AND MINING TEXT ............................................................................................................ 88
DATA PREPARATION ............................................................................................................................................... 88
WHY IS TEXT IMPORTANT? ............................................................................................................................ 88
WHY IS TEXT DIFFICULT? ................................................................................................................................. 88
REPRESENTATION .............................................................................................................................. 89
Bag of words ...................................................................................................................................................... 89
Term frequency .................................................................................................................................................. 89
Normalization and stemming .................................................................................................................................. 90
Measuring sparseness: inverse document frequency .................................................................. 91
Combining TF & IDF: TFIDF .......................................................................................................................... 92
EXAMPLE.............................................................................................................................................................. 92
THE RELATIONSHIP OF IDF TO ENTROPY ........................................................................................................................... 93
BEYOND BAG OF WORDS ....................................................................................................................................... 94
N-gram sequence ............................................................................................................................................... 94
Named Entity Extraction ...................................................................................................................................... 94
Topic models ..................................................................................................................................................... 95
VOORBEELD: DATAMINING OM DE KOERSBEWEGING TE VOORSPELLEN .................................................................. 96