Summary

Summary Data Mining for Business & Governance full course

Pages
133
Uploaded on
29-04-2021
Written in
2020/2021

Summary of 133 pages for the course Data Mining For Business And Governance at UVT (Full course notes)












Content preview

DATA MINING FOR BUSINESS AND GOVERNANCE
Chris Emmery, Çiçek Güven & Gonzalo Nápoles



TABLE OF CONTENTS

Introduction to Data Mining ........................................................................................................................... 5
1. What is Data Mining? ................................................................................................................................ 5
1.1. Key aspects: Computation & Large data sets .................................................................................... 5
1.2. Big Data ............................................................................................................................................. 6
1.3. Applications ....................................................................................................................................... 6
2. What makes prediction possible?............................................................................................................... 6

3. Data Mining as Applied Machine Learning ................................................................................................ 7
3.1. Supervised learning ........................................................................................................................... 7
3.2. Unsupervised Learning ...................................................................................................................... 8

Introduction to Data Science ......................................................................................................................... 10
1. What is data science?............................................................................................................................... 10
1.1. Example ........................................................................................................................................... 10
1.2. Terminology..................................................................................................................................... 10
1.3. The algorithm .................................................................................................................................. 12
1.4. Evaluation ........................................................................................................................................ 12
1.5. Computer hardware ........................................................................................................................ 13
2. Representing data .................................................................................................................................... 14
2.1. How do we get data? ....................................................................................................................... 14
2.2. File formats: raw-level representation of files ................................................................................ 15
2.3. Databases: storing the data a bit more cleverly .............................................................................. 16
2.4. Data science in practice: 80% vs. 20% ............................................................................................. 16
2.5. Representation of data .................................................................................................................... 16

Articles week 1 ............................................................................................................................................. 17

Prediction (SL): regression & classification .................................................................................................... 20
1. What makes prediction possible?............................................................................................................. 20
1.1. Correlation Coefficient: Pearson’s r................................................................................................. 20
2. Regression ................................................................................................................................................ 23

3. Classification ............................................................................................................................................ 24
3.1. Decision boundaries to label parts of the data as being a certain category ..................................... 26
3.2. ML algorithms for classification using decision boundaries ............................................................ 26
3.3. Multiclass classification (vs. binary classification) ......................................................................... 35
4. Fitting and tuning ..................................................................................................................................... 36
4.1. Fitting............................................................................................................................................... 37
4.2. Tuning .............................................................................................................................................. 38

5. Evaluation ................................................................................................................................................ 43
5.1. Metrics for evaluating a Regression Task ........................................................................................ 43
5.2. Metrics for evaluating a Classification Task..................................................................................... 43
5.3. Schemes for applying metrics in model selection ........................................................................... 46
5.4. Best practices & common pitfalls .................................................................................................... 49
6. Models ...................................................................................................................................................... 55
6.1. Model selection ............................................................................................................................... 55
6.2. What is ‘learning’? ........................................................................................................................... 55

Working with Text data ................................................................................................................................ 56
1. Representing text as vectors .................................................................................................................... 56
1.1. Converting to numbers .................................................................................................................... 56
2. Binary vectors for Decision Tree classification (ID3) ................................................................................. 58
2.1. Inferring rules (decisions) by information gain: EX: Spam detection .............................................. 58
3. Using Vector Spaces and weightings ........................................................................................................ 62
3.1. Binary vs. Frequency........................................................................................................................ 62
3.2. Term frequencies............................................................................................................................. 62
3.3. (Inverse) document frequency ........................................................................................................ 64
3.4. Putting it together: tf * idf weighting............................................................................................... 64
3.5. Normalizing vector representations ................................................................................................ 65
4. Document classification using 𝑘-NN ........................................................................................................ 66
4.1. 𝓵𝟐 normalization ............................................................................................................................. 66
4.2. Cosine similarity .............................................................................................................................. 67
4.3. Using similarity in 𝒌-nn.................................................................................................................... 67
5. Practical examples.................................................................................................................................... 70
5.1. Naive text cleaning .......................................................................................................................... 70
6. Document classification ........................................................................................................................... 73
6.1. Sentiment analysis ........................................................................................................................... 73
6.2. Build a model ................................................................................................................................... 75
6.3. Test our model ................................................................................................................................ 82

Dimensionality reduction .............................................................................................................................. 83
1. The importance of dimensions ................................................................................................................. 83

2. Visualization ............................................................................................................................................. 85
2.1. Box plots .......................................................................................................................................... 85
2.2. Histogram ........................................................................................................................................ 85
2.3. Scatter plots..................................................................................................................................... 85
3. Dimensionality reduction ......................................................................................................................... 86
3.1. Feature selection ............................................................................................................................. 86
3.2. Feature extraction ........................................................................................................................... 88
4. Deep neural networks .............................................................................................................................. 90

Unsupervised learning .................................................................................................................................. 91
1. Techniques................................................................................................................................................ 92
1.1. CRISP through k-means algorithm (most important method) .......................................................... 92
1.2. Fuzzy through Fuzzy c-means algorithm........................................................................................... 93
1.3. Hierarchical clustering ..................................................................................................................... 95
2. Distance function...................................................................................................................................... 96
3. Evaluation method ................................................................................................................................... 97
3.1. The Silhouette coefficient/score ..................................................................................................... 97
3.2. Dunn index ...................................................................................................................................... 97

Association mining........................................................................................................................................ 98
1. Measures: support & confidence .............................................................................................................. 99
1.1. Support ............................................................................................................................................ 99
1.2. Confidence....................................................................................................................................... 99
2. Mining association rules......................................................................................................................... 100
3. Apriori algorithm ..................................................................................................................................... 101
3.1. The algorithm ................................................................................................................................ 101
3.2. Considerations ............................................................................................................................... 102
3.3. Setting the support parameter (minsup)....................................................................................... 102
3.4. Pattern evaluation ......................................................................................................................... 103
4. Itemset taxonomy .................................................................................................................................. 104
4.1. Maximal frequent itemset ............................................................................................................. 104
4.2. Closed itemset ............................................................................................................................... 104
4.3. Maximal vs. closed......................................................................................................................... 105
5. Quantitative association rules ................................................................................................................ 105

Mining massive data ................................................................................................................................... 107
1. Parallelization......................................................................................................................................... 107
1.1. Requirements ................................................................................................................................ 108
1.2. How does parallelization work? .................................................................................................... 109
2. Bagging, Boosting, and Batching ........................................................................................................... 111
2.1. Boosting (ex. AdaBoost) ................................................................................................................ 111
2.2. Averaging (ex. Bagging, Random Forests) ..................................................................................... 113
2.3. Batching (online learning) ............................................................................................................. 115
2.4. Drawbacks of ensemble methods ................................................................................................. 116

3. Distributed Computing ........................................................................................................................... 117
3.1. Distributing Machine Learning models .......................................................................................... 117
3.2. Distributed file storage .................................................................................................................. 118
3.3. Map reduce ................................................................................................................................... 119

Deep learning ............................................................................................................................................. 121
1. A brief history of AI ................................................................................................................................. 121
1.1. Alan Turing .................................................................................................................................... 121
1.2. Sci-project (1974) .......................................................................................................................... 122
1.3. The Sojourner Rover (1997) .......................................................................................................... 123
1.4. “Sub-symbolic” AI (1988-2016) ..................................................................................................... 123
2. Recognizing patterns .............................................................................................................................. 123
2.1. Neural networks ............................................................................................................................ 123
2.2. McCulloch-Pitts Neurons (1947).................................................................................................... 125
2.3. Deep Learning (2015) .................................................................................................................... 126
3. Many successes of DL ............................................................................................................................. 131
4. Conclusion .............................................................................................................................................. 133
Seller: clairevanroey, Universiteit Antwerpen