Summary

Full summary of the course data mining

Sold: 7
Pages: 177
Published: 09-04-2025
Written in: 2024/2025

Includes lecture notes for 2024-25: full notes for all chapters given in 2025, a more concise summary of some chapters at the start, and some extra lectures from the previous year that were skipped this year (mostly indicated), as well as some exam questions and practicals 3 and 4. 2052FBDBMW


Document information

Published: 9 April 2025
File updated: 20 May 2025
Number of pages: 177
Written in: 2024/2025
Type: Summary

Content preview

DATA MINING
TABLE OF CONTENTS


data analysis

introduction

processing principles

unsupervised clustering

principal component analysis

supervised learning

regression

machine learning methods

introduction

a bit of context
Big data

what is data?
attribute values
attribute types and properties

Dataset types
record data
graph data
ordered data

what is data mining?
statistics
challenges of data mining

tasks

‘learning’ the patterns from the data

‘discovering’ the patterns in the data
supervised classification
unsupervised classification
AI, what does it mean now?

processing principles

introduction

Unstructured data

common data processing steps
feature extraction
attribute transformation
discretization
Aggregation
Noise removal
Outlier removal
Sampling
Handling duplicate data: data clean-up
Handling missing values
Dimensionality reduction

processing steps for specific data types
image data
Survey data
sequence data
Network data
Text data
Omics data
chapter 3: univariate techniques

functional analysis of large data sets

chapter 4: unsupervised clustering

unsupervised versus supervised

clustering (exam)
what is clustering?

similarity
defining distance measures
what properties should a distance measure have?
generic technique: transformation distance / edit distance

dendrograms (exam)
a demonstration of hierarchical clustering using string edit distance

hierarchical clustering (exam)
bottom-up
methods to calculate the distance between 2 clusters / an object and a cluster

partitional clustering
how many clusters (k)?

chapter 5: principal component analysis = data projection

introduction

multivariate data
basic variable statistics: represent this data

data transformation

normalization

comparison between variables
covariance
correlation = normalised version of covariance

data projection
geometric interpretation
why use projections

how PCA works
loadings
scores
scree plot = variance in each principal component
example possum
example nutrition
influenza PCA
metagenomics: enterotypes

t-SNE
how does it work
Perplexity

chapter 6: supervised learning

the classification problem

the grasshopper problem
compile data set

regression vs classification
linear classifier
support vector machines (SVM)
decision value
predictive accuracy
confusion matrix (exam)
threshold and accuracy
ROC curve (exam)
PR curve
ROC vs PR (exam)

nearest neighbor classifier

regression

The regression problem

simple linear regression

multiple linear regression
best fit
optimization problem
evaluation of the model

non-linear regression

logistic regression

Cox regression

overfitting
How do we estimate the capacity of our model to overfit?

speed and scalability

interpretability

robustness

feature selection
How do we mitigate the sensitivity to irrelevant features?
Different methods of feature selection

regularized regression
trade-off between best fit, L1-norm and L2-norm

elastic net
common regularization regression approaches
examples

machine learning methods

introduction

classification
what do these methods have in common

decision trees
how to build a decision tree
Gini impurity
example

random forests
bootstrapping
bagging
Gini importance
example of RF TCR binding
summary random forest

neural networks (exam)
single layer perceptron
training the neural network
disadvantages

deep learning
applications deep learning

MPC

Exam question last year

summary of the practicals

automation
theory
exercise
new function
using a list

reshaping

multivariate data analysis
PCA
cluster analysis

machine learning
decision tree
Random forest
ROC curve: seeing which is better

regularized regression

typical R commands

tables from Excel to text file to R

editing in Excel

text file

editing in R

loading tables (files) into R and editing them

export a graph in PDF

making graphs

automation

automation of repetitive analyses
exercise

automation with a new function
exercise: a new function
exercise: using a list
exercise: combination of for-loops, functions and lists

reshaping

exercise

multivariate data analysis

Principal Component Analysis: the heptathlon dataset
exercise 2 PCA
exercise 3 PCA
exercise 4 PCA
exercise 5 PCA

cluster analysis: the wine dataset

hierarchical cluster analysis
Q1: How many clusters would you expect, based upon the dendrogram?
Q2: Is the clustering approximately in agreement with the origin of the wines? (Note: zoom in to be able to read the labels)
Q3: As pointed out in the theory lesson, there are several ways to calculate the dissimilarity between clusters, including: single linkage, complete linkage, average linkage and Ward linkage. These are also referred to as “agglomeration methods”.

partitional clustering

To do yourself

machine learning

supervised classification with decision trees and random forests
the breast cancer dataset
decision tree
random forest
heart disease
exercise: unsupervised methods

regularized regression
student data set





DATA ANALYSIS

INTRODUCTION

• Biomedical data within a multidisciplinary field
o Look at the data itself instead of at classical studies → few individuals, measuring a single parameter
o Wide spectrum
• BIG DATA is data for which conventional computing techniques are no longer sufficient because of its size and complexity
• It is a disruptive trend
o Need different data mining techniques, like AI, to acquire information
• BIG DATA is characterized by volume, velocity, variety and veracity
o Volume = size of data → collected everywhere
▪ Has become very cheap to acquire the data
▪ One of the biggest costs = data analysis (FASTQ file, etc.)
o Velocity = speed at which data is being generated = enormous
▪ Like a smartphone = location tracking, fitness apps, wifi, etc.
▪ At any given point in time → lots of data
o Variety = diversity of the data: about 80% is heterogeneous and unstructured
o Veracity = trustworthiness
▪ Biggest problem
▪ How reliable is the data?
▪ Something can always go wrong → can’t fully trust the data
• Mislabelling, etc.
• Needs to be excluded, or we just need to deal with it
• DATA management gap = too much data to actually analyse it → needs to be sifted through
o Data is generated so rapidly that we need satellites to connect places => need for more high-tech data options
o Need to consider how much data is involved
• “Data science” → opinions differ on how the term should be defined
• DATA is a collection of data objects (= samples) and their attributes
o Feature = attribute = property/characteristic of an object → column (specific, well-defined features)
▪ For example: eye colour, ID, location
▪ Discrete = no decimal values
• Eye colour, house numbers
▪ Continuous = real numbers
• Temperature, height, weight
o Object = collection of attributes → row
▪ Sample, individual
o Attribute values = numbers or symbols assigned to an attribute
▪ Eye colour: blue, green, brown, …
▪ Nominal = eye colour, sex, ID, zip codes
▪ Ordinal = height (tall, medium, short), grades → higher is better (the values have an order)
▪ Interval = calendar dates, temperature in Celsius or Fahrenheit
• No true zero
▪ Ratio = temperature in kelvin, length
• True zero
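
Why the true zero matters can be shown with a small worked example. The sketch below is only an illustration (the helper name and the temperatures are assumptions, not from the lecture): differences are meaningful on both the Celsius (interval) and kelvin (ratio) scale, but ratios are only meaningful on the kelvin scale.

```python
# Illustrative sketch: the helper name and the values are assumptions.

def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return temp_c + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios are only meaningful on a scale with a true zero (kelvin).
ratio_c = warm_c / cold_c  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"difference: {diff_c:.2f} °C vs {diff_k:.2f} K")
print(f"ratio on the Celsius scale: {ratio_c:.3f}")
print(f"ratio on the kelvin scale:  {ratio_k:.3f}")
```

The Celsius ratio changes if the zero point is shifted, which is exactly why interval attributes do not support statements such as "twice as much".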





• Dataset types
o Record data = collection of records with objects and attributes
▪ Data matrix = all objects have the same fixed set of numeric attributes
▪ Document data = each document becomes a term vector
• Each term is an attribute of the vector and the value of each term is the number of times the corresponding term occurs in the document
• More empty than filled cells = sparse matrix
▪ Transaction data = each record involves a set of items
• Grocery lists of different people
o Graph data = network that consists of nodes and their interactions
▪ Organic chemistry
▪ Molecular structures
▪ Interaction networks
o Ordered data: molecular sequences, temporal data (climate information in space and time, e.g. temperature) → has a clear structure
▪ Does not fit the strict data types above
▪ FASTA file, etc.
▪ Weather data, …
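
The document data type above can be sketched in a few lines. This is a minimal illustration, assuming simple whitespace tokenisation and three made-up toy documents; the function name `term_vectors` is hypothetical.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Toy documents, made up for illustration only.
docs = [
    "gene expression in tumor tissue",
    "tumor tissue and healthy tissue",
    "protein interaction network",
]

vocabulary, matrix = term_vectors(docs)
zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
print(f"{zeros} of {total} entries are zero")
```

Even for three short documents most entries are zero, which is what makes a document-term matrix sparse.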

DATA MINING is converting extracted information into useful knowledge = discovering meaningful patterns

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover
meaningful patterns
• Related to statistics (probability theory)
o Sounds similar: finding patterns, finding a good model and selecting a good clustering/classifier
o BUT statistics works on much smaller and simpler data, finds associations between attributes rather than between values, and verifies hypotheses rather than generating them → it does not discover anything new




Two main goals:
Description: data summarisation, e.g. average, min/max values, empirical probabilities, etc., i.e. consider the data itself
Inference: extract information, e.g. hypothesis testing, estimation, correlation analysis, etc., i.e. estimate the distribution underlying the data and use it to draw conclusions
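
The difference between the two goals can be illustrated with a small sketch. The numbers are made up, and the choice of the standard error and the Pearson correlation as inference examples is an assumption made for illustration only.

```python
import math
import statistics

# Made-up sample values, for illustration only.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
weights_kg = [55.0, 62.0, 66.0, 74.0, 80.0]

# Description: summarise the sample itself.
summary = {
    "mean": statistics.mean(heights_cm),
    "min": min(heights_cm),
    "max": max(heights_cm),
    "stdev": statistics.stdev(heights_cm),  # sample standard deviation
}

# Inference: say something about the population behind the sample.
# The standard error estimates how uncertain the sample mean is.
standard_error = summary["stdev"] / math.sqrt(len(heights_cm))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r = pearson(heights_cm, weights_kg)

print(summary)
print(f"standard error of the mean: {standard_error:.2f}")
print(f"correlation height/weight:  {r:.3f}")
```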

Challenges

- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data





Garbage in = garbage out
• The importance of good data
o Good data will (likely) give you good results
o Everybody thinks their data is great
• If a pattern is found, the expert might not like the result: who is to blame?
o The method? The expert?

MACHINE LEARNING = a subfield of artificial intelligence (AI) that focuses on the development of algorithms and models that
enable computers or machines to learn and make predictions or decisions without being explicitly programmed.




• SUPERVISED CLASSIFICATION = learning the patterns from the data → predict an unknown value
o Predicts a label (classification) or a continuous attribute (regression)
o Step 1: extract features of dogs and cats
o Step 2: make a collection of objects: every dog or cat is a point in feature space
o Step 3: a new, unknown observation enters the data
o Step 4: use a decision boundary: weights that split up the 2 classes in space
o Methods to train a model and estimate a decision boundary
▪ Support vector machine
▪ Decision tree
▪ Random forest
▪ Neural networks
o Workflow
▪ Give it examples to learn from (supervised classification)
▪ Let it learn features and build a model (decision boundary)
▪ Once we have a model, we can use it to classify unknowns
• UNSUPERVISED CLASSIFICATION = the computer detects interpretable patterns that describe the data (no predefined answer) = patterns in the dataset
• Market basket analysis = finding a few patterns
o Hierarchical clustering
o Association rule analysis
o Principal component analysis
o Need “Smart algorithms” -> frequent pattern mining
o Can very rapidly find all possible patterns within specific criteria
o Such approaches will be discussed in detail later!
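
The supervised workflow with dogs and cats described above can be sketched as follows. Note that this uses a nearest-centroid classifier, which is not one of the methods listed in the notes; it is swapped in here because it is probably the shortest way to obtain a decision boundary between two classes. The features (weight in kg, height in cm), the values and the function names are all made up for illustration.

```python
import math


def fit_centroids(samples, labels):
    """Learn one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Steps 1 and 2: labelled examples as points in feature space (weight kg, height cm).
samples = [(20, 50), (30, 60), (25, 55), (4, 25), (5, 23), (3, 24)]
labels = ["dog", "dog", "dog", "cat", "cat", "cat"]

# Train: the model is just the two centroids; the decision boundary is the
# set of points that are equally far from both of them.
model = fit_centroids(samples, labels)

# Steps 3 and 4: classify new, unknown observations.
print(predict(model, (22, 52)))
print(predict(model, (4.5, 26)))
```

Support vector machines, decision trees, random forests and neural networks follow the same fit-then-predict pattern but estimate more flexible boundaries.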

Outlier detection = we don’t know in advance what an outlier is → can’t really quantify it (needs an unsupervised method)

- Identification of an atypical sample or feature in a data set.
- Common first step in biomedical data analysis, e.g. data projection
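
As an illustration of an unsupervised outlier check, the sketch below applies Tukey's 1.5 × IQR rule. This rule is an assumption chosen for the example, not a method named in the notes, and the measurements are made up.

```python
import statistics


def find_outliers(values, k=1.5):
    """Flag values further than k * IQR outside the first or third quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up measurements with one atypical sample.
measurements = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.7]
print(find_outliers(measurements))
```

No labels are needed: the rule only uses the spread of the data itself, which is why it counts as unsupervised.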





PROCESSING PRINCIPLES

• Starting material for data mining = dirty data
o What you want = clean, normalized, structured, complete, non-redundant, etc.
▪ Some techniques can deal with noise
▪ No duplicates = non-redundant
o What you need = sample x feature matrix where each feature is a ratio or Boolean variable
▪ Pre-processing and transformation needed
▪ Unstructured data = no predefined structure
▪ Depends on the method we want to use
▪ Samples : rows
▪ Features : columns
• Processing steps (data set to data matrix)
o We want structured data
▪ Standardizes how data is related
▪ Determines structure
▪ Model can be represented in notation
▪ In a lot of cases we need to integrate different data types → they need to be in a combined representation
o Unstructured data
▪ No predefined structure
• Often text-heavy, irregularities, …
• Need to find a way to extract knowledge
a. Feature extraction: most data mining methods work best on numerical data matrices
i. Hopefully numerical, ratio features
ii. Take a data set and convert it
iii. Example: a data set of different patients with different blood types
iv. Plain text data → the method doesn’t understand the differences, so we need to extract features that capture what we want to learn → define features like “is the blood type rhesus positive?” → true = 1, false = 0 (a numerical feature that the data mining methods can understand)
v. By defining the features we capture all the information that we want
b. Attribute transformation
i. A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
ii. E.g. converting temperature from Fahrenheit to kelvin
iii. Log transformation = monotonic (the order of the values doesn’t change) → makes the distribution more normal → we can apply some of the more standard statistical methods → improves linearity as it transforms multiplicative effects into additive effects (this is why the log is quite useful prior to applying data mining methods)
iv. Z-normalization = subtract the mean and divide by the standard deviation, so the mean becomes 0 and the std becomes 1; also monotonic (a value of 1 then always means one std above the mean, which makes different features comparable)
v. IQR normalization = interquartile range normalization → especially if we have .. → subtract the median and divide by the interquartile range = can’t guarantee that the mean is zero
vi. Mapping data to a new space = Fourier and wavelet transforms
c. Discretization
i. A process of converting or partitioning continuous attributes, features or variables to
discretized or nominal attributes, features, variables or intervals
ii. Purpose:
• Noise reduction
• Focus on relevant intervals
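
The three processing steps above (feature extraction, attribute transformation, discretization) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the data are hypothetical, the z-normalization uses the population standard deviation, and the discretization uses equal-width bins, which is only one of several possible schemes.

```python
import math
import statistics


def extract_blood_features(blood_type: str) -> dict:
    """a. Feature extraction: turn a text label such as 'AB+' into 0/1 features."""
    return {
        "has_A": int("A" in blood_type),
        "has_B": int("B" in blood_type),
        "rhesus_positive": int(blood_type.endswith("+")),
    }


def log_transform(values):
    """b. Attribute transformation: natural log (all values must be > 0)."""
    return [math.log(v) for v in values]


def z_normalize(values):
    """b. Attribute transformation: mean 0 and (population) std 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]


def discretize(values, n_bins):
    """c. Discretization: equal-width bins, returns a bin index per value."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Made-up patients and measurements, for illustration only.
blood_types = ["A+", "O-", "AB+", "B-"]
concentrations = [1.0, 10.0, 100.0, 1000.0]

features = [extract_blood_features(bt) for bt in blood_types]
logged = log_transform(concentrations)  # multiplicative steps become additive
normalized = z_normalize(logged)
bins = discretize(logged, n_bins=2)

print(features)
print([round(v, 3) for v in logged])
print([round(v, 3) for v in normalized])
print(bins)
```

After the log transform the tenfold steps in the concentrations become equally spaced, which is the "multiplicative to additive" effect mentioned above; both transformations keep the order of the values.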



