Summary

Full summary of the course data mining

Name: Full summary of the course data mining
SKU: doc_7719670
Rating: 4.00 (1 reviews)
Author: biomed124

Includes the lecture notes for academic year 2024-2025: full notes for all chapters taught in 2025, a more concise summary of some chapters at the start, and some extra lectures from the previous year that were skipped this year (mostly indicated as such). Also includes some exam questions and practicals 3+4. Course code: 2052FBDBMW.

School, study and subject

Institution: Universiteit Antwerpen (UA)
Study: Biomedische Wetenschappen
Course: Data mining (2052FBDBMW)

Document information

Uploaded on: April 9, 2025
File last updated on: May 20, 2025
Number of pages: 177
Written in: 2024/2025
Type: Summary

Topics

practicals
data mining
advanced
data
analysis
summary

Content preview

DATA MINING
TABLE OF CONTENTS

data analysis .................................................................................................................................................... 7

introduction ......................................................................................................................................................... 7

processing principles ......................................................................................................................................... 10

unsupervised clustering ..................................................................................................................................... 13

principal component analysis ............................................................................................................................ 14

supervised learning ........................................................................................................................................... 15

regression .......................................................................................................................................................... 15

machine learning methods ................................................................................................................................ 17

introduction ................................................................................................................................................... 19

a bit of context .................................................................................................................................................. 19
Big data ......................................................................................................................................................... 19

what is data? ..................................................................................................................................................... 20
attribute values ............................................................................................................................................. 21
attribute types and properties ...................................................................................................................... 21

Dataset types .................................................................................................................................................... 22
record data .................................................................................................................................................... 22
graph data ..................................................................................................................................................... 23
ordered data ................................................................................................................................................. 23

what is data mining? ......................................................................................................................................... 23
statistics ........................................................................................................................................................ 24
challenges data mining ................................................................................................................................. 25

tasks .................................................................................................................................................................. 25

‘learning’ the patterns from the data ........................................................................................................ 25

‘discover’ the patterns in the data ........................................................................................................... 25
supervised classification................................................................................................................................ 25
unsupervised classification ........................................................................................................................... 27
AI, what does it mean now? .......................................................................................................................... 28

processing principles ...................................................................................................................................... 29

introduction ....................................................................................................................................................... 29

Unstructured data ............................................................................................................................................. 29

common data processing steps ......................................................................................................................... 29
feature extraction ......................................................................................................................................... 29
attribute transformation ............................................................................................................................... 30

1

, discretization ................................................................................................................................................. 30
Aggregation ................................................................................................................................................... 31
Noise removal ............................................................................................................................................... 31
Outlier removal ............................................................................................................................................. 32
Sampling ........................................................................................................................................................ 32
Handling duplicate data: data clean up ........................................................................................................ 33
Handling missing values ................................................................................................................................ 33
Dimensionality reduction .............................................................................................................................. 34

processing steps for specific data types ............................................................................................................. 35
image data..................................................................................................................................................... 35
Survey data.................................................................................................................................................... 35
sequence data ............................................................................................................................................... 35
Network data ................................................................................................................................................ 36
Text data ....................................................................................................................................................... 36
Omics data .................................................................................................................................................... 37

Chapter 3: univariate techniques ................................................................................................................... 41

functional analysis of large data sets ................................................................................................................ 45

chapter 4: unsupervised clustering ................................................................................................................. 48

unsupervised versus supervised ........................................................................................................................ 48

clustering (exam) ............................................................................................................................................. 48
what is clustering? ........................................................................................................................................ 48

similarity ............................................................................................................................................................ 49
defining distance measures .......................................................................................................................... 49
what properties should a distance measure have? ...................................................................................... 49
Generic Technique – transformation distance / Edit distance ...................................................................... 49

dendrograms (exam) ......................................................................................................................................... 50
a demonstration of hierarchical clustering using string edit distance ........................................................... 50

hierarchical clustering (exam) ........................................................................................................................ 50
bottom-up ..................................................................................................................................................... 51
methods to calculate distance between 2 clusters / object and cluster ....................................................... 51

partitional clustering ......................................................................................................................................... 54
how many clusters (k)? ................................................................................................................................. 54

chapter 5: principal component analysis = data projection ............................................................................ 57

introduction ....................................................................................................................................................... 57

multivariate data .............................................................................................................................................. 57
basic variable statistics: represent this data ................................................................................................. 57

data transformation .......................................................................................................................................... 57

normalization .................................................................................................................................................... 58

comparison between variables ......................................................................................................................... 58
covariance ..................................................................................................................................................... 58
correlation = normalised version of covariance ............................................................................................ 58

data projection .................................................................................................................................................. 59
geometric interpretation .............................................................................................................................. 59
why use projections ...................................................................................................................................... 59

how PCA works .................................................................................................................................................. 60
loadings ......................................................................................................................................................... 61
scores ............................................................................................................................................................ 62
scree plot = variance in each principal component ...................................................................................... 62
example possum ........................................................................................................................................... 63
example nutrition .......................................................................................................................................... 64
influenza PCA ................................................................................................................................................ 66
metagenomics: enterotypes ......................................................................................................................... 66

t-SNE.................................................................................................................................................................. 67
how does it work ........................................................................................................................................... 68
Perplexity ...................................................................................................................................................... 69

chapter 6: supervised learning ....................................................................................................................... 70

the classification problem ................................................................................................................................. 70

the grasshopper problem .................................................................................................................................. 70
compile data set ............................................................................................................................................ 70

regression vs classification ................................................................................................................................ 71
linear classifier .............................................................................................................................................. 71
support vector machines (SVM) ..................................................................................................................... 73
decision value ................................................................................................................................................ 73
predictive accuracy ....................................................................................................................................... 74
confusion matrix (exam) ............................................................................................................................... 74
threshold and accuracy ................................................................................................................................. 76
ROC-curve (exam) ......................................................................................................................................... 77
PR-curve ........................................................................................................................................................ 78
ROC vs PR (exam) .......................................................................................................................................... 78

nearest neighbor classifier ................................................................................................................................ 79

regression ...................................................................................................................................................... 81

The regression problem ..................................................................................................................................... 81

simple linear regression .................................................................................................................................... 81

multiple linear regression .................................................................................................................................. 82
best fit ........................................................................................................................................................... 83
optimization problem ................................................................................................................................... 83
evaluation of the model ................................................................................................................................ 84

non linear regression ......................................................................................................................................... 85

logistic regression .............................................................................................................................................. 89

cox regression.................................................................................................................................................... 91

overfitting .......................................................................................................................................................... 91
How do we estimate the capacity of our model to overfit? ......................................................................... 91

speed and scalability ......................................................................................................................................... 93

interpretability .................................................................................................................................................. 93

robustness ......................................................................................................................................................... 94

feature selection................................................................................................................................................ 94
How do we mitigate the sensitivity to irrelevant features? .......................................................................... 94
Different methods feature selection ............................................................................................................. 95

regularized regression ....................................................................................................................................... 95
trade-off between best fit, L1-norm and L2-norm ........................................................................................ 96

elastic net .......................................................................................................................................................... 87
common regularization regression approaches ............................................................................................ 88
examples ....................................................................................................................................................... 88

machine learning methods ............................................................................................................................. 97

introduction

classification ...................................................................................................................................................... 82
what do these methods have in common..................................................................................................... 83

decision trees..................................................................................................................................................... 97
how to build a decision tree ......................................................................................................................... 97
gini impurity .................................................................................................................................................. 97
example ......................................................................................................................................................... 99

random forests ................................................................................................................................................ 100
bootstrapping .............................................................................................................................................. 100
bagging ........................................................................................................................................................ 101
gini importance ........................................................................................................................................... 103
example of RF TCR binding .......................................................................................................................... 103
summary random forest ............................................................................................................................. 104

neural networks (exam) .................................................................................................................................. 104
single layer perceptron ............................................................................................................................... 105
training the neural network ........................................................................................................................ 107
disadvantages.............................................................................................................................................. 109

deep learning .................................................................................................................................................. 109

4

, applications deep learning .......................................................................................................................... 110

MPC ............................................................................................................................................................. 120

Exam question last year ............................................................................................................................... 120

summary of practicals ................................................................................................................................... 125

automation...................................................................................................................................................... 125
theory .......................................................................................................................................................... 125
exercise ....................................................................................................................................................... 125
new function ............................................................................................................................................... 126
using a list ................................................................................................................................................... 126

reshaping......................................................................................................................................................... 127

multivariate data analysis ............................................................................................................................... 127
PCA .............................................................................................................................................................. 127
cluster analysis............................................................................................................................................ 128

machine learning............................................................................................................................................. 129
decision tree ................................................................................................................................................ 129
Random forest............................................................................................................................................. 129
ROC curve: see which is better .................................................................................................................... 129

regularized regression ..................................................................................................................................... 129

typical R commands ..................................................................................................................................... 130

tables from Excel to text file to R ................................................................................................................. 131

editing in Excel ................................................................................................................................................. 131

text file ............................................................................................................................................................ 132

editing in R ....................................................................................................................................................... 132

loading tables (files) into R and editing them ............................................................................................... 132

export a graph in PDF ................................................................................................................................... 134

making graphs .............................................................................................................................................. 135

automation .................................................................................................................................................. 135

automation of repetitive analyses .................................................................................................................. 135
Exercise ....................................................................................................................................................... 137

automation with a new function ..................................................................................................................... 141
exercise: a new function ............................................................................................................................. 141
exercise: using a list .................................................................................................................................... 143
exercise: combination of for-loops, functions and lists .............................................................................. 144

5

,reshaping ..................................................................................................................................................... 144

oefening ...................................................................................................................................................... 145

multivariate data analysis ............................................................................................................................ 149

Principal Component Analysis: the heptathlon dataset .................................................................................. 149
exercise 2 PCA ............................................................................................................................................. 151
exercise 3 PCA ............................................................................................................................................. 152
exercise 4 PCA ............................................................................................................................................. 153
exercise 5 PCA ............................................................................................................................................. 154

cluster analysis: the wine dataset ................................................................................................................... 155

hierarchical cluster analysis ............................................................................................................................. 156
Q1: How many clusters would you expect, based upon the dendrogram? ................................................ 157
Q2: Is the clustering approximately in agreement with the origin of the wines (Note: zoom in to be able to read the labels) ........................................................................................................................................... 157
Q3: As pointed out in the theory lesson, there are several ways to calculate the dissimilarity between clusters, including: single linkage, complete linkage, average linkage and Ward linkage. These are also referred to as “agglomeration methods”.................................................................................................... 157

partitional clustering ....................................................................................................................................... 157

To do yourself ............................................................................................................................................... 158

machine learning.......................................................................................................................................... 161

supervised classification with decision trees and random forests................................................................... 161
the breast cancer dataset ........................................................................................................................... 161
decision tree ................................................................................................................................................ 162
random forest ............................................................................................................................................. 165
heart disease ............................................................................................................................................... 168
exercise: unsupervised methods ................................................................................................................. 170

regularized regression ..................................................................................................................................... 170
student data set .......................................................................................................................................... 170


DATA ANALYSIS

INTRODUCTION

• Biomedical data within a multidisciplinary field
o Look at the data as a whole instead of the classical studies → few individuals, measuring a single parameter
o Wide spectrum
• BIG DATA is data for which conventional computing techniques are no longer sufficient due to its size and complexity
• It is a disruptive trend
o Need different data mining techniques, like AI, to acquire information
• BIG DATA is characterized by volume, velocity, variety and veracity
o Volume = size of data → collected everywhere
▪ Acquiring data has become very cheap
▪ One of the biggest costs = data analysis (FASTQ files, etc.)
o Velocity = speed at which data is being generated = enormous
▪ Like a smartphone = location tracking, fitness apps, wifi, etc.
▪ At any given point in time → lots of data
o Variety = diversity of the data: about 80% is heterogeneous and unstructured
o Veracity = trustworthiness
▪ Biggest problem
▪ How reliable is the data?
▪ Something can always go wrong → can’t fully trust the data
• Mislabelling, etc.
• Unreliable data needs to be excluded, or we just need to deal with it
• DATA management gap = too much data to actually analyse → needs to be sifted through
o Data is generated so rapidly that we need satellites to connect places => need for more high-tech data options
o Need to consider how much data is involved
• “Data science” → opinions differ on how the definition should be applied
• DATA is a collection of data objects (= samples) and their attributes
o Feature = Attribute = property/characteristic of an object → column (specific well-defined features)
▪ For example: eye colour, ID, location
▪ Discrete = no decimal numbers (a finite or countable set of values)
• Eye color, house numbers
▪ Continuous = real numbers
• Temp, height, weight
o Object = collection of attributes → row
▪ Sample, individual
o Attribute values = numbers or symbols assigned to an attribute
▪ Eye color: blue, green, brown,…
▪ Nominal = eye color, sex, ID, zip codes
▪ Ordinal = height (tall, medium, short), grades → higher is better (the values have an order)
▪ Interval = calendar dates, temperature in Celsius or Fahrenheit
• No true zero
▪ Ratio = temperature in kelvin, length
• True zero
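
The difference between an interval and a ratio attribute can be made concrete with a minimal sketch. The two temperatures are arbitrary illustrative values, and `celsius_to_kelvin` is a helper name introduced here, not something from the course.

```python
# Interval vs. ratio scale, using temperature as in the notes above.
def celsius_to_kelvin(c):
    """Shift from the Celsius scale (no true zero) to kelvin (true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # assumed example values
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales (interval property).
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios are only meaningful when the scale has a true zero (ratio property).
ratio_c = t2_c / t1_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_k = t2_k / t1_k  # about 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 6), ratio_c, round(ratio_k, 3))
```

The difference is the same on both scales, while the ratio changes with the choice of zero point, which is why ratios should only be interpreted for ratio attributes.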


• Dataset types
o Record data = collection of records (objects) and their attributes
▪ Data matrix = all objects have the same fixed set of numeric attributes
▪ Document data = each document becomes a term vector
• Each term is an attribute of the vector and the value of each term is the number of times
the corresponding term occurs in the document
• More empty than filled = sparse matrix
▪ Transaction data = each record involves a set of items
• Grocery lists of different people
o Graph data = network that consists of nodes and their interactions
▪ Organic chemistry
▪ Molecular
▪ Interaction networks
o Ordered data: molecular sequences, temporal data (climate information in space and time →
temperature), has a clear structure
▪ Not in our strict data type
▪ Fasta file, etc.
▪ Weather data, …
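
The document-data representation above (one term vector per document, mostly zeros) can be sketched as follows. The three documents are made-up examples, and `term_vector` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy documents, not taken from the course material.
documents = [
    "the patient has a fever",
    "the patient has no fever and no cough",
    "gene expression data",
]

# Vocabulary = all distinct terms; each term becomes one attribute (column).
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc):
    """Count how often every vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

# Document-term matrix: one row per document, one column per term.
matrix = [term_vector(doc) for doc in documents]

# Fraction of zero entries: such matrices are typically sparse.
n_cells = len(matrix) * len(vocabulary)
n_zero = sum(value == 0 for row in matrix for value in row)
sparsity = n_zero / n_cells

print(vocabulary)
print(matrix)
print(f"sparsity = {sparsity:.2f}")
```

Even with three short documents more than half of the cells are zero; with a realistic vocabulary the matrix becomes far sparser.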

DATA MINING is converting extracted information into useful knowledge = discovering meaningful patterns

• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover
meaningful patterns
• Related to statistics (probability theory)
o Sounds similar: finding patterns, finding a good model and selecting a good clustering/classifier
o BUT statistics works with much smaller and simpler data, finds associations between attributes rather than
values, and does not generate hypotheses but verifies them → does not discover anything new

Two main goals:
Description: data summarisation, e.g. average, min/max values, empirical probabilities, etc., i.e. consider the data itself
Inference: extract information, e.g. hypothesis testing, estimation, correlation analysis, etc., i.e. guess and estimate the
distribution underlying the data and put it to use

Challenges

- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data


Garbage in = garbage out
• The importance of good data
o Good data will (likely) give you good results
o Everybody thinks their data is great
• If a pattern is found, the expert might not like the result: who is to blame?
o The method? The expert?

MACHINE LEARNING = a subfield of artificial intelligence (AI) that focuses on the development of algorithms and models that
enable computers or machines to learn and make predictions or decisions without being explicitly programmed.

• SUPERVISED CLASSIFICATION = learning the patterns from the data → predict an unknown value
o Predicts a label (classification) or a continuous attribute (regression)
o Step 1: extract features of dogs and cats
o Step 2: make a collection of objects: every dog or cat is a point in feature space
o Step 3: a new, unknown observation is added to the data
o Step 4: use a decision boundary: weights that split up the 2 classes in space
o Methods to train a model and estimate a decision boundary
▪ Support vector machine
▪ Decision tree
▪ Random forest
▪ Neural networks
o Workflow
▪ Give it examples to learn from (supervised classification)
▪ Let it learn features and build a model (decision boundary)
▪ Once we have a model, we can use it to classify unknowns
• UNSUPERVISED CLASSIFICATION = computer detects interpretable patterns that describe the data (no predefined
answer) = patterns in the dataset
• Market basket analysis = finding a few patterns
o Hierarchical clustering
o Association rule analysis
o Principal component analysis
o Need “smart algorithms” → frequent pattern mining
o Can very rapidly find all possible patterns within specific criteria
o Such approaches will be discussed in detail later!
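
The supervised workflow described above (labelled examples → model → classify unknowns) can be sketched with a nearest-centroid classifier. This is a deliberately simple stand-in, not one of the methods listed in the notes (support vector machine, decision tree, random forest, neural network); the features (weight, ear length) and all values are assumptions made for illustration.

```python
import math

# Hypothetical training examples: (weight_kg, ear_length_cm) with a known label.
training = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((5.0, 6.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((25.0, 10.0), "dog"),
]

def fit_centroids(examples):
    """Learn one mean feature vector (centroid) per label."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Build the model from the examples, then classify an unknown observation.
model = fit_centroids(training)
prediction = predict(model, (6.0, 7.5))
print(model, prediction)
```

Here the implied decision boundary is the set of points equally far from both centroids; the methods named in the notes learn more flexible boundaries but follow the same fit-then-predict pattern.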

Outlier detection = we don’t know in advance what an outlier is → can’t really quantify (need an unsupervised method)

- Identification of an atypical sample or feature in a data set.
- Common first step in biomedical data analysis, e.g. via data projection
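
As one illustration of flagging atypical samples without labels, the sketch below uses Tukey's interquartile-range rule. This rule is a swapped-in example and not necessarily the approach used in the course; the measurements are made-up values.

```python
import statistics

# Hypothetical measurements of one feature; 9.7 is deliberately atypical.
values = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.7]

# Quartiles of the data (no labels needed, so this is unsupervised).
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Tukey's rule: flag points further than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(outliers)
```

The factor 1.5 is a convention, not a law: whether a flagged sample is excluded or kept remains a judgment call for the analyst.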


PROCESSING PRINCIPLES

• Starting material from which you initiate data mining = dirty data
o What you want = clean, normalized, structured, complete, non-redundant, etc.
▪ Some techniques can deal with noise
▪ No duplicates = non-redundant
o What you need = sample × feature matrix where each feature is a ratio or Boolean variable
▪ Pre-processing and transformation needed
▪ Unstructured data = no predefined structure
▪ Depends on the method we want to use
▪ Samples : rows
▪ Features : columns
• Processing steps (data set to data matrix)
o We want structured data
▪ Standardizes how data is related
▪ Determines structure
▪ Model can be represented in notation
▪ In a lot of cases we need to integrate different data types → they need to be in a combined
representation
o Unstructured data
▪ No predefined structure
• Often text-heavy, irregularities, …
• Need to find a way to extract knowledge
a. Feature extraction: most data mining methods work best on numerical data matrices
i. Hopefully numerical, ratio features
ii. Take a data set and convert it
iii. Example: data set of different patients with different blood types
iv. Plain text data → the method doesn’t understand the differences, so we need to extract features that capture
what we want to learn → define features like “is the blood type rhesus positive?” → true = 1, false = 0
(a numerical feature that the data mining methods can understand)
v. By defining the features we have captured all the information that we want
b. Attribute transformation
i. A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
ii. Converting temperature from Fahrenheit to kelvin
iii. Log transformation = monotonic (order of the values doesn’t change) → makes the distribution more normal
→ can apply some of the more standard statistical methods → improves linearity as it
transforms multiplicative effects into additive effects (this is why the log is quite useful prior to
applying data mining methods)
iv. Z-normalization = subtract the mean and divide by the standard deviation (mean becomes 0, std becomes 1),
also monotonic (a value of 1 then means one std above the mean, so different features can be compared)
v. IQR normalization = interquartile range normalization → especially if we have .. → subtract the
median and divide by the interquartile range = can’t guarantee that the mean is zero
vi. Mapping data to a new space = Fourier and wavelet transforms
c. Discretization
i. A process of converting or partitioning continuous attributes, features or variables to
discretized or nominal attributes, features, variables or intervals
ii. Purpose:
• Noise reduction
• Focus on relevant intervals
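
The processing steps a to c above can be sketched on a toy data set. The patient records, the measured attribute (`crp_mg_l`), and the discretization cut points are all assumptions introduced for illustration, not values from the course.

```python
import math
import statistics

# Hypothetical patient records; the values are illustrative only.
patients = [
    {"blood_type": "A+", "crp_mg_l": 1.0},
    {"blood_type": "O-", "crp_mg_l": 10.0},
    {"blood_type": "B+", "crp_mg_l": 100.0},
    {"blood_type": "AB-", "crp_mg_l": 1000.0},
]

# a. Feature extraction: text -> numeric feature (rhesus positive: 1, else 0).
rhesus_positive = [1 if p["blood_type"].endswith("+") else 0 for p in patients]

# b. Attribute transformation: log10 is monotonic, so the order is preserved,
#    and multiplicative steps (x10) become additive steps (+1).
log_crp = [math.log10(p["crp_mg_l"]) for p in patients]

# b. Z-normalization: subtract the mean, divide by the standard deviation.
mean = statistics.mean(log_crp)
std = statistics.pstdev(log_crp)
z_crp = [(x - mean) / std for x in log_crp]

# c. Discretization: partition a continuous attribute into labelled intervals.
def discretize(value, low=0.5, high=2.5):
    # The cut points `low` and `high` are arbitrary assumptions.
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"

crp_bins = [discretize(x) for x in log_crp]

print(rhesus_positive, log_crp, [round(z, 2) for z in z_crp], crp_bins)
```

After these steps every column is numeric or categorical with a fixed set of levels, which is the sample × feature form most methods expect.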
