TABLE OF CONTENTS
data analysis
introduction
processing principles
unsupervised clustering
principal component analysis
supervised learning
regression
machine learning methods
introduction
a bit of context
Big data
what is data?
attribute values
attribute types and properties
Dataset types
record data
graph data
ordered data
what is data mining?
statistics
challenges of data mining
tasks
‘learning’ the patterns from the data
‘discovering’ the patterns in the data
supervised classification
unsupervised classification
AI, what does it mean now?
processing principles
introduction
Unstructured data
common data processing steps
feature extraction
attribute transformation
discretization
Aggregation
Noise removal
Outlier removal
Sampling
Handling duplicate data: data clean-up
Handling missing values
Dimensionality reduction
processing steps for specific data types
image data
Survey data
sequence data
Network data
Text data
Omics data
Chapter 3: univariate techniques
functional analysis of large data sets
chapter 4: unsupervised clustering
unsupervised versus supervised
clustering (exam)
what is clustering?
similarity
defining distance measures
what properties should a distance measure have?
Generic Technique – transformation distance / Edit distance
dendrograms (exam)
a demonstration of hierarchical clustering using string edit distance
hierarchical clustering (exam)
bottom-up
methods to calculate distance between 2 clusters / object and cluster
partitional clustering
how many clusters (k)?
chapter 5: principal component analysis = data projection
introduction
multivariate data
basic variable statistics: represent this data
data transformation
normalization
comparison between variables
covariance
correlation = normalised version of covariance
data projection
geometric interpretation
why use projections
how PCA works
loadings
scores
scree plot = variance in each principal component
example possum
example nutrition
influenza PCA
metagenomics: enterotypes
t-SNE
how does it work
Perplexity
chapter 6: supervised learning
the classification problem
the grasshopper problem
compile data set
regression vs classification
linear classifier
support vector machines (SVM)
decision value
predictive accuracy
confusion matrix (exam)
threshold and accuracy
ROC curve (exam)
PR curve
ROC vs PR (exam)
nearest neighbor classifier
regression
The regression problem
simple linear regression
multiple linear regression
best fit
optimization problem
evaluation of the model
non-linear regression
logistic regression
Cox regression
overfitting
How do we estimate the capacity of our model to overfit?
speed and scalability
interpretability
robustness
feature selection
How do we mitigate the sensitivity to irrelevant features?
Different methods of feature selection
regularized regression
trade-off between best fit, L1-norm and L2-norm
elastic net
common regularized regression approaches
examples
machine learning methods
introduction
classification
what do these methods have in common
decision trees
how to build a decision tree
Gini impurity
example
random forests
bootstrapping
bagging
Gini importance
example of RF TCR binding
summary random forest
neural networks (exam)
single layer perceptron
training the neural network
disadvantages
deep learning
applications of deep learning
MPC
Exam question last year
summary of the practicals
automation
theory
exercise
new function
using a list
reshaping
multivariate data analysis
PCA
cluster analysis
machine learning
decision tree
Random forest
ROC curve: seeing which one is better
regularized regression
typical R commands
tables from Excel to text file to R
editing in Excel
text file
editing in R
loading tables (files) into R and editing them
export a graph to PDF
making graphs
automation
automation of repetitive analyses
Exercise
automation with a new function
exercise: a new function
exercise: using a list
exercise: combining for-loops, functions and lists
reshaping
exercise
multivariate data analysis
Principal Component Analysis: the heptathlon dataset
exercise 2 PCA
exercise 3 PCA
exercise 4 PCA
exercise 5 PCA
cluster analysis: the wine dataset
hierarchical cluster analysis
Q1: How many clusters would you expect, based upon the dendrogram?
Q2: Is the clustering approximately in agreement with the origin of the wines? (Note: zoom in to be able to read the labels)
Q3: As pointed out in the theory lesson, there are several ways to calculate the dissimilarity between clusters, including: single linkage, complete linkage, average linkage and Ward linkage. These are also referred to as “agglomeration methods”.
partitional clustering
To do yourself
machine learning
supervised classification with decision trees and random forests
the breast cancer dataset
decision tree
random forest
heart disease
exercise: unsupervised methods
regularized regression
student data set
DATA ANALYSIS
INTRODUCTION
• Biomedical data within a multidisciplinary field
o Look at the data itself instead of at classical studies → those measure a single parameter in a few individuals
o Wide spectrum
• BIG DATA is data for which conventional computing techniques are no longer sufficient because of its size and complexity
• It is a disruptive trend
o Different data mining techniques, such as AI, are needed to extract information
• BIG DATA is characterized by volume, velocity, variety and veracity
o Volume = size of data → collected everywhere
▪ Acquiring data has become very cheap
▪ One of the biggest costs = data analysis (FASTQ files, etc.)
o Velocity = speed at which data is generated = enormous
▪ E.g. a smartphone: location tracking, fitness apps, Wi-Fi, etc.
▪ At any given point in time → lots of data
o Variety = diversity of the data: about 80% is heterogeneous and unstructured
o Veracity = trustworthiness
▪ Biggest problem
▪ How reliable is the data?
▪ Something can always go wrong → the data can never be fully trusted
• Mislabelling, etc.
• Unreliable data needs to be excluded, or we have to deal with it
• DATA management gap = too much data to actually analyse → it needs to be sifted through
o Data is generated so rapidly that satellites are needed to connect places => need for more high-tech data options
o Need to consider how much data is involved
• “Data science” → opinions differ on how the definition should be applied
• DATA is a collection of data objects (= samples) and their attributes
o Feature = Attribute = property/characteristic of an object → column (specific well-defined features)
▪ For example: eye colour, ID, location
▪ Discrete = no decimal values (finite or countable set of values)
• Eye color, house numbers
▪ Continuous = real numbers
• Temp, height, weight
o Object = collection of attributes → row
▪ Sample, individual
o Attribute values = numbers or symbols assigned to an attribute
▪ Eye color: blue, green, brown,…
▪ Nominal = eye color, sex, ID, zip codes
▪ Ordinal = height (tall, medium, short), grades → higher is better (the values have an order)
▪ Interval = calendar dates, temperature in °C or °F
• No true zero
▪ Ratio = temperature in kelvin, length
• True zero
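A minimal sketch of why the interval/ratio distinction matters, written in Python purely for illustration (the temperatures are hypothetical values): differences are meaningful on both scales, but ratios only make sense on a scale with a true zero.

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
# Ratio scale (kelvin): true zero, so ratios are meaningful.

def celsius_to_kelvin(t_celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return t_celsius + 273.15

t1_c, t2_c = 10.0, 20.0  # hypothetical example temperatures
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

diff_c = t2_c - t1_c        # difference on the interval scale
diff_k = t2_k - t1_k        # the same difference on the ratio scale
naive_ratio = t2_c / t1_c   # 2.0, but physically meaningless
true_ratio = t2_k / t1_k    # roughly 1.035, the meaningful ratio

print(f"difference: {diff_c:.1f} degC = {diff_k:.1f} K")
print(f"Celsius ratio: {naive_ratio:.3f}, kelvin ratio: {true_ratio:.3f}")
```

So 20 °C is not "twice as warm" as 10 °C; only the kelvin ratio can be interpreted that way.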
• Dataset types
o Record data = collection of records with objects and attributes
▪ Data matrix = all objects have the same fixed set of numeric attributes
▪ Document data = each document becomes a term vector
• Each term is an attribute of the vector and the value of each term is the number of times the corresponding term occurs in the document
• More empty than filled = sparse matrix
▪ Transaction data= each record involves a set of items
• Grocery list of different people
o Graph data = network that consists of nodes and their interactions
▪ Organic chemistry
▪ Molecular
▪ Interaction networks
o Ordered data: molecular sequences, temporal data (climate information in space and time → temperature); has a clear structure
▪ Does not fit our strict data types
▪ Fasta file, etc.
▪ Weather data, …
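The document data idea above (each document becomes a term vector of counts) can be sketched as follows. This is only an illustration under simple assumptions: the three toy documents are made up, terms are split on whitespace, and the helper name `term_vectors` is hypothetical.

```python
from collections import Counter

def term_vectors(documents):
    """Turn raw texts into a document x term count matrix."""
    tokenized = [doc.lower().split() for doc in documents]
    # Every distinct term becomes one attribute (column).
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = [  # hypothetical toy documents
    "gene expression data",
    "protein interaction network data",
    "gene gene interaction",
]
vocabulary, matrix = term_vectors(docs)

# Fraction of empty (zero) cells: real document matrices are far sparser.
n_cells = len(matrix) * len(vocabulary)
sparsity = sum(row.count(0) for row in matrix) / n_cells

print(vocabulary)
for row in matrix:
    print(row)
print(f"fraction of zeros: {sparsity:.2f}")
```

Even in this tiny example half of the cells are zero, which is why such data is usually stored as a sparse matrix.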
DATA MINING is converting extracted information into useful knowledge = discovering meaningful patterns
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
• Related to statistics (probability theory)
o Sounds similar: finding patterns, finding a good model and selecting a good clustering/classifier
o BUT statistics typically uses much smaller and simpler data, finds associations between attributes rather than between values, and verifies hypotheses instead of generating them → it does not discover anything new
Two main goals:
Description: data summarisation, e.g. average, min/max values, empirical probabilities, etc., i.e. consider only the data at hand
Inference: extract information, e.g. hypothesis testing, estimation, correlation analysis, etc., i.e. guess and estimate the distribution underlying the data and use that estimate to draw conclusions
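A small sketch of the two goals, assuming a made-up sample of heights and weights: the description part only summarises the observed values, while the inference part estimates properties of the underlying population (here a correlation and the standard error of the mean).

```python
import math
import statistics

# Hypothetical toy sample: height (cm) and weight (kg) of six individuals.
heights = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
weights = [55.0, 60.0, 66.0, 70.0, 77.0, 80.0]

# Description: summarise the observed data only.
summary = {
    "mean": statistics.mean(heights),
    "min": min(heights),
    "max": max(heights),
    # empirical probability of being taller than 170 cm
    "p_taller_170": sum(h > 170 for h in heights) / len(heights),
}

# Inference: estimate properties of the distribution behind the sample.
def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(heights, weights)
# Standard error: how uncertain the sample mean is as a population estimate.
std_error = statistics.stdev(heights) / math.sqrt(len(heights))

print(summary)
print(f"correlation r = {r:.3f}, standard error of the mean = {std_error:.2f}")
```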
Challenges
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data
Garbage in = garbage out
• The importance of good data
o Good data will (likely) give you good results
o Everybody thinks their data is great
• If a pattern is found, the expert might not like the result: who is to blame?
o The method? The expert?
MACHINE LEARNING = a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn and to make predictions or decisions without being explicitly programmed.
• SUPERVISED CLASSIFICATION = learning the patterns from the data → predict an unknown value
o Predicts a label (classification) or a continuous attribute (regression)
o Step 1: extract features (example: telling dogs from cats)
o Step 2: make a collection of objects: every dog or cat is a point in feature space
o Step 3: a new, unknown observation is added to the data
o Step 4: use a decision boundary to separate the two classes in feature space
o Methods to train a model and estimate a decision boundary
▪ Support vector machine
▪ Decision tree
▪ Random forest
▪ Neural networks
o Workflow
▪ Give it examples to learn from (supervised classification)
▪ Let it learn features and build a model (decision boundary)
▪ Once we have a model, we can use it to classify unknowns
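The workflow above can be sketched with a deliberately simple stand-in method. Note that this is a nearest-centroid classifier, not one of the methods listed above (SVM, decision tree, random forest, neural network). The two features (weight and ear length) and all values are hypothetical.

```python
import math

# Steps 1-2: labelled examples as points in feature space.
# Hypothetical features: (weight in kg, ear length in cm).
training = [
    ((4.0, 6.5), "cat"),
    ((3.5, 7.0), "cat"),
    ((5.0, 6.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((25.0, 10.0), "dog"),
]

def fit_centroids(samples):
    """Learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }

def predict(centroids, features):
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Train the model, then classify new, unknown observations (steps 3-4).
centroids = fit_centroids(training)
print(predict(centroids, (4.2, 6.8)))    # a light animal with short ears
print(predict(centroids, (28.0, 11.5)))  # a heavy animal with long ears
```

The implied decision boundary is the line of points that are equally far from both centroids; an unknown sample gets the label of the side it falls on.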
• UNSUPERVISED CLASSIFICATION = the computer detects interpretable patterns that describe the data (no predefined answer) = patterns in the dataset
• Market basket analysis = finding a few patterns
o Hierarchical clustering
o Association rule analysis
o Principal component analysis
o Need “smart algorithms” → frequent pattern mining
o These can very rapidly find all possible patterns that meet specific criteria
o Such approaches will be discussed in detail later!
Outlier detection = we don’t know in advance what an outlier is → can’t really quantify it (an unsupervised method is needed)
- Identification of an atypical sample or feature in a data set.
- Common first step in biomedical data analysis, e.g. via data projection
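As one possible illustration of unsupervised outlier detection, the sketch below uses Tukey's fences (values further than 1.5 × IQR outside the quartiles). This rule is an assumption chosen for the example, not a method named in these notes, and the measurements are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

measurements = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 9.7, 5.1]  # hypothetical values
outliers = iqr_outliers(measurements)
print(outliers)
```

No labels are needed: the rule only looks at how the values themselves are distributed, and flags 9.7 here because it lies far above the upper fence.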
PROCESSING PRINCIPLES
• Starting material from which you begin data mining = dirty data
o What you want= clean, normalized, structured, complete, non redundant, etc.
▪ Some techniques can deal with noise
▪ No duplicates = non redundant
o What you need = sample × feature matrix where each feature is a ratio or Boolean variable
▪ Pre-processing and transformation needed
▪ Unstructured data = no predefined structure
▪ Depends on the method we want to use
▪ Samples : rows
▪ Features : columns
• Processing steps (data set to data matrix)
o We want structured data
▪ Standardizes how data is related
▪ Determines structure
▪ Model can be represented in notation
▪ In a lot of cases different data types need to be integrated → they need to be in a combined representation
o Unstructured data
▪ No predefined structure
• Often text-heavy, irregularities, …
• Need to find a way to extract knowledge
a. Feature extraction: Most data mining methods work best on numerical data matrices
i. Hopefully numerical, ratio features
ii. Take a data set and convert it
iii. Example: a data set of different patients with different blood types
iv. Plain text data → the method does not understand the differences, so we need to extract features that capture what we want to learn → define features such as “is the blood type rhesus positive?” → true = 1, false = 0 (a numerical feature that data mining methods can understand)
v. By defining the features ourselves, we capture all the information we want
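The blood type example can be sketched as follows. Only the rhesus feature comes from the notes; the two antigen features, the patient records and the function name `extract_features` are assumptions added for illustration.

```python
# Hypothetical patient records with the blood type stored as text.
patients = [
    {"id": "P1", "blood_type": "A+"},
    {"id": "P2", "blood_type": "O-"},
    {"id": "P3", "blood_type": "AB+"},
]

def extract_features(record):
    """Map a text blood type to Boolean (0/1) features."""
    blood_type = record["blood_type"].strip().upper()
    abo, rhesus = blood_type[:-1], blood_type[-1]
    return {
        "rhesus_positive": int(rhesus == "+"),
        "has_A_antigen": int("A" in abo),   # assumed extra feature
        "has_B_antigen": int("B" in abo),   # assumed extra feature
    }

feature_names = ["rhesus_positive", "has_A_antigen", "has_B_antigen"]
# Sample x feature matrix: one row per patient, one column per feature.
matrix = [
    [extract_features(p)[name] for name in feature_names]
    for p in patients
]

for patient, row in zip(patients, matrix):
    print(patient["id"], row)
```

The result is a numerical sample × feature matrix in which every column is a Boolean variable.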
b. Attribute transformation
i. A function that maps the entire set of values of a given attribute to a new set of replacement
values such that each value can be identified with one of the new values
ii. Example: converting temperature from Fahrenheit to kelvin
iii. Log transformation = monotonic (the order of the values does not change) → makes the distribution more normal → some of the more standard statistical methods can be applied → improves linearity, as it transforms multiplicative effects into additive effects (which is why the log is quite useful before applying data mining methods)
iv. Z-normalization = subtract the mean and divide by the standard deviation, so the mean becomes 0 and the std becomes 1; also monotonic (a value of 1 then means one std above the mean, which makes different features comparable)
v. IQR normalization = interquartile normalization → especially if we have .. → subtract the median and divide by the interquartile range = can’t guarantee that the mean is zero
vi. Mapping data to a new space = Fourier and wavelet transforms
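A minimal sketch of three of these transformations on made-up, strongly skewed values. The IQR normalization is implemented here as "subtract the median, divide by the interquartile range", which is one common reading and should be treated as an assumption; the function names are hypothetical.

```python
import math
import statistics

values = [1.0, 10.0, 100.0, 1000.0, 10000.0]  # hypothetical skewed measurements

# Log transformation: monotonic, turns multiplicative steps into additive ones.
log_values = [math.log10(v) for v in values]

def z_normalize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean, std = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def iqr_normalize(xs):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return [(x - median) / (q3 - q1) for x in xs]

z_values = z_normalize(log_values)
robust_values = iqr_normalize(values)

print("log10: ", log_values)
print("z:     ", [round(z, 3) for z in z_values])
print("robust:", [round(r, 3) for r in robust_values])
print("mean of robust values:", round(statistics.mean(robust_values), 3))
```

Each step of a factor 10 becomes a step of 1 after the log, the z-normalized values have mean 0 and standard deviation 1, and the IQR-normalized values are centred on the median rather than on the mean.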
c. Discretization
i. A process of converting or partitioning continuous attributes, features or variables to
discretized or nominal attributes, features, variables or intervals
ii. Purpose:
• Noise reduction
• Focus on relevant intervals