Course notes

Notes Advanced Data Analysis

Pages: 131
Published: 19-02-2024
Written in: 2022/2023

This document consists of lecture notes from the theory lessons, supplemented with explanatory figures and additional information. It therefore contains all the theory to be studied for the exam, excluding the practicals.

Document information

Type: course notes
Professor(s): Kris Laukens
Contains: all classes

Content preview

Table of contents

Chapter 1: Introduction
  1.1: Introduction
    Before we start
      A few practical things
        Background
    A bit of context
      Big data
        Definition of big data
        Big data is characterized by
        Large scale data and AI brought a new data intensive research paradigm
      What is data? Some definitions of what we are dealing with and how we can represent it
        Data can be given by objects and attributes
          a) Data object
          b) Attribute
        Dataset types
          a) Record
          b) Graph
          c) Ordered
      Data mining
        What is data mining?
        Examples: Is it data mining?
        Data mining challenges
        Major tasks of data mining (after preprocessing)
          1) Supervised data mining
          2) Unsupervised data mining
        Data mining is business
        Value of data
        Evolution

Chapter 2: Processing principles
  2.1: Processing principles
    Introduction
      What you usually have vs. what you want and need
        In reality you usually have ‘dirty data’
        Data that you actually want/need is
    Pre-processing and transformation → to get more minable data that can be further used
      Role of pre-processing and transformation
        Unstructured data
        Common data processing steps that each make data more ready for data mining
          a) Feature extraction
          b) Attribute transformation = feature transformation
          c) Discretization
          d) Aggregation
          e) Noise removal
          f) Identifying outliers → outlier removal
          g) Sampling
          h) Handling duplicated data
          i) Handling missing values
          j) Dimensionality reduction
        Processing steps for specific data types: what types of features are we dealing with?
          a) Image data
          b) Survey data
          c) Sequence data
          d) Text data
          e) Omics data
          f) Temporal

Chapter 3: Unsupervised clustering
  3.1: Unsupervised clustering
    Introduction
      Unsupervised vs. supervised
        Quick overview of the difference between supervised and unsupervised
      Clustering
        What is clustering?
        Exists in different domains and has different names, but it does something quite similar
        Natural grouping
      Similarity
        What is similarity?
        Defining distance measures
        How do we measure similarity?
      Dendrograms
        What is it?
        Example
        Use of dendrograms
      Algorithms
    2 types of clustering
      Hierarchical clustering
        Principle
        Heuristic search (= a more practical, feasible way to come up with the best dendrogram, without forgetting that there are multiple options out there)
          → Since we cannot test all possible trees, we have to search heuristically through the possible trees. We could do this bottom-up or top-down.
          → Use a heuristic search → we cannot guarantee we get the optimal solution, but it is way faster than testing every option
        How to measure the distance between 2 clusters based on the distance function?
      Partitional clustering
        What is it?
        How many clusters? → how to specify k?
        K-means steps (simple & efficient algorithm)
        Importance of choosing initial centroids
        Weakness of k-means

Chapter 4: Principal component analysis (PCA)
  4.1: Principal component analysis (PCA)
    PCA as the backbone of modern data analysis
      What is principal component analysis and why is it necessary?
        PCA is the first thing you do when you get a new dataset
        Reasons to do PCA
        Multivariate data
      Important concepts
        Basic variable statistics
          a) Mean
          b) Median
          c) Range
          d) Variance
          e) Standard deviation
        Data transformation
          2) Comparing variables
    How does PCA work?
      Data projection
        Too many variables
        What’s data projection?
        Why use projections?
        Data visualization and simplification → data projection should capture as much of the information as possible
        Geometric interpretation of PCA
        PCA output: IMPORTANT for the exam to interpret the output!
        PCA usage: scores and loadings
        PCA examples
      t-SNE
        = alternative method for data projection
        How?
        Comparison PCA and t-SNE
        Perplexity
        Example: t-SNE for single cell RNAseq

Chapter 5: Supervised learning
  5.1: Supervised learning
    Introduction
      Classification problem = problem we have a lot of experience with
        Use features of an object to assign a hopefully correct label to an object
        Pigeon problems: training pigeons to classify paintings
        Grasshopper problem: Given a collection of annotated data. In this case 5 Katydids and 5 Grasshoppers, decide what type of insect the unlabeled example is (2 similar, but not identical animals)
    Regression vs. classification
      General
        Differences
      Classification
        a) Simple linear classifier
          General: what is a simple linear classifier?
          Support vector machines (SVM)
          Decision value
          Predictive accuracy
          Confusion matrix = matrix that fits all of the samples with the classified label vs. the true label
          Thresholds and accuracy
          ROC and PR curves
        b) Nearest neighbor classifier
          What is this type of classifier?

Chapter 6: Regression
  6.1: Regression
    Regression = a supervised machine learning (ML) model and can be used to analyze multivariate data (in data science you often need to deal with regression problems BUT this is different from ‘normal’ statistics)
      The regression problem
        Given a collection of annotated data (in this case a number of insects with their ages), you need to try to predict a variable about the data
      Regression vs. classification
        Classification
        Regression
      Types of regression
        Simple linear regression
        Multiple linear regression
        Non-linear regression
        Logistic regression
        Cox regression
        Regularized regression
      Considerations that need to be made with regression
        Overfitting
          - Intuitively we would say 9
          a) K-fold cross validation
          b) Leave-one-out cross validation (CV) = special case of K-fold cross validation when K = number of samples
        Speed and scalability
        Interpretability → model interpretability is really important and leads to model transparency
        Robustness

Chapter 7: Machine learning methods
  7.1: Machine learning methods
    Supervised machine learning methods
      Recap
        Supervised vs. unsupervised
      Classification
        Classification
        Classification algorithms
          a) Support vector machines
          b) Decision trees
          c) Random forest
          d) Neural networks (NN) and deep learning
          e) K-nearest neighbors



Chapter 1: Introduction

1.1: Introduction
• Introduction
  o Before we start
    ▪ A few practical things
      ➢ Background
        ♦ Background on bioinformatics, statistics, omics data analysis (NGS, microarrays, …), data mining and machine learning
  o A bit of context
    ▪ Big data
      ➢ What is big data?
        ♦ Over the last 5 decades, the view of the human system has evolved: from seeing the human body from multi-disciplinary perspectives to seeing the human system as a complex interplay between genes, proteins, small molecules, … that interact with each other in a very complex way and
Seller: jentebeeldens1 (Universiteit Antwerpen)