
Summary theory data mining

Pages: 75
Published: 29-05-2025
Written in: 2024/2025

This is a summary of all the theory handouts of the data mining course. It combines the information on the slides with my own notes. The lessons covered are: introduction, data processing, univariate techniques, unsupervised clustering, data projection, linear models and processing omics. A table of contents at the beginning helps keep a clear overview during the open-book exam.


Table of Contents

Chapter 1: Introduction
1.1. Big data
1.1.1 Volume
1.1.2 Velocity
1.1.3 Variety
1.1.4 Veracity
1.2 What is data?
1.2.1 Feature values
1.2.2 Feature types
1.2.3 Properties of features
1.2.4 Discrete vs. continuous
1.2.5 Dataset types
1.2. Data mining
1.2.6 Is it data mining?
1.2.7 Data mining is related to statistics
1.2.8 Data mining challenges
1.2.9 Garbage in = garbage out
1.3 Tasks
1.3.1 Two classes of techniques
1.3.2 Overview molecular applications

Chapter 2: Processing principles
2.1. Structured data
2.2. Unstructured data
2.3. Common data processing steps
2.3.1. Feature extraction
2.3.2. Attribute/feature transformation
2.3.3. Discretization
2.3.4. Aggregation
2.3.5. Noise removal
2.3.6. Outlier removal
2.3.7. Sampling
2.3.8. Handling duplicate data
2.3.9. Handling missing values
2.3.10. Dimensionality reduction
2.4. Processing steps for specific data types
2.4.1. Images
2.4.2. Surveys
2.4.3. Sequences
2.4.4. Structure data
2.4.5. Text data

Chapter 3: Univariate techniques
3.1. Differential analysis
3.1.1. Hypothesis testing
3.1.2. t-distribution
3.1.3. Central limit theorem
3.1.4. Negative binomial
3.2. Multivariate data
3.2.1. What is the distribution of p-values
3.2.2. QQ plot
3.2.3. Multiple testing correction
3.2.4. GWAS
3.2.5. Statistical test
3.3. Functional analysis of large data sets
3.3.1. Introduction
3.3.2. Overrepresentation analysis (ORA)
3.3.3. Gene set enrichment analysis (GSEA)

Chapter 4: Unsupervised clustering
4.1. Introduction
4.1.1. Clustering
4.1.2. Similarity
4.1.3. Dendrograms (slide 16-23)
4.1.4. Algorithms
4.2. Hierarchical clustering
4.2.1. Single linkage
4.2.2. Complete linkage
4.2.3. Group average linkage
4.2.4. Ward's linkage
4.3. Partitional clustering
4.3.1. How to tell right number of clusters?
4.3.2. Objective function: squared error (slide 57)
4.3.3. K-means steps

Chapter 5: Principal component analysis
5.1. Multivariate data
5.1.1. Basic variable statistics
5.1.2. Data transformation
5.1.3. Normalization
5.1.4. Comparison between variables
5.2. Data projection
5.2.1. What is a projection?
5.2.2. Why use projections?
5.3. Principal component analysis
5.3.1. How it works
5.3.2. Output
5.3.3. Scree plot
5.3.4. Usage
5.3.5. PCA simplifies data
5.3.6. Example: possum dataset
5.3.7. Example: nutrition dataset
5.3.8. Example: influenza
5.3.9. Example: enterotypes
5.4. t-SNE
5.4.1. Perplexity
5.4.2. t-SNE for single cell RNAseq
5.5. UMAP
5.6. UMAP vs t-SNE

Chapter 6: Linear models
6.1. Simple linear regression
6.2. Multiple linear regression
6.3. Supervised learning
6.4. Linear models
6.4.1. One way ANOVA
6.4.2. ANCOVA
6.4.3. Mixed model
6.4.4. Akaike information criterion
6.4.5. Elastic net
6.4.6. Regression example
6.5. Generalised linear models
6.5.1. Linear-response model
6.5.2. Generalised linear mixed model

Chapter 7: Molecular data analysis
7.1. Introduction
7.1.1. Quantitative profiles
7.1.2. The q-omics data matrix
7.1.3. Workflow quantitative profiles
7.2. Transcriptomics
7.2.1. Expression value variability
7.2.2. RNAseq introduction
7.3. Differential analysis
7.3.1. Two sample t-test
7.3.2. Linear model
7.3.3. Differential analysis
7.4. Expression downstream analysis
7.5. Proteomics
7.5.1. Relative vs absolute
7.5.2. Dynamic range of proteins is a challenge
7.5.3. Three ‘schools’
7.5.4. Protein quantity variability
7.5.5. Quantitative LC/MS processing
7.6. Protein identification
7.6.1. Tandem MS
7.6.2. Quantitative proteomics
7.6.3. Feature aggregation
7.6.4. Example CPTAC
7.6.5. Example: missing values
7.6.6. Example: PCA
7.6.7. Example: differential analysis
7.7. Metagenomics
7.7.1. Introduction
7.7.2. Binning
7.8. Flow cytometry
7.8.1. Multiparametric flow cytometry




Chapter 1: Introduction
People have always studied the human body from a multidisciplinary perspective. One perspective is
that of all the molecules of which our body consists. Our knowledge can now be based on
larger amounts of data than ever before: an avalanche of data is pushing biomedical science into big data.

Artificial intelligence is also used in data analysis: a computer is used to dig out patterns and
knowledge.
• Large-scale data and AI brought a new data-intensive research paradigm


1.1. Big data
= Data for which conventional computing techniques are no longer sufficient because of size,
complexity, ... Some increases in data size require exponentially more computing capacity. It is a
disruptive trend in computer science: it lets us do things that could not be done before.

= The moment you can no longer open your data set in Excel and work with it
=> so you need different techniques to study these data sets (one of these techniques is AI)
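
As one illustration of a "different technique" for data that does not fit in a spreadsheet, the sketch below streams through a CSV source row by row and keeps only a running sum in memory. It is a minimal sketch under assumptions: the column names and the toy data are hypothetical, and streaming is just one possible approach, not the one prescribed by the course.

```python
import csv
import io

def streaming_mean(rows, column):
    """Average one numeric column while holding only one row in memory."""
    total = 0.0
    count = 0
    for row in rows:
        value = row[column]
        if value == "":          # skip empty cells instead of failing
            continue
        total += float(value)
        count += 1
    return total / count if count else float("nan")

# Hypothetical toy data; a real file would be opened with open(path, newline="").
toy_file = io.StringIO("sample_id,expression\ns1,2.0\ns2,4.0\ns3,\ns4,6.0\n")
mean_expression = streaming_mean(csv.DictReader(toy_file), "expression")
print(mean_expression)
```

Because the function only ever looks at one row at a time, its memory use does not grow with the number of rows.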

Big data has 4 important characteristics: volume, velocity, variety and veracity.


1.1.1 Volume
= Size of the data set that we are working with
• It has become extremely cheap to generate these amounts of data
o The cost of sequencing a human genome keeps falling, a trend often compared to Moore's law
(in recent years sequencing costs have dropped even faster)


1.1.2 Velocity
= Speed at which we are generating new data
= Data collected at enormous speed
• Smartphone: example of continuously generating and collecting data
• Data management gap: we are generating far more data, but there are not enough
people to analyse it
• There is a need for new, effective, high-tech data transfer approaches
o E.g. put the data on a hard drive to transfer it
• Dynamic molecular profiles such as transcriptome profiling, sequencing the immune system,
single cell sequencing


1.1.3 Variety
= Different types of data that are added to our data sets
= Heterogeneous and lots of unstructured data
• Meaning there is not a single type of data
• The data is not simply organized in a matrix. A typical example is text or images. You
need context to be able to make sense of it.
• The huge diversity in data types includes DNA sequences, protein structures, gene regulation,
interactions, morphology and metabolism.


1.1.4 Veracity
= How reliable is our data? How precise were our measurements? ...
• It is a problem in life sciences. There is a lot of heterogeneity in how certain we are of certain
data points.
• There are many potential biases, uncertainties, artefacts, ...
• Missing data can also occur, and this can be a problem for data mining
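
A first check on veracity is simply to measure how much data is missing per feature. The sketch below is a minimal illustration; the feature names, the values and the use of None as the missing-value marker are assumptions made for the example.

```python
def missing_fraction(records, features):
    """Return, per feature, the fraction of records where the value is None."""
    n = len(records)
    if n == 0:
        return {}
    return {
        f: sum(1 for r in records if r.get(f) is None) / n
        for f in features
    }

# Hypothetical measurements; None marks a value that was not recorded.
records = [
    {"protein_a": 1.2, "protein_b": None},
    {"protein_a": 0.9, "protein_b": 3.1},
    {"protein_a": None, "protein_b": None},
    {"protein_a": 1.1, "protein_b": 2.8},
]
fractions = missing_fraction(records, ["protein_a", "protein_b"])
print(fractions)
```

A feature with a high missing fraction may need to be imputed or dropped before any mining step is applied.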



1.2 What is data?
Large-scale data and AI brought a new data-intensive research
paradigm => data science.
• A concept to "unify statistics, data analysis, informatics, and
their related methods" in order to "understand and analyse
actual phenomena" with data.

= Data is a collection of data objects (= samples) and their features
• A feature is a property or characteristic of an object
o Example: eye colour of a person, temperature, …
o A feature is also known as an attribute, variable, field or
characteristic
• A collection of features describes an object
o An object is also known as a record, point, case, sample, entity or instance
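
The objects-and-features view can be sketched as a small table in which each row is one data object and each key is one feature. The feature names and values below are made up for illustration.

```python
# Each row is one data object (sample); each key is one feature.
features = ["eye_colour", "temperature_c"]
dataset = [
    {"eye_colour": "brown", "temperature_c": 36.8},
    {"eye_colour": "blue", "temperature_c": 37.4},
    {"eye_colour": "green", "temperature_c": 36.5},
]

def feature_column(data, feature):
    """Collect the values of one feature across all objects."""
    return [obj[feature] for obj in data]

n_objects, n_features = len(dataset), len(features)
eye_colours = feature_column(dataset, "eye_colour")
print(n_objects, n_features, eye_colours)
```

Reading the table by row gives the description of one object; reading it by column gives all values of one feature.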




1.2.1 Feature values
= Numbers or symbols assigned to a feature
• Distinction between feature and feature value
o The same feature can be mapped to different feature values
§ Example: height can be measured in feet or meters
o Different features can be mapped to the same set of values
§ Example: the feature values for ID and age are both integers
o However, the properties of the feature values can still differ
§ Example: ID has no limit, but age has a maximum and minimum value
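
The distinction between a feature and its values can be made concrete with the two examples above. This is a minimal sketch: the numbers are invented, and the age bounds of 0 and 130 are illustrative assumptions, not values from the course.

```python
FEET_PER_METRE = 3.28084  # approximate conversion factor

# Same feature (height), two different sets of feature values.
heights_m = [1.60, 1.75, 1.90]
heights_ft = [h * FEET_PER_METRE for h in heights_m]

def ranking(values):
    """Indices of the objects sorted from smallest to largest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

# The unit changes the numbers but not the ordering of the objects.
same_order = ranking(heights_m) == ranking(heights_ft)

# Different features (ID, age) mapped to the same kind of value: integers.
ids = [1001, 1002, 1003]
ages = [23, 31, 45]

def is_plausible_age(age, lowest=0, highest=130):
    """Age is bounded; the bounds here are illustrative assumptions."""
    return lowest <= age <= highest

mean_age = sum(ages) / len(ages)   # meaningful: an average age
mean_id = sum(ids) / len(ids)      # computable, but carries no meaning

print(same_order, mean_age, is_plausible_age(200))
```

Both ID and age are stored as integers, yet only for age do operations such as averaging or range checking make sense, which is why the properties of a feature matter more than its storage type.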




Seller: WillemsenAmber, Universiteit Antwerpen