Summary theory data mining

Type: Summary
Published: 29-05-2025
Written in: 2024/2025
This is a summary of all the theory handouts of the Data mining course. It contains the information present on the slides and my own notes. The lessons covered in this summary are: introduction, data processing, univariate techniques, unsupervised clustering, data projection, linear models and processing omics. There is also a table of contents at the beginning to keep a clear overview during the open-book exam.

School, study and subject

Institution: Universiteit Antwerpen (UA)
Study: Biomedische Wetenschappen
Course: Data mining

Document information

Published: 29 May 2025
Number of pages: 75
Written in: 2024/2025
Type: Summary

Content preview

Table of Contents

Chapter 1: Introduction
1.1. Big data
1.1.1 Volume
1.1.2 Velocity
1.1.3 Variety
1.1.4 Veracity
1.2 What is data?
1.2.1 Feature values
1.2.2 Feature types
1.2.3 Properties of features
1.2.4 Discrete vs. continuous
1.2.5 Dataset types
1.2. Data mining
1.2.6 Is it data mining?
1.2.7 Data mining is related to statistics
1.2.8 Data mining challenges
1.2.9 Garbage in = garbage out
1.3 Tasks
1.3.1 Two classes of techniques
1.3.2 Overview molecular applications

Chapter 2: Processing principles
2.1. Structured data
2.2. Unstructured data
2.3. Common data processing steps
2.3.1. Feature extraction
2.3.2. Attribute/feature transformation
2.3.3. Discretization
2.3.4. Aggregation
2.3.5. Noise removal
2.3.6. Outlier removal
2.3.7. Sampling
2.3.8. Handling duplicate data
2.3.9. Handling missing values
2.3.10. Dimensionality reduction
2.4. Processing steps for specific data types
2.4.1. Images
2.4.2. Surveys
2.4.3. Sequences
2.4.4. Structure data
2.4.5. Text data

Chapter 3: Univariate techniques
3.1. Differential analysis
3.1.1. Hypothesis testing
3.1.2. t-distribution
3.1.3. Central limit theorem
3.1.4. Negative binomial
3.2. Multivariate data
3.2.1. What is the distribution of p-values
3.2.2. QQ plot
3.2.3. Multiple testing correction
3.2.4. GWAS
3.2.5. Statistical test
3.3. Functional analysis of large data sets
3.3.1. Introduction
3.3.2. Overrepresentation analysis (ORA)
3.3.3. Gene set enrichment analysis (GSEA)

Chapter 4: Unsupervised clustering
4.1. Introduction
4.1.1. Clustering
4.1.2. Similarity
4.1.3. Dendrograms (slide 16-23)
4.1.4. Algorithms
4.2. Hierarchical clustering
4.2.1. Single linkage
4.2.2. Complete linkage
4.2.3. Group average linkage
4.2.4. Ward's linkage
4.3. Partitional clustering
4.3.1. How to tell right number of clusters?
4.3.2. Objective function: squared error (slide 57)
4.3.3. K-means steps

Chapter 5: Principal component analysis
5.1. Multivariate data
5.1.1. Basic variable statistics
5.1.2. Data transformation
5.1.3. Normalization
5.1.4. Comparison between variables
5.2. Data projection
5.2.1. What is a projection?
5.2.2. Why use projections?
5.3. Principal component analysis
5.3.1. How it works
5.3.2. Output
5.3.3. Scree plot
5.3.4. Usage
5.3.5. PCA simplifies data
5.3.6. Example: possum dataset
5.3.7. Example: nutrition dataset
5.3.8. Example: influenza
5.3.9. Example: enterotypes
5.4. t-SNE
5.4.1. Perplexity
5.4.2. t-SNE for single cell RNAseq
5.5. UMAP
5.6. UMAP vs t-SNE

Chapter 6: Linear models
6.1. Simple linear regression
6.2. Multiple linear regression
6.3. Supervised learning
6.4. Linear models
6.4.1. One-way ANOVA
6.4.2. ANCOVA
6.4.3. Mixed model
6.4.4. Akaike information criterion
6.4.5. Elastic net
6.4.6. Regression example
6.5. Generalised linear models
6.5.1. Linear-response model
6.5.2. Generalised linear mixed model

Chapter 7: Molecular data analysis
7.1. Introduction
7.1.1. Quantitative profiles
7.1.2. The q-omics data matrix
7.1.3. Workflow quantitative profiles
7.2. Transcriptomics
7.2.1. Expression value variability
7.2.2. RNAseq introduction
7.3. Differential analysis
7.3.1. Two sample t-test
7.3.2. Linear model
7.3.3. Differential analysis
7.4. Expression downstream analysis
7.5. Proteomics
7.5.1. Relative vs absolute
7.5.2. Dynamic range of proteins is a challenge
7.5.3. Three ‘schools’
7.5.4. Protein quantity variability
7.5.5. Quantitative LC/MS processing
7.6. Protein identification
7.6.1. Tandem MS
7.6.2. Quantitative proteomics
7.6.3. Feature aggregation
7.6.4. Example CPTAC
7.6.5. Example: missing values
7.6.6. Example: PCA
7.6.7. Example: differential analysis
7.7. Metagenomics
7.7.1. Introduction
7.7.2. Binning
7.8. Flow cytometry
7.8.1. Multiparametric flow cytometry

Chapter 1: Introduction
People have always studied the human body from a multidisciplinary perspective. One such perspective
looks at all the molecules that make up our body. Our knowledge can now be based on
larger amounts of data than ever before: an avalanche of data is pushing biomedical science into big data.

Artificial intelligence is also used in data analysis: a computer is used to dig out patterns and
knowledge.
• Large-scale data and AI brought about a new data-intensive research paradigm

1.1. Big data
= Data for which conventional computing techniques are no longer sufficient due to size,
complexity, ... Some increases in data volume require exponentially more computing capacity. It is a
disruptive trend in computer science: it lets us do things that could not be done before.

= The moment you can no longer open your data set in Excel and do anything with it
=> so you need different techniques to study these data sets (one of these techniques is AI)

Big data has 4 important characteristics: volume, velocity, variety and veracity.

1.1.1 Volume
= Size of the data set that we are working with
• It is extremely cheap to generate this amount of data
o The cost of sequencing the human genome keeps decreasing, a trend often compared to Moore’s law

1.1.2 Velocity
= Speed at which we are generating new data
= Data collected at enormous speed
• Smartphone: an example of continuously generating and collecting data
• Data management gap: we are generating far more data, but we do not have enough
people to analyse it
• There is a need for new, effective, high-tech data transfer approaches
o E.g. putting data on a hard drive to transfer it
• Dynamic molecular profiles such as transcriptome profiling, sequencing the immune system and
single-cell sequencing

1.1.3 Variety
= Different types of data that are added to our data sets
= Heterogeneous and lots of unstructured data
• Meaning there is not a single type of data
• The data is not simply organized in a matrix. Typical examples are text and images: you
need context to be able to make sense of them.
• The huge diversity in data types includes DNA sequences, protein structures, gene regulation,
interactions, morphology and metabolism.

1.1.4 Veracity
= How reliable is our data? How precise were our measurements? ...
• This is a problem in the life sciences: there is a lot of heterogeneity in how certain we are of
individual data points.
• Many biases, uncertainties, artefacts, ... are possible.
• Missing data can also occur, which can be a problem for data mining

1.2 What is data?
Large-scale data and AI brought a new data-intensive research paradigm => Data science.
• Collection to "unify statistics, data analysis, informatics, and their related methods" in order to
"understand and analyse actual phenomena" with data.

= Data is a collection of data objects (= samples) and their features
• A feature is a property or characteristic of an object
o Example: eye colour of a person, temperature, …
o A feature is also known as an attribute, variable, field or characteristic
• A collection of features describes an object
o An object is also known as a record, point, case, sample, entity or instance
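
The definition above can be made concrete with a minimal sketch. The three samples and their feature names (`id`, `eye_colour`, `temperature_c`) are invented for illustration and do not come from the course material. Each dictionary is one data object, each key is a feature, and the same data can be viewed as a matrix with one row per object and one column per feature.

```python
# Hypothetical toy dataset: each dict is one data object (sample),
# each key is a feature, each value is a feature value.
samples = [
    {"id": 1, "eye_colour": "brown", "temperature_c": 36.8},
    {"id": 2, "eye_colour": "blue", "temperature_c": 37.4},
    {"id": 3, "eye_colour": "green", "temperature_c": 36.5},
]

# The feature names are shared by all objects.
features = list(samples[0].keys())

# The same data viewed as a matrix: one row per object, one column per feature.
matrix = [[sample[name] for name in features] for sample in samples]

print(features)
for row in matrix:
    print(row)
```

This row-per-object, column-per-feature layout is only one possible representation; it is chosen here because it matches the "collection of features describes an object" wording.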

1.2.1 Feature values
= Numbers or symbols assigned to a feature
• Distinction between feature and feature value
o The same feature can be mapped to different feature values
§ Example: height can be measured in feet or meters
o Different features can be mapped to the same set of values
§ Example: feature values for ID and age are both integers
o However, the properties of the feature values can still be different
§ Example: ID has no limit but age has a minimum and maximum value
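
The distinction between a feature and its values can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names `meters_to_feet` and `is_plausible_age`, the example numbers, and the upper age bound of 130 are all hypothetical and not part of the course material.

```python
FEET_PER_METER = 3.28084  # conversion factor: 1 m is about 3.28084 ft


def meters_to_feet(height_m: float) -> float:
    """Map the same feature (height) to a different feature value."""
    return height_m * FEET_PER_METER


# Same feature, two different feature values depending on the unit.
height_m = 1.80
height_ft = meters_to_feet(height_m)

# Different features mapped to the same set of values (integers).
person_id = 1042
age = 27


def is_plausible_age(value: int, max_age: int = 130) -> bool:
    """Age has a minimum and a maximum; max_age=130 is an assumed bound."""
    return 0 <= value <= max_age


print(round(height_ft, 2))
print(is_plausible_age(age), is_plausible_age(person_id))
```

Both `person_id` and `age` are stored as integers, yet a range check only makes sense for age, which is the point of the last bullet above.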


Seller: WillemsenAmber (Universiteit Antwerpen)
