Summary

Summary theory data mining

Uploaded on: 29-05-2025

Written in: 2024/2025

This is a summary of all the theory handouts of the Data mining course. It combines the information on the slides with my own notes. The lessons covered are: introduction, data processing, univariate techniques, unsupervised clustering, data projection, linear models and processing omics. A table of contents at the beginning helps to keep a clear overview during the open-book exam.

Written for

Institution: Universiteit Antwerpen (UA)
Study: Biomedische Wetenschappen
Course: Data mining

Document information

Uploaded on: 29 May 2025
Number of pages: 75
Written in: 2024/2025
Type: Summary

Preview of the content

Table of Contents

Chapter 1: Introduction
1.1. Big data
1.1.1 Volume
1.1.2 Velocity
1.1.3 Variety
1.1.4 Veracity
1.2 What is data?
1.2.1 Feature values
1.2.2 Feature types
1.2.3 Properties of features
1.2.4 Discrete vs. continuous
1.2.5 Dataset types
1.2. Data mining
1.2.6 Is it data mining?
1.2.7 Data mining is related to statistics
1.2.8 Data mining challenges
1.2.9 Garbage in = garbage out
1.3 Tasks
1.3.1 Two classes of techniques
1.3.2 Overview molecular applications

Chapter 2: Processing principles
2.1. Structured data
2.2. Unstructured data
2.3. Common data processing steps
2.3.1. Feature extraction
2.3.2. Attribute/feature transformation
2.3.3. Discretization
2.3.4. Aggregation
2.3.5. Noise removal
2.3.6. Outlier removal
2.3.7. Sampling
2.3.8. Handling duplicate data
2.3.9. Handling missing values
2.3.10. Dimensionality reduction
2.4. Processing steps for specific data types
2.4.1. Images
2.4.2. Surveys
2.4.3. Sequences
2.4.4. Structure data
2.4.5. Text data

Chapter 3: Univariate techniques
3.1. Differential analysis
3.1.1. Hypothesis testing
3.1.2. t-distribution
3.1.3. Central limit theorem
3.1.4. Negative binomial
3.2. Multivariate data
3.2.1. What is the distribution of p-values
3.2.2. QQ plot
3.2.3. Multiple testing correction
3.2.4. GWAS
3.2.5. Statistical test
3.3. Functional analysis of large data sets
3.3.1. Introduction
3.3.2. Overrepresentation analysis (ORA)
3.3.3. Gene set enrichment analysis (GSEA)

Chapter 4: Unsupervised clustering
4.1. Introduction
4.1.1. Clustering
4.1.2. Similarity
4.1.3. Dendrograms (slide 16-23)
4.1.4. Algorithms
4.2. Hierarchical clustering
4.2.1. Single linkage
4.2.2. Complete linkage
4.2.3. Group average linkage
4.2.4. Ward's linkage
4.3. Partitional clustering
4.3.1. How to tell the right number of clusters?
4.3.2. Objective function: squared error (slide 57)
4.3.3. K-means steps

Chapter 5: Principal component analysis
5.1. Multivariate data
5.1.1. Basic variable statistics
5.1.2. Data transformation
5.1.3. Normalization
5.1.4. Comparison between variables
5.2. Data projection
5.2.1. What is a projection?
5.2.2. Why use projections?
5.3. Principal component analysis
5.3.1. How it works
5.3.2. Output
5.3.3. Scree plot
5.3.4. Usage
5.3.5. PCA simplifies data
5.3.6. Example: possum dataset
5.3.7. Example: nutrition dataset
5.3.8. Example: influenza
5.3.9. Example: enterotypes
5.4. t-SNE
5.4.1. Perplexity
5.4.2. t-SNE for single cell RNAseq
5.5. UMAP
5.6. UMAP vs t-SNE

Chapter 6: Linear models
6.1. Simple linear regression
6.2. Multiple linear regression
6.3. Supervised learning
6.4. Linear models
6.4.1. One way ANOVA
6.4.2. ANCOVA
6.4.3. Mixed model
6.4.4. Akaike information criterion
6.4.5. Elastic net
6.4.6. Regression example
6.5. Generalised linear models
6.5.1. Linear-response model
6.5.2. Generalised linear mixed model

Chapter 7: Molecular data analysis
7.1. Introduction
7.1.1. Quantitative profiles
7.1.2. The q-omics data matrix
7.1.3. Workflow quantitative profiles
7.2. Transcriptomics
7.2.1. Expression value variability
7.2.2. RNAseq introduction
7.3. Differential analysis
7.3.1. Two sample t-test
7.3.2. Linear model
7.3.3. Differential analysis
7.4. Expression downstream analysis
7.5. Proteomics
7.5.1. Relative vs absolute
7.5.2. Dynamic range of proteins is a challenge
7.5.3. Three ‘schools’
7.5.4. Protein quantity variability
7.5.5. Quantitative LC/MS processing
7.6. Protein identification
7.6.1. Tandem MS
7.6.2. Quantitative proteomics
7.6.3. Feature aggregation
7.6.4. Example CPTAC
7.6.5. Example: missing values
7.6.6. Example: PCA
7.6.7. Example: differential analysis
7.7. Metagenomics
7.7.1. Introduction
7.7.2. Binning
7.8. Flow cytometry
7.8.1. Multiparametric flow cytometry

Chapter 1: Introduction
The human body has always been studied from a multidisciplinary perspective. One of these perspectives
looks at all the molecules that make up our body. Our knowledge can now be based on larger amounts of
data than ever before: an avalanche of data is pushing biomedical science into big data.

Artificial intelligence is also used in data analysis: a computer is used to dig out patterns and
knowledge.
• Large-scale data and AI brought a new data-intensive research paradigm

1.1. Big data
= Data for which conventional computing techniques are no longer sufficient because of its size,
complexity, ... Some increases in data size require exponentially more computing capacity. It is a
disruptive trend in computer science: it lets us do things that could not be done before.

= The moment you can no longer open your data set in Excel and work with it
=> so you need different techniques to study these data sets (one of these techniques is AI)
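
As a minimal sketch of what "different techniques" can mean in practice: instead of loading a whole table into memory, as a spreadsheet does, a file can be streamed row by row while only running totals are kept. The column name `expression`, the function name `streaming_mean` and the demo data are illustrative assumptions, not part of the course material.

```python
import csv
import io

def streaming_mean(lines, column):
    """Compute the mean of one numeric column while holding one row at a time."""
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:  # rows are read lazily, never all at once
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no data rows found")
    return total / count

# Tiny in-memory stand-in for a file that would be too large for a spreadsheet
demo = io.StringIO("sample,expression\ns1,2.0\ns2,4.0\ns3,6.0\n")
print(streaming_mean(demo, "expression"))  # 4.0
```

With a real file, the same function could be called with an open file handle instead of the `io.StringIO` object.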

Big data has 4 important characteristics: volume, velocity, variety and veracity.

1.1.1 Volume
= Size of data set that we are working with
• It has become extremely cheap to generate these amounts of data
o The cost of sequencing a human genome keeps decreasing, a trend often compared with Moore’s law

1.1.2 Velocity
= Speed at which we are generating new data
= data collected at enormous speed
• Smartphone: example of continuously generating and collecting data
• Data management gap: we are generating far more data, but we don’t have enough
people to analyse it
• There is a need for new, effective, high-tech data transfer approaches
o E.g. putting data on a hard drive to transfer it
• Dynamic molecular profiles such as transcriptome profiling, sequencing the immune system,
single cell sequencing

1.1.3 Variety
= Different types of data that are added to our data sets
= Heterogeneous and lots of unstructured data
• Meaning there is not a single type of data
• The data is not simply organized in a simple matrix. A typical example is text or images. You
need context to be able to make sense of it.
• The huge diversity in data types includes DNA sequences, protein structures, gene regulation,
interactions, morphology and metabolism.

1.1.4 Veracity
= How reliable is our data? How precise were our measurements? …
• It is a problem in life sciences. There is a lot of heterogeneity in how certain we are of certain
data points.
• There are a lot of potential biases, uncertainties, artefacts, ...
• Missing data can also occur, and this can be a problem for data mining
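
A quick first check on veracity is to count how much data is missing per feature. The sketch below assumes a hypothetical data set in which `None` marks a value that was not measured; the sample names, gene names and the function `missing_fraction` are made up for illustration.

```python
# Hypothetical measurements: None marks a value that was not measured
samples = {
    "s1": {"gene_a": 5.1, "gene_b": None},
    "s2": {"gene_a": 4.8, "gene_b": 2.3},
    "s3": {"gene_a": None, "gene_b": 2.9},
}

def missing_fraction(data):
    """Return, per feature, the fraction of samples with a missing value."""
    features = sorted({f for values in data.values() for f in values})
    return {
        f: sum(1 for values in data.values() if values.get(f) is None) / len(data)
        for f in features
    }

print(missing_fraction(samples))
```

Here each gene is missing in one of the three samples, so both fractions are one third.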

1.2 What is data?
Large-scale data and AI brought a new data-intensive research paradigm => data science.
• A field that aims to “unify statistics, data analysis, informatics, and their related methods”
in order to “understand and analyse actual phenomena” with data.

= Data is a collection of data objects (= samples) and their features
• A feature is a property or characteristic of an object
o Example: eye colour of a person, temperature, …
o An attribute is also known as a variable, field, characteristic
or feature
• A collection of features describes an object
o An object is also known as a record, point, case, sample, entity or instance
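
The definition above can be sketched as a small table in which each row is one object and each column is one feature. The persons, the feature names and the helper `feature_column` are hypothetical and only serve as an illustration.

```python
# Each row is one object (sample); each column is one feature.
feature_names = ["eye_colour", "temperature_c", "age"]
objects = {
    "person_1": ["blue", 36.8, 34],
    "person_2": ["brown", 37.2, 51],
}

def feature_column(data, names, feature):
    """Collect the values of one feature across all objects."""
    idx = names.index(feature)
    return {obj: values[idx] for obj, values in data.items()}

print(feature_column(objects, feature_names, "eye_colour"))
# {'person_1': 'blue', 'person_2': 'brown'}
```

Reading along a row gives the description of one object; reading down a column gives one feature over all objects.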

1.2.1 Feature values
= Numbers or symbols assigned to a feature
• Distinction between a feature and a feature value
o The same feature can be mapped to different feature values
§ Example: height can be measured in feet or meters
o Different features can be mapped to the same set of values
§ Example: attribute values for ID and age are integers
o However, the properties of feature values can still be different
§ Example: ID has no limit but age has a max and min value
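
The two examples above can be made concrete in a short sketch. The function names and the age bounds (0 to 130) are assumptions chosen for illustration; the notes only state that age has a minimum and a maximum.

```python
FEET_PER_METER = 3.28084  # standard conversion factor

def meters_to_feet(height_m):
    """Same feature (height), mapped to a different feature value."""
    return height_m * FEET_PER_METER

def is_plausible_age(age, low=0, high=130):
    """Age is an integer with a min and max; the bounds here are assumed."""
    return low <= age <= high

print(round(meters_to_feet(1.80), 2))  # 5.91

# ID and age are both stored as integers, but only age is bounded
person_id, age = 104_729, 27
print(is_plausible_age(age))        # True
print(is_plausible_age(person_id))  # False: a valid ID, but not a valid age
```

The last two lines show why the value type alone (integer) does not tell you what a feature means or which operations make sense on it.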

Seller: WillemsenAmber
