Type: Summary

Summary theory data mining

Rating: -
Sold: 1
Pages: 75
Uploaded on: 29-05-2025
Written in: 2024/2025

This is a summary of all the theory handouts of the course data mining. It contains information present on the slides and my own notes. Lessons that are present in this summary are: introduction, data processing, univariate techniques, unsupervised clustering, data projection, linear models and processing omics. There is also a table of contents in the beginning to keep a clear overview during the open book exam.

Table of Contents

Chapter 1: Introduction
1.1. Big data
1.1.1 Volume
1.1.2 Velocity
1.1.3 Variety
1.1.4 Veracity
1.2 What is data?
1.2.1 Feature values
1.2.2 Feature types
1.2.3 Properties of features
1.2.4 Discrete vs. continuous
1.2.5 Dataset types
1.2. Data mining
1.2.6 Is it data mining?
1.2.7 Data mining is related to statistics
1.2.8 Data mining challenges
1.2.9 Garbage in = garbage out
1.3 Tasks
1.3.1 Two classes of techniques
1.3.2 Overview molecular applications

Chapter 2: Processing principles
2.1. Structured data
2.2. Unstructured data
2.3. Common data processing steps
2.3.1. Feature extraction
2.3.2. Attribute/feature transformation
2.3.3. Discretization
2.3.4. Aggregation
2.3.5. Noise removal
2.3.6. Outlier removal
2.3.7. Sampling
2.3.8. Handling duplicate data
2.3.9. Handling missing values
2.3.10. Dimensionality reduction
2.4. Processing steps for specific data types
2.4.1. Images
2.4.2. Surveys
2.4.3. Sequences
2.4.4. Structure data
2.4.5. Text data

Chapter 3: Univariate techniques
3.1. Differential analysis
3.1.1. Hypothesis testing
3.1.2. t-distribution
3.1.3. Central limit theorem
3.1.4. Negative binomial
3.2. Multivariate data
3.2.1. What is the distribution of p-values
3.2.2. QQ plot
3.2.3. Multiple testing correction
3.2.4. GWAS
3.2.5. Statistical test
3.3. Functional analysis of large data sets
3.3.1. Introduction
3.3.2. Overrepresentation analysis (ORA)
3.3.3. Gene set enrichment analysis (GSEA)

Chapter 4: Unsupervised clustering
4.1. Introduction
4.1.1. Clustering
4.1.2. Similarity
4.1.3. Dendrograms (slide 16-23)
4.1.4. Algorithms
4.2. Hierarchical clustering
4.2.1. Single linkage
4.2.2. Complete linkage
4.2.3. Group average linkage
4.2.4. Ward's linkage
4.3. Partitional clustering
4.3.1. How to tell the right number of clusters?
4.3.2. Objective function: squared error (slide 57)
4.3.3. K-means steps

Chapter 5: Principal component analysis
5.1. Multivariate data
5.1.1. Basic variable statistics
5.1.2. Data transformation
5.1.3. Normalization
5.1.4. Comparison between variables
5.2. Data projection
5.2.1. What is a projection?
5.2.2. Why use projections?
5.3. Principal component analysis
5.3.1. How it works
5.3.2. Output
5.3.3. Scree plot
5.3.4. Usage
5.3.5. PCA simplifies data
5.3.6. Example: possum dataset
5.3.7. Example: nutrition dataset
5.3.8. Example: influenza
5.3.9. Example: enterotypes
5.4. t-SNE
5.4.1. Perplexity
5.4.2. t-SNE for single cell RNAseq
5.5. UMAP
5.6. UMAP vs t-SNE

Chapter 6: Linear models
6.1. Simple linear regression
6.2. Multiple linear regression
6.3. Supervised learning
6.4. Linear models
6.4.1. One way ANOVA
6.4.2. ANCOVA
6.4.3. Mixed model
6.4.4. Akaike information criterion
6.4.5. Elastic net
6.4.6. Regression example
6.5. Generalised linear models
6.5.1. Linear-response model
6.5.2. Generalised linear mixed model

Chapter 7: Molecular data analysis
7.1. Introduction
7.1.1. Quantitative profiles
7.1.2. The q-omics data matrix
7.1.3. Workflow quantitative profiles
7.2. Transcriptomics
7.2.1. Expression value variability
7.2.2. RNAseq introduction
7.3. Differential analysis
7.3.1. Two sample t-test
7.3.2. Linear model
7.3.3. Differential analysis
7.4. Expression downstream analysis
7.5. Proteomics
7.5.1. Relative vs absolute
7.5.2. Dynamic range of proteins is a challenge
7.5.3. Three ‘schools’
7.5.4. Protein quantity variability
7.5.5. Quantitative LC/MS processing
7.6. Protein identification
7.6.1. Tandem MS
7.6.2. Quantitative proteomics
7.6.3. Feature aggregation
7.6.4. Example CPTAC
7.6.5. Example: missing values
7.6.6. Example: PCA
7.6.7. Example: differential analysis
7.7. Metagenomics
7.7.1. Introduction
7.7.2. Binning
7.8. Flow cytometry
7.8.1. Multiparametric flow cytometry




Chapter 1: Introduction
People have always studied the human body from a multidisciplinary perspective. One of these perspectives
looks at all the molecules that make up our body. Our knowledge can now be based on
larger amounts of data than ever before: an avalanche of data is pushing biomedical science into big data.

Artificial intelligence is also used in data analysis: a computer is used to dig out patterns and
knowledge.
• Large-scale data and AI brought a new data-intensive research paradigm


1.1. Big data
= Data for which conventional computer techniques are no longer sufficient due to size,
complexity, ... Some increases in data size require exponentially more computing capacity. It is a
disruptive trend in computer science. It makes us do things that could not be done before.

= the moment that you can no longer open your data set in Excel and do anything with it
=> so you need different techniques to study these data sets (one of these techniques is AI)

Big data has 4 important characteristics: volume, velocity, variety and veracity.


1.1.1 Volume
= Size of the data set that we are working with
• It has become extremely cheap to generate these amounts of data
o The cost of sequencing a human genome keeps decreasing, a trend often compared to Moore’s law


1.1.2 Velocity
= Speed at which we are generating new data
= data collected at enormous speed
• Smartphone: example of continuously generating and collecting data
• Data management gap: we are generating far more data, but we don’t have enough
people to analyse it
• There is a need for new, effective, high-tech data transfer approaches
o E.g. putting data on a hard drive to transfer it
• Dynamic molecular profiles such as transcriptome profiling, sequencing the immune system,
single cell sequencing


1.1.3 Variety
= Different types of data that are added to our data sets
= Heterogeneous and lots of unstructured data
• Meaning there is not a single type of data
• The data is not simply organized in a matrix. Typical examples are text and images: you
need context to be able to make sense of them.
• The huge diversity in data types includes DNA sequences, protein structures, gene regulation,
interactions, morphology and metabolism.


1.1.4 Veracity
= How reliable is our data? How precise were our measurements? ...
• This is a problem in the life sciences: how certain we are differs a lot from one
data point to another.
• Many potential biases, uncertainties, artefacts, ... are possible.
• Missing data can also occur and this can be a problem for data mining
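
A first check for missing data could look like the minimal sketch below. The samples, the feature names (gene_a, gene_b, gene_c) and the helper missing_fraction are hypothetical and only serve as an illustration; a missing measurement is assumed to be stored as None.

```python
# Hypothetical toy data: three samples, three features.
# Assumption: a missing measurement is stored as None.
samples = [
    {"gene_a": 2.1, "gene_b": None, "gene_c": 0.4},
    {"gene_a": None, "gene_b": None, "gene_c": 1.3},
    {"gene_a": 1.7, "gene_b": 3.2, "gene_c": 0.9},
]


def missing_fraction(records):
    """Return, per feature, the fraction of records in which the value is missing."""
    features = records[0].keys()
    n = len(records)
    return {f: sum(r[f] is None for r in records) / n for f in features}


fractions = missing_fraction(samples)
for feature, fraction in fractions.items():
    print(f"{feature}: {fraction:.0%} missing")
```

A feature with a high missing fraction (here gene_b) may need special handling before any data mining technique is applied.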



1.2 What is data?
Large-scale data and AI brought a new data-intensive research
paradigm => Data science.
• A concept to “unify statistics, data analysis, informatics, and
their related methods” in order to “understand and analyse
actual phenomena” with data.

= Data is a collection of data objects (= samples) and their features
• A feature is a property or characteristic of an object
o Example: eye colour of a person, temperature, …
o A feature is also known as an attribute, variable, field or
characteristic
• A collection of features describes an object
o An object is also known as a record, point, case, sample, entity or instance
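
The relation between objects and features can be sketched as a small data matrix. This is only an illustration: the persons, the feature names and the helper feature_column are made up.

```python
# Hypothetical toy data set: every object (sample) is described by the same features.
features = ["eye_colour", "temperature_c"]
objects = {
    "person_1": {"eye_colour": "brown", "temperature_c": 36.8},
    "person_2": {"eye_colour": "blue", "temperature_c": 37.4},
    "person_3": {"eye_colour": "green", "temperature_c": 36.5},
}

# Data matrix: one row per object, one column per feature.
data_matrix = [[record[f] for f in features] for record in objects.values()]


def feature_column(name):
    """Return the values of one feature across all objects."""
    return [record[name] for record in objects.values()]


print(data_matrix)
print(feature_column("eye_colour"))
```

Each row is one object and each column is one feature, so a single cell holds one feature value.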




1.2.1 Feature values
= Numbers or symbols assigned to a feature
• Distinction between feature and feature value
o The same feature can be mapped to different feature values
§ Example: height can be measured in feet or meters
o Different features can be mapped to the same set of values
§ Example: attribute values for ID and age are integers
o However, the properties of feature values can still be different
§ Example: ID has no limit but age has a max and min value
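
The distinction between a feature and its values can be illustrated with the sketch below. The IDs, the ages and the age limits (0 to 120) are assumptions chosen for the example, not values from the course.

```python
FEET_PER_METER = 3.28084  # conversion factor: 1 m is about 3.28084 ft


def meters_to_feet(height_m):
    """Map the same feature (height) to a different feature value."""
    return height_m * FEET_PER_METER


height_m = 1.80
height_ft = meters_to_feet(height_m)  # same person, same feature, other value

# ID and age are both stored as integers, but their properties differ.
person_ids = [1001, 1002, 1003]  # hypothetical IDs: labels without natural limits
ages = [23, 41, 35]              # hypothetical ages: bounded and measurable


def is_plausible_age(age, min_age=0, max_age=120):
    """Age has a minimum and a maximum; the limits used here are assumptions."""
    return min_age <= age <= max_age


mean_age = sum(ages) / len(ages)  # a mean age is meaningful
# A mean of person_ids could be computed as well, but it would carry no meaning.

print(f"{height_m} m = {height_ft:.2f} ft, mean age = {mean_age}")
```

Both lists contain integers, yet only the ages support a range check and a meaningful average.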




Seller: WillemsenAmber, Universiteit Antwerpen