Summary

Summary theory data mining

Pages: 75
Uploaded on: 29-05-2025
Written in: 2024/2025

This is a summary of all the theory handouts of the data mining course. It contains the information on the slides and my own notes. The lessons covered are: introduction, data processing, univariate techniques, unsupervised clustering, data projection, linear models and processing omics. A table of contents at the beginning helps to keep a clear overview during the open book exam.

Content preview

Table of Contents

Chapter 1: Introduction................................................................................................................. 5
1.1. Big data ............................................................................................................................... 5
1.1.1 Volume ......................................................................................................................... 5
1.1.2 Velocity ......................................................................................................................... 5
1.1.3 Variety .......................................................................................................................... 5
1.1.4 Veracity......................................................................................................................... 6
1.2 What is data? ....................................................................................................................... 6
1.2.1 Feature values............................................................................................................... 6
1.2.2 Feature types ................................................................................................................ 7
1.2.3 Properties of features .................................................................................................... 7
1.2.4 Discrete vs. continuous ................................................................................................. 7
1.2.5 Dataset types ................................................................................................................ 8
1.2. Data mining ......................................................................................................................... 9
1.2.6 Is it data mining? ........................................................................................................... 9
1.2.7 Data mining is related to statistics .................................................................................. 9
1.2.8 Data mining challenges ............................................................................................... 10
1.2.9 Garbage in = garbage out ............................................................................................. 10
1.3 Tasks ..................................................................................................................................11
1.3.1 Two classes of techniques ........................................................................................... 11
1.3.2 Overview molecular applications ................................................................................. 13

Chapter 2: Processing principles .................................................................................................. 14
2.1. Structured data ...................................................................................................................14
2.2. Unstructured data ...............................................................................................................14
2.3. Common data processing steps ...........................................................................................14
2.3.1. Feature extraction ....................................................................................................... 14
2.3.2. Attribute/feature transformation .................................................................................. 15
2.3.3. Discretization .............................................................................................................. 16
2.3.4. Aggregation ................................................................................................................. 16
2.3.5. Noise removal ............................................................................................................. 17
2.3.6. Outlier removal ........................................................................................................... 17
2.3.7. Sampling .................................................................................................................... 17
2.3.8. Handling duplicate data............................................................................................... 18
2.3.9. Handling missing values .............................................................................................. 18
2.3.10. Dimensionality reduction ............................................................................................. 19
2.4. Processing steps for specific data types ...............................................................................20
2.4.1. Images ........................................................................................................................ 20
2.4.2. Surveys ....................................................................................................................... 20
2.4.3. Sequences .................................................................................................................. 21
2.4.4. Structure data ............................................................................................................. 21
2.4.5. Text data ..................................................................................................................... 22

Chapter 3: Univariate techniques ................................................................................................. 23
3.1. Differential analysis .............................................................................................................23
3.1.1. Hypothesis testing....................................................................................................... 23
3.1.2. t-distribution ............................................................................................................... 24
3.1.3. Central limit theorem................................................................................................... 24
3.1.4. Negative binomial ....................................................................................................... 25
3.2. Multivariate data .................................................................................................................25
3.2.1. What is the distribution of p-values .............................................................................. 25
3.2.2. QQ plot ....................................................................................................................... 25
3.2.3. Multiple testing correction ........................................................................................... 26
3.2.4. GWAS ......................................................................................................................... 28
3.2.5. Statistical test ............................................................................................................. 28
3.3. Functional analysis of large data sets ...................................................................................29
3.3.1. Introduction ................................................................................................................ 29
3.3.2. Overrepresentation analysis (ORA) ............................................................................... 29
3.3.3. Gene set enrichment analysis (GSEA) ........................................................................... 30

Chapter 4: Unsupervised clustering.............................................................................................. 32
4.1. Introduction ........................................................................................................................32
4.1.1. Clustering ................................................................................................................... 32
4.1.2. Similarity .................................................................................................................... 32
4.1.3. Dendrograms (slide 16-23) ........................................................................................... 34
4.1.4. Algorithms .................................................................................................................. 34
4.2. Hierarchical clustering.........................................................................................................35
4.2.1. Single linkage .............................................................................................................. 36
4.2.2. Complete linkage ........................................................................................................ 36
4.2.3. Group average linkage ................................................................................................. 36
4.2.4. Wards linkage.............................................................................................................. 37
4.3. Partitional clustering............................................................................................................37
4.3.1. How to tell right number of clusters? ............................................................................ 37
4.3.2. Objective function: squared error (slide 57) .................................................................. 38
4.3.3. K-means steps ............................................................................................................ 38

Chapter 5: Principal component analysis ..................................................................................... 41
5.1. Multivariate data .................................................................................................................41
5.1.1. Basic variable statistics ............................................................................................... 41
5.1.2. Data transformation .................................................................................................... 42
5.1.3. Normalization ............................................................................................................. 42
5.1.4. Comparison between variables .................................................................................... 42
5.2. Data projection ...................................................................................................................44
5.2.1. What is a projection? ................................................................................................... 44
5.2.2. Why use projections? .................................................................................................. 44
5.3. Principal component analysis ..............................................................................................45
5.3.1. How it works ............................................................................................................... 45
5.3.2. Output ........................................................................................................................ 46
5.3.3. Scree plot ................................................................................................................... 47
5.3.4. Usage ......................................................................................................................... 48
5.3.5. PCA simplifies data ..................................................................................................... 48
5.3.6. Example: possum dataset............................................................................................ 48
5.3.7. Example: nutrition dataset ........................................................................................... 49
5.3.8. Example: influenza ...................................................................................................... 50
5.3.9. Example: enterotypes .................................................................................................. 50
5.4. t-SNE .................................................................................................................................51
5.4.1. Perplexity .................................................................................................................... 52
5.4.2. t-SNE for single cell RNAseq ........................................................................................ 53
5.5. UMAP .................................................................................................................................53
5.6. UMAP vs t-SNE ....................................................................................................................53

Chapter 6: Linear models ............................................................................................................. 54
6.1. Simple linear regression.......................................................................................................54
6.2. Multiple linear regression .....................................................................................................54
6.3. Supervised learning .............................................................................................................55
6.4. Linear models .....................................................................................................................56
6.4.1. One way ANOVA .......................................................................................................... 56
6.4.2. ANCOVA ..................................................................................................................... 57
6.4.3. Mixed model ............................................................................................................... 58
6.4.4. Akaike information criterion ......................................................................................... 59
6.4.5. Elastic net ................................................................................................................... 59
6.4.6. Regression example .................................................................................................... 60
6.5. Generalised linear models ...................................................................................................60
6.5.1. Linear-response model ................................................................................................ 60
6.5.2. Generalised linear mixed model ................................................................................... 63

Chapter 7: Molecular data analysis .............................................................................................. 64
7.1. Introduction ........................................................................................................................64
7.1.1. Quantitative profiles .................................................................................................... 64
7.1.2. The q-omics data matrix .............................................................................................. 64
7.1.3. Workflow quantitative profiles ...................................................................................... 64
7.2. Transcriptomics ..................................................................................................................65
7.2.1. Expression value variability .......................................................................................... 65
7.2.2. RNAseq introduction ................................................................................................... 65
7.3. Differential analysis .............................................................................................................67
7.3.1. Two sample t-test ........................................................................................................ 67
7.3.2. Linear model ............................................................................................................... 67
7.3.3. Differential analysis ..................................................................................................... 70
7.4. Expression downstream analysis .........................................................................................70
7.5. Proteomics .........................................................................................................................70
7.5.1. Relative vs absolute..................................................................................................... 70
7.5.2. Dynamic range of proteins is a challenge ...................................................................... 71
7.5.3. Three ‘schools’ ............................................................................................................ 71
7.5.4. Protein quantity variability............................................................................................ 71
7.5.5. Quantitative LC/MS processing .................................................................................... 71
7.6. Protein identification ...........................................................................................................72
7.6.1. Tandem MS ................................................................................................................. 72
7.6.2. Quantitative proteomics .............................................................................................. 72
7.6.3. Feature aggregation ..................................................................................................... 73
7.6.4. Example CPTAC........................................................................................................... 73
7.6.5. Example: missing values .............................................................................................. 73
7.6.6. Example: PCA ............................................................................................................. 73
7.6.7. Example: differential analysis ...................................................................................... 74
7.7. Metagenomics ....................................................................................................................74
7.7.1. Introduction ................................................................................................................ 74
7.7.2. Binning ....................................................................................................................... 74
7.8. Flow cytometry ...................................................................................................................75
7.8.1. Multiparametric flow cytometry ................................................................................... 75




Chapter 1: Introduction
People have always studied the human body from a multidisciplinary perspective. One of these
perspectives looks at all the molecules that make up the body. Our knowledge can now be based on
larger amounts of data than ever before: an avalanche of data is pushing biomedical science into big data.

Artificial intelligence is also used in data analysis: the computer is used to dig patterns and
knowledge out of the data.
• Large-scale data and AI brought a new data-intensive research paradigm


1.1. Big data
= Data for which conventional computing techniques are no longer sufficient because of its size,
complexity, ... Some increases in data size require exponentially more computing capacity. It is a
disruptive trend in computer science: it lets us do things that could not be done before.

= The moment you can no longer open your data set in Excel and work with it
=> You need different techniques to study these data sets (one of these techniques is AI)

Big data has 4 important characteristics: volume, velocity, variety and veracity.


1.1.1 Volume
= Size of the data set that we are working with
• It has become extremely cheap to generate this amount of data
o The cost of sequencing a human genome keeps decreasing; the drop is often compared with Moore’s law and has outpaced it since about 2008


1.1.2 Velocity
= Speed at which we are generating new data
= Data is collected at enormous speed
• Smartphone: an example of continuously generating and collecting data
• Data management gap: we are generating far more data, but we do not have enough people to
analyse it
• There is a need for new, effective, high-tech data transfer approaches
o E.g. putting the data on a hard drive to transfer it
• Examples: dynamic molecular profiles such as transcriptome profiling, sequencing the immune
system and single cell sequencing


1.1.3 Variety
= Different types of data that are added to our data sets
= Heterogeneous and lots of unstructured data
• Meaning there is not a single type of data
• The data is not organised in a simple matrix. Typical examples are text and images: you
need context to be able to make sense of them.
• The huge diversity in data types includes DNA sequences, protein structures, gene regulation,
interactions, morphology and metabolism.

1.1.4 Veracity
= How reliable is our data? How precise were our measurements? ...
• This is a problem in the life sciences: how certain we are of a data point varies a lot from
one data point to the next.
• Many potential biases, uncertainties, artefacts, ... are possible.
• Missing data can also occur, and this can be a problem for data mining
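
As a rough illustration of the missing-data point above, the sketch below counts the fraction of missing values per feature in a tiny made-up data set. The feature names (gene_a, gene_b, gene_c), the values, and the use of None as the missing-value marker are all assumptions for illustration, not part of the course material.

```python
# Hypothetical measurements: one dict per sample, None marks a missing value.
measurements = [
    {"gene_a": 5.1, "gene_b": None, "gene_c": 2.3},
    {"gene_a": None, "gene_b": None, "gene_c": 2.9},
    {"gene_a": 4.8, "gene_b": 7.0, "gene_c": 3.1},
]

# Fraction of samples in which each feature is missing.
missing_per_feature = {}
for feature in measurements[0]:
    n_missing = sum(1 for row in measurements if row[feature] is None)
    missing_per_feature[feature] = n_missing / len(measurements)

for feature, fraction in missing_per_feature.items():
    print(f"{feature}: {fraction:.0%} missing")
```

A feature that is missing in most samples (gene_b here) is one that would need attention before any mining step.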



1.2 What is data?
Large-scale data and AI brought a new data-intensive research paradigm => data science.
• Data science aims to "unify statistics, data analysis, informatics, and their related methods"
in order to "understand and analyse actual phenomena" with data.

= Data is a collection of data objects (= samples) and their features
• A feature is a property or characteristic of an object
o Example: eye colour of a person, temperature, …
o A feature is also known as an attribute, variable, field or characteristic
• A collection of features describes an object
o An object is also known as a record, point, case, sample, entity or instance
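
A minimal sketch of this definition, assuming a made-up data set of three people: each dict is one data object (sample), each key is a feature and each value is a feature value. The names and numbers are hypothetical and only serve to show the object/feature layout.

```python
# Hypothetical toy data set: one dict per data object (sample).
samples = [
    {"id": 1, "eye_colour": "brown", "temperature_c": 36.8},
    {"id": 2, "eye_colour": "blue", "temperature_c": 37.4},
    {"id": 3, "eye_colour": "green", "temperature_c": 38.1},
]

# The features are the keys shared by all objects.
features = list(samples[0].keys())

# The same data as a matrix: one row per object, one column per feature.
matrix = [[sample[f] for f in features] for sample in samples]

n_objects = len(matrix)
n_features = len(features)

print(features)
for row in matrix:
    print(row)
```

Laying the data out as rows (objects) by columns (features) gives the matrix form that most data mining techniques expect.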




1.2.1 Feature values
= Numbers or symbols assigned to a feature
• Distinction between a feature and its feature values
o The same feature can be mapped to different feature values
§ Example: height can be measured in feet or meters
o Different features can be mapped to the same set of values
§ Example: attribute values for ID and age are integers
o However, the properties of the feature values can still be different
§ Example: ID has no limit but age has a max and min value
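
The three examples above can be sketched in a few lines. The concrete numbers (a height of 1.80 m, ID 1042, age 27) and the age bounds of 0 to 130 are assumptions chosen for illustration only.

```python
FEET_PER_METRE = 3.28084  # conversion factor between the two units

# Same feature (height), two different feature values depending on the unit.
height_m = 1.80
height_ft = height_m * FEET_PER_METRE

# Different features (ID and age) mapped to the same kind of value: integers.
person_id = 1042
age = 27
same_value_type = type(person_id) is type(age)


def is_plausible_age(value, minimum=0, maximum=130):
    """Age has a minimum and a maximum (bounds assumed here); an ID has no such limit."""
    return minimum <= value <= maximum


print(f"height: {height_m} m = {height_ft:.2f} ft")
print("ID and age share a value type:", same_value_type)
print("27 is a plausible age:", is_plausible_age(age))
print("1042 is a plausible age:", is_plausible_age(person_id))
```

Although ID and age are both stored as integers, only age can be checked against a range, which is why the properties of a feature matter and not just its value type.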



