Summary - Advanced Data Analysis

Type: Summary
Pages: 113
Uploaded on: 19 October 2022
Written in: 2020/2021

Topics: Introduction, Processing principles, Data mining, Principal component analysis, Supervised learning, Regression, Machine learning methods



Content preview

Advanced data analysis
Contents
Chapter 1 - Introduction
A bit of context
Introduction
Characteristics of big data
But what is data
Objects and attributes
Attribute types
Properties of attributes
Discrete vs continuous attributes
Dataset types
Data mining
General
Is it data mining?
Data mining and statistics
Data mining challenges
Tasks
General
Supervised
Unsupervised
Data mining applications
Overview
Where are we with data mining now
Chapter 2 - Processing principles
Introduction
Unstructured data
Common data processing steps
Overview
Feature extraction
Attribute transformation
Discretization
Aggregation
Noise removal
Identifying outliers
Sampling
Handling duplicate data
Handling missing values
Dimensionality reduction
Processing steps for specific data types
Image data
Survey data
Sequence data
Text data
Omics data
Chapter 3 - Data mining - Unsupervised clustering
Unsupervised vs supervised
Introduction
Clustering
Similarity
Dendrograms
Hierarchical clustering vs partitional clustering
Hierarchical clustering
General
Bottom-up
How do you calculate distance between already existing clusters?
Single linkage = nearest neighbour
Complete linkage = furthest neighbour
Group average
Ward’s method
Comparison
Partitional clustering
General
How many clusters?
How to tell the right number of clusters?
Objective function: squared error
k-means steps
Importance of choosing initial centroids
k-means limitations
k-means: conclusion
Chapter 4 - Principal component analysis
Introduction
Principal component analysis
Multivariate data
Basic variable statistics
Data transformation
Comparison between variables
Still too many variables
Data projection
PCA - Theory
Introduction
How PCA works
PCA output
PCA summary
PCA usage
How many PCs are enough to cover a data set?
PCA - examples
Possum dataset
Nutrition dataset
B-cell receptor sequencing
Metagenomics data
t-SNE
What is t-SNE?
How does t-SNE work?
PCA vs t-SNE
Perplexity
t-SNE for single cell RNAseq
Chapter 5 - Supervised learning
Classification problem
Cat or dog problem
Pigeon problem
Grasshopper problem
Regression vs classification
Linear classifier
Grasshopper example
Decision boundary
Examples
Iris dataset
Support vector machine
Decision value
Classifier overview
Estimating the performance of the classifier
Predictive accuracy
Class labels
Confusion matrix
Type I error vs type II error
Values that can be acquired from confusion matrix
Thresholds and accuracy
ROC curve
PR curve - precision recall curve
ROC vs PR curves
Nearest Neighbour Classifier
Chapter 6 - Regression
Introduction
Introductory example
Classification vs regression
Simple linear regression
General
Multiple linear regression
General
Best fit
Objective function
Evaluation
Non-linear regression
Logistic regression
Overfitting
How do we estimate the capacity of our model to overfit?
K-fold cross validation - how do we estimate the accuracy of our model?
Factors to consider when building a model
Speed and scalability
Interpretability

Seller: lizaburdz, Universiteit Antwerpen