Lecture notes

Machine Learning - Class Notes

Uploaded on: 13-06-2023

Written in: 2022/2023

This document includes: a review of data mining, overfitting/underfitting, bias-variance tradeoff, discrete random sampling, clustering, hierarchical methods, divisive method, dendrogram, Euclidean distance, k-means clustering, KNN, naive Bayes, Bayes' Theorem, model assessment, resampling, leave-one-out cross-validation approach, k-fold cross-validation, stepwise selection, ridge regression, LASSO, regularized regression models in R, linear discriminant analysis (LDA), QDA, SVM, logistic regression, Bayes classifier, decision trees, and reinforcement learning.

DNSC 4280 Machine Learning Class Notes
8/29: Introduction
8/31: Review - Data Mining
● Supervised learning: explain the relationship between predictor and target
● Predictor / explanatory variable / covariate = same thing
● Model Fitting
  ○ Training/validation
    ■ Build a model that optimizes performance on the training data set (overfitting risk)
    ■ Try to have the best fit of the training data
    ■ Prevent under/overfitting
    ■ Use validation to check which model performs the best, then deploy the best model on the test data set
    ■ Use training to train different models
    ■ No overlapping info between training and validation data
  ○ Underfitting/overfitting
  ○ Trade-off
    ■ Predictive accuracy vs interpretability
    ■ Parsimony vs black box
● Assess performance on validation (hold-out) data
● Problem of overfitting
  ○ Fit may look good but it doesn’t perform well on other datasets
● Training: 80%, Validation: 20%
  ○ Validation: test different models
    ■ Compute MSE for each model to compare performance
    ■ Choose best model
  ○ Test data: summarize final performance of the chosen model
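The 80/20 workflow above can be sketched in Python as follows. This is only an illustration: the synthetic data, the two candidate models (a constant predictor and a one-variable least-squares line), and every function name are assumptions, not something from the class.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def fit_constant(xs, ys):
    """Candidate 1: always predict the training mean (very inflexible)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Synthetic data (assumed for illustration): y = 2x + 1 + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# 80% training, 20% validation, with no overlap between the two
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, val_idx = indices[:80], indices[80:]
x_train = [xs[i] for i in train_idx]
y_train = [ys[i] for i in train_idx]
x_val = [xs[i] for i in val_idx]
y_val = [ys[i] for i in val_idx]

# Train every candidate on the training set only
models = {
    "constant": fit_constant(x_train, y_train),
    "linear": fit_linear(x_train, y_train),
}

# Compare candidates by validation MSE and keep the best one
val_mse = {name: mse(y_val, [model(x) for x in x_val])
           for name, model in models.items()}
best = min(val_mse, key=val_mse.get)
print(val_mse, "best:", best)
```

The chosen model would then be evaluated once on a separate test set to report its final performance.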
● Model Complexity
  ○ Overfitting
    ■ The model is too flexible: it bends around the individual points of the data
    ■ Those points only represent the training dataset, not the validation or test datasets
    ■ Improve performance on the testing dataset, not just training
    ■ Model is too complicated
    ■ Variability of the model is large: testing MSE increases while training MSE decreases (focus on testing error)
  ○ Underfitting - not flexible enough to capture relationships
    ■ MSE would be very large for both testing and training
● Bias Variance Tradeoff
  ○ Simple model - bias large, variance small
  ○ Expected testing MSE is the sum of squared bias and variance (plus irreducible error)
  ○ If you use a complicated model you will have little bias, but predictions will be too uncertain for future data (high variance)
  ○ We want flexibility so that bias and variance are properly controlled
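For reference, the standard decomposition behind the tradeoff can be written out as below. The notation is an assumption, not taken from the lecture: a test point x_0 with response y_0 = f(x_0) + ε, a fitted model \hat{f}, and noise variance σ².

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \operatorname{Var}\left(\hat{f}(x_0)\right)
  + \sigma^2,
\qquad
\operatorname{Bias}\left(\hat{f}(x_0)\right)
  = \mathbb{E}\left[\hat{f}(x_0)\right] - f(x_0)
```

The σ² term is the irreducible error: no choice of model removes it, so only the first two terms can be traded against each other.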
Practice from Assignment 1 (I realized these are available on BB)
Exercise 1: Sequences
x3 = c(1, 0, -1, -2)
1:(-2)  # same sequence written with the colon operator
x4 = c("Hello", " ", "World", "!")
x4 = c(x4, paste(x4, collapse = ""))  # append the joined string
x4
x5 = c(TRUE, FALSE, NA, FALSE); x5
x6 <- c(rep(1:2, 2), rep(1:2, each = 2)); x6
Exercise 2: Matrix
X <- rbind(1:4, x3, matrix(x2, 2, 4, byrow = TRUE))  # x2 is assumed to be defined earlier in the exercise
X
Lists: list()
-
Extract list info - use double brackets [[ ]], or a $
9/7: HW 1 Overview
● Girl what is going on, I have no idea lol. All I know is that Pedro said that the homework is rough
● Loops <3
  ○ Write a function with arguments (f, lower, upper, tol = 1e-6) to find the root of a univariate function f on the interval (lower, upper)
    ■ Searching for a root between 1 and 2
  ○ Use bisection with precision tolerance tol (default 10^-6); it returns a list consisting of root, f.root (f evaluated at root), and iter (# of iterations)
    ■ iter: how many iterations it takes to find the root
● Check whether either of the two endpoints is a root or not..?
  ○ Check whether the midpoint is a root, i.e. f(midpoint) = 0
    ■ f(x) = x^3 - x - 1
  ○ The root is the point between the two endpoints where f(x) = 0
  ○ Check whether f((a+b)/2) > 0 or < 0 to decide which half of the interval to keep
  ○ Function value at the root = f(x) (lol)
    ■ Root = x
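A minimal Python sketch of the bisection idea described above (the homework itself is in R). The function name `bisection` and the dictionary keys are assumptions for illustration; they only mirror the root / f.root / iter list from the assignment text.

```python
def bisection(f, lower, upper, tol=1e-6):
    """Find a root of f on [lower, upper] by repeatedly halving the interval.

    Assumes lower < upper and that f(lower) and f(upper) have opposite signs.
    """
    f_lower, f_upper = f(lower), f(upper)
    # An endpoint may already be a root
    if f_lower == 0:
        return {"root": lower, "f_root": f_lower, "iter": 0}
    if f_upper == 0:
        return {"root": upper, "f_root": f_upper, "iter": 0}
    if f_lower * f_upper > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")

    iterations = 0
    while upper - lower > tol:
        iterations += 1
        mid = (lower + upper) / 2
        f_mid = f(mid)
        if f_mid == 0:
            # The midpoint is exactly a root
            return {"root": mid, "f_root": f_mid, "iter": iterations}
        if f_lower * f_mid < 0:
            # Sign change in the left half: keep [lower, mid]
            upper = mid
        else:
            # Sign change in the right half: keep [mid, upper]
            lower, f_lower = mid, f_mid

    root = (lower + upper) / 2
    return {"root": root, "f_root": f(root), "iter": iterations}

# Example from the notes: f(x) = x^3 - x - 1, searching between 1 and 2
result = bisection(lambda x: x ** 3 - x - 1, 1, 2)
print(result)  # root is approximately 1.3247, found in 20 iterations
```

Each iteration halves the interval, so starting from width 1 it takes 20 halvings to get below a width of 10^-6.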
● Discrete Random Sampling
  ○ Stratified sampling: split the data by level and sample each level identically
  ○ Each level appears in the sample in the same proportion as in the entire data set
  ○ Use the sample to train a model
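Stratified sampling as described above could look like this in Python. It is a sketch under assumptions: the helper name `stratified_sample`, the 60/40 label mix, and the 80% sampling fraction are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices, drawing the same fraction from every level."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for index, label in enumerate(labels):
        by_level[label].append(index)

    sample = []
    for indices in by_level.values():
        n_take = round(len(indices) * fraction)
        # Sample without replacement inside each level
        sample.extend(rng.sample(indices, n_take))
    return sorted(sample)

# Hypothetical data set: 60% level "A", 40% level "B"
labels = ["A"] * 60 + ["B"] * 40
train_idx = stratified_sample(labels, fraction=0.8)
train_labels = [labels[i] for i in train_idx]

# The sample keeps the 60/40 mix of the full data set
print(train_labels.count("A"), train_labels.count("B"))  # 48 32
```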
● Probability density function
● Optimization problems
  ○ Finding the maximum of a likelihood is typically written in a particular form
  ○ f(x) = x^2 - 2x - 1
    ■ Minimize f(x)
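The minimization example can be worked by hand. This derivation is added for illustration and uses only the function given in the notes:

```latex
f(x) = x^2 - 2x - 1 = (x - 1)^2 - 2
\\
f'(x) = 2x - 2 = 0 \;\Longrightarrow\; x = 1
\\
f''(x) = 2 > 0 \;\Longrightarrow\; x = 1 \text{ is a minimum, with } f(1) = -2
```

Maximizing a function g is the same as minimizing -g, which is one reason likelihood problems are usually rewritten as minimization problems.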
9/12: Clustering
● Clustering is an example of an undirected data mining technique
  ○ It is used to segment the data, or to find islands of similarity within the data
● Can be useful for marketing segmentation
● Classification of species
● Portfolio management
  ○ You want to know which stocks are similar and which aren't
● Clustering techniques
  ○ K-means clustering
  ○ Agglomerative clustering
  ○ Decision trees
  ○ Neural nets
● Decide how many clusters we want to have beforehand, and decide the criteria for which clusters best fit the data
● Calculate the variance of each cluster, then find the overall variance within clusters
● Want variance to be small to find evidence of similarity
● Want total variance within clusters to be small
● Find two clusters such that the sum of the two variances is small
● Total variance within clusters is small
  ○ As you increase the number of clusters, the total variance decreases (and eventually stabilizes)
  ○ With more and more clusters, you need to explain the underlying common pattern in each cluster, which becomes hard to explain/interpret
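A small Python sketch of the "total variance within clusters" criterion. It uses the within-cluster sum of squared deviations rather than a variance divided by n, and the one-dimensional data and the three hand-picked partitions are assumptions for illustration.

```python
def within_cluster_ss(clusters):
    """Total sum of squared deviations of each point from its cluster mean."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((x - center) ** 2 for x in cluster)
    return total

# Hypothetical 1-D data, partitioned by hand into k = 1, 2, 3 clusters
partitions = {
    1: [[1, 2, 3, 10, 11, 12, 20, 21, 22]],
    2: [[1, 2, 3], [10, 11, 12, 20, 21, 22]],
    3: [[1, 2, 3], [10, 11, 12], [20, 21, 22]],
}

wss = {k: within_cluster_ss(clusters) for k, clusters in partitions.items()}
for k, value in wss.items():
    print(k, round(value, 2))  # 1 548.0, 2 156.0, 3 6.0
```

The total drops sharply up to k = 3 and has little room left to fall after that, which is the "stabilizes" behaviour used to pick the number of clusters.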
● Hierarchical Methods - most popular method
  ○ Agglomerative Methods
  ○ Bottom-up method
  ○ Begin with N clusters (N = total number of observations), then keep merging clusters based on the distance between all clusters
    ■ Therefore reducing the number of clusters
  ○ Do this until one cluster is left
● Divisive Method
  ○ Top-down method
  ○ Start with one all-inclusive cluster, then repeatedly divide the datapoints into smaller clusters, until there is a cluster for each datapoint
● Dendrogram - calculate pairwise distances between clusters
  ○ Y axis is the distance between clusters
  ○ Want to find clusters to merge, based on their distance
  ○ Example: clusters (21, 12) and (10, 13)
  ○ Calculate the distance between the two
    ■ Distance from 12 to 10
      ● 12 to 13
      ● 21 to 10
      ● 21 to 13
    ■ D1 as a measure to
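To make the pairwise calculation concrete, here is a Python sketch for the two clusters above. The notes give only the labels 21, 12, 10 and 13, so the coordinates below are hypothetical, and the three summary rules (single, complete, average linkage) are common choices rather than necessarily the measure used in class.

```python
import math

# Hypothetical coordinates for the four labelled observations
coords = {21: (0.0, 0.0), 12: (0.0, 3.0), 10: (4.0, 0.0), 13: (4.0, 3.0)}
cluster_a = [21, 12]
cluster_b = [10, 13]

# All pairwise distances between a point in A and a point in B
pairwise = {
    (a, b): math.dist(coords[a], coords[b])
    for a in cluster_a
    for b in cluster_b
}

single = min(pairwise.values())                    # closest pair
complete = max(pairwise.values())                  # farthest pair
average = sum(pairwise.values()) / len(pairwise)   # mean over all pairs

print(pairwise)
print(single, complete, average)  # 4.0 5.0 4.5
```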
  ○ Euclidean distance
    ■ Draw points on the XY plane
    ■ If all variables are categorical you cannot use Euclidean distance to calculate the distance
      ● Instead, count the differences variable by variable
      ● Variable A has one difference, variable B has 0 differences: 1 + 0 = 1, so the distance is 1
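Both distance ideas can be sketched in a few lines of Python. The function names and the example records are assumptions for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mismatch_distance(p, q):
    """Number of categorical variables on which two records differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

# Numeric variables: points on the XY plane
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Categorical variables: the first differs (1), the second matches (0)
print(mismatch_distance(("red", "small"), ("blue", "small")))  # 1
```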
  ○ Scaling
    ■ After scaling, each variable's contribution to the distance function no longer depends on the size of the units it is measured in
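One common way to scale is z-score standardization, sketched below in Python. The choice of z-scores, the income and age columns, and the helper name `standardize` are all assumptions for illustration.

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]

# Hypothetical data: income in dollars, age in years
incomes = [30000.0, 32000.0, 58000.0, 60000.0]
ages = [25.0, 60.0, 27.0, 62.0]

z_incomes = standardize(incomes)
z_ages = standardize(ages)

# Compare the first two people, variable by variable
raw_income_gap = abs(incomes[0] - incomes[1])    # 2000 dollars
raw_age_gap = abs(ages[0] - ages[1])             # 35 years
scaled_income_gap = abs(z_incomes[0] - z_incomes[1])
scaled_age_gap = abs(z_ages[0] - z_ages[1])

print(raw_income_gap, raw_age_gap)
print(round(scaled_income_gap, 3), round(scaled_age_gap, 3))
```

Before scaling, the 2000-dollar gap swamps the 35-year gap purely because of units; after scaling, the age gap is the larger of the two for this pair.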
9/14: Clustering
● Dendrogram: starts out with as many clusters as we have observations, then clusters observations together based on the distance between them
● Cannot have a nice visual representation with a large dataset, and it is computationally expensive (DRAWBACK OF CLUSTERING)
● Interpreting the clusters
  ○ Summarize descriptive statistics of each cluster
  ○ Find column means to know what kind of words to use to describe each cluster
  ○ Can use clusters to identify outliers
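Computing column means per cluster could be sketched as follows in Python. The rows, the cluster labels, and the helper name `cluster_means` are hypothetical.

```python
from collections import defaultdict

def cluster_means(rows, labels):
    """Column means for each cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in grouped.items()
    }

# Hypothetical data: two numeric columns and a cluster label per row
rows = [(1.0, 10.0), (3.0, 14.0), (20.0, 100.0), (22.0, 104.0)]
labels = [0, 0, 1, 1]

means = cluster_means(rows, labels)
print(means)  # {0: [2.0, 12.0], 1: [21.0, 102.0]}
```

Here cluster 0 could be described as "low on both variables" and cluster 1 as "high on both", which is the kind of wording the column means are meant to suggest.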
● Data are from the same population and are independent and normally distributed
● If you have one big cluster you may want to refine it to be able to find more patterns in detail
● Merge two clusters based on closest distance - single linkage method
  ○ May end up getting a cluster with a long shape
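A minimal Python sketch of bottom-up (agglomerative) merging with single linkage. The function names, the stopping rule `n_clusters`, and the points are assumptions for illustration; real implementations are far more efficient than this brute-force search.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, n_clusters=1):
    """Start with one cluster per point; merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None
        # Brute-force search for the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_distances

# Hypothetical points lying along a line
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
clusters, merge_distances = agglomerate(points, n_clusters=2)
print(clusters)
print(merge_distances)  # [1.0, 1.0, 1.0]
```

The first cluster grows one neighbour at a time along the line, which is the chaining behaviour behind the "long shape" remark. The recorded merge distances are the heights that a dendrogram would plot on its y axis.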

Institution: George Washington University
Course: DNSC 4280

Number of pages: 32
Professor(s): Zhengling Qi
Contains: All classes
