Summary Data Science CheatSheet--Time Saving


1. Distributions - Discrete: Binomial, Geometric, Negative Binomial, Hypergeometric, Poisson. Continuous: Uniform, Normal/Gaussian (Central Limit Theorem, Empirical Rule), Exponential, Gamma.
2. Core Concepts - Prediction Error = Bias² + Variance + Irreducible Noise. Bias-Variance Tradeoff: Underfitting vs. Overfitting. Model Types: Parametric vs. Non-Parametric. Cross Validation: k-fold, leave-p-out.
3. Model Evaluation - Regression: MSE, SSE, SST, R², Adjusted R². Classification Metrics: Precision, Recall, Specificity, F1, ROC/AUC, PR Curve.
4. Algorithms - Linear Regression: OLS, assumptions, multicollinearity (VIF), regularization (LASSO, Ridge, Elastic Net). Logistic Regression: Sigmoid, odds, assumptions. Decision Trees: CART for regression/classification, Gini, Entropy. Random Forest: Bootstrapping, Bagging, Out-of-Bag error, Variable Importance. SVM: Margin maximization, kernels, hinge loss, multiclass strategies. k-NN: Distance measures (Euclidean, Manhattan, Hamming). Clustering: k-means, hierarchical, evaluation (Silhouette, Davies-Bouldin).
5. Dimension Reduction - PCA: Eigenvectors, explained variance, Sparse PCA. LDA: Class separation, assumptions. Factor Analysis: Latent factor modeling, scree plot.
6. Natural Language Processing - Preprocessing: Tokenization, Lemmatization, Stemming, Stop words. Representations: Bag-of-Words, TF-IDF, Word2Vec, GloVe, BERT. Applications: Sentiment analysis, topic modeling (LDA, LSA).
7. Neural Networks - Basics: Perceptron, activation functions (Sigmoid, ReLU, Tanh, Softmax). Training: Loss functions, gradient descent, backpropagation, regularization (dropout, batch norm). CNNs: Convolutions, pooling, architecture. RNNs & LSTMs: Sequential data, vanishing/exploding gradients, gated cells.
8. Advanced Methods - Boosting: AdaBoost, Gradient Boost, XGBoost. Recommender Systems: Content-based, collaborative filtering, matrix factorization. Reinforcement Learning: Q-learning, DQN, Policy Gradients, Actor-Critic. Anomaly Detection: Statistical, density-based (kNN, LOF), tree-based (Isolation Forest), autoencoders, HMM.
9. Time Series - Characteristics: Stationarity, trend, seasonality, cyclicality, autocorrelation. Models: Exponential Smoothing, ARIMA, SARIMA, Prophet, GAMs. Cross-validation: Sliding window, forward chaining.
10. Statistics & Experimentation - Tests: z-test, t-test, Chi-Square, ANOVA. Errors: Type I (α), Type II (β), power, confidence intervals. A/B Testing: Sample size, MDE, multiple comparisons (Bonferroni), network effects, sequential testing, cohort analysis.
11. Miscellaneous - Interpretability: Shapley values, SHAP. Probability Tools: Permutations, combinations, skewness. Likelihood vs. Probability.

This cheat sheet is essentially a compressed reference for the full lifecycle of data science: from probability/statistics foundations → ML models → deep learning → NLP → recommender systems → reinforcement learning → experimental design.

School, study and subject

Institution
Data Science MS
Degree
Data Science MS

Document information

Uploaded on
August 27, 2025
Number of pages
5
Written in
2025/2026
Type
Summary

Topics

Content preview

Data Science Cheatsheet

Distributions

Discrete
– Binomial - x successes in n events, each with p probability → (n choose x) p^x q^(n−x), with µ = np and σ² = npq
  – If n = 1, this is a Bernoulli distribution
– Geometric - first success with p probability on the nth trial → q^(n−1) p, with µ = 1/p and σ² = (1 − p)/p²
– Negative Binomial - number of failures before r successes
– Hypergeometric - x successes in n draws, no replacement, from a size N population with X items of that feature → (X choose x)(N − X choose n − x) / (N choose n), with µ = nX/N
– Poisson - number of successes x in a fixed time interval, where success occurs at an average rate λ → λ^x e^(−λ) / x!, with µ = σ² = λ

Continuous
– Uniform - all values between a and b are equally likely → 1/(b − a), with µ = (a + b)/2 and σ² = (b − a)²/12, or (n² − 1)/12 if discrete
– Normal/Gaussian N(µ, σ), Standard Normal Z ∼ N(0, 1)
  – Central Limit Theorem - sample mean of i.i.d. data approaches a normal distribution
  – Empirical Rule - 68%, 95%, and 99.7% of values lie within one, two, and three standard deviations of the mean
  – Normal Approximation - discrete distributions such as Binomial and Poisson can be approximated using z-scores when np, nq, and λ are greater than 10
– Exponential - memoryless time between independent events occurring at an average rate λ → λe^(−λx), with µ = 1/λ
– Gamma - time until n independent events occurring at an average rate λ
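As a quick numerical check of the Binomial formulas above, here is a minimal sketch using only NumPy and the standard library (n, p, and the variable names are illustrative):

    import numpy as np
    from math import comb

    n, p = 10, 0.3
    q = 1 - p
    x = np.arange(n + 1)
    pmf = np.array([comb(n, k) * p**k * q**(n - k) for k in x])   # (n choose x) p^x q^(n-x)

    mean = (x * pmf).sum()                 # matches mu = n*p = 3.0
    var = ((x - mean) ** 2 * pmf).sum()    # matches sigma^2 = n*p*q = 2.1
    print(mean.round(3), var.round(3))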

Concepts

Prediction Error = Bias² + Variance + Irreducible Noise
Bias - wrong assumptions when training → can't capture underlying patterns → underfit
Variance - sensitive to fluctuations when training → can't generalize on unseen data → overfit
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross validation to generalize to unseen data
– Dimension reduction and feature selection
In all cases, as variance decreases, bias increases.
ML models can be divided into two types:
– Parametric - uses a fixed number of parameters with respect to sample size
– Non-Parametric - uses a flexible number of parameters and doesn't make particular assumptions on the data
Cross Validation - validates test error with a subset of training data, and selects parameters to maximize average performance
– k-fold - divide data into k groups, and use one to validate
– leave-p-out - use p samples to validate and the rest to train
Model Evaluation

Regression
Mean Squared Error (MSE) = (1/n) Σ(yᵢ − ŷ)²
Sum of Squared Error (SSE) = Σ(yᵢ − ŷ)²
Total Sum of Squares (SST) = Σ(yᵢ − ȳ)²
R² = 1 − SSE/SST, the proportion of explained y-variability
Note, negative R² means the model is worse than just predicting the mean. R² is not valid for nonlinear models, as SS_residual + SS_error ≠ SST.
Adjusted R² = 1 − (1 − R²)(N − 1)/(N − p − 1), which changes only when predictors affect R² above what would be expected by chance

Classification
                Predict Yes               Predict No
Actual Yes      True Positive (1 − β)     False Negative (β)
Actual No       False Positive (α)        True Negative (1 − α)
– Precision = TP / (TP + FP), percent correct when predicting positive
– Recall, Sensitivity = TP / (TP + FN), percent of actual positives identified correctly (True Positive Rate)
– Specificity = TN / (TN + FP), percent of actual negatives identified correctly, also 1 − FPR (True Negative Rate)
– F1 = 2 · precision · recall / (precision + recall), useful when classes are imbalanced
ROC Curve - plots TPR vs. FPR for every threshold α. Area Under the Curve measures how likely the model differentiates positives and negatives (perfect AUC = 1, baseline = 0.5).
Precision-Recall Curve - focuses on the correct prediction of the minority class, useful when data is imbalanced
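The classification metrics above reduce to a few ratios of confusion-matrix counts; a small sketch (the counts are made up):

    def classification_metrics(tp, fp, fn, tn):
        precision = tp / (tp + fp)                      # correct among predicted positives
        recall = tp / (tp + fn)                         # sensitivity / true positive rate
        specificity = tn / (tn + fp)                    # 1 - false positive rate
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, specificity, f1

    print(classification_metrics(tp=40, fp=10, fn=20, tn=30))   # (0.8, 0.667, 0.75, 0.727)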
Linear Regression

Models linear relationships between a continuous response and explanatory variables
Ordinary Least Squares - find β̂ for ŷ = β̂₀ + β̂X + ε by solving β̂ = (XᵀX)⁻¹XᵀY, which minimizes the SSE
Assumptions
– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity
Variance Inflation Factor - measures the severity of multicollinearity → 1/(1 − Rᵢ²), where Rᵢ² is found by regressing Xᵢ against all other variables (a common VIF cutoff is 10)

Regularization
Add a penalty λ for large coefficients to the cost function, which reduces overfitting. Requires normalized data.
Subset (L0): λ‖β̂‖₀ = λ · (number of non-zero variables)
– Computationally slow, need to fit 2^k models
– Alternatives: forward and backward stepwise selection
LASSO (L1): λ‖β̂‖₁ = λ Σ|β̂|
– Shrinks coefficients to zero, and is robust to outliers
Ridge (L2): λ‖β̂‖₂ = λ Σ(β̂)²
– Reduces effects of multicollinearity
Combining LASSO and Ridge gives Elastic Net
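A minimal NumPy sketch of the closed-form OLS solution β̂ = (XᵀX)⁻¹XᵀY on simulated data, with a Ridge (L2) variant for comparison (the design matrix, coefficients, and λ are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # intercept + 2 features
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.1, size=100)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                     # (X^T X)^{-1} X^T y
    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)   # ridge penalty added
    print(beta_hat.round(2), beta_ridge.round(2))

Solving the normal equations directly, rather than explicitly inverting XᵀX, is the usual numerically safer choice.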
Logistic Regression

Predicts the probability that y belongs to a binary class. Estimates β through maximum likelihood estimation (MLE) by fitting a logistic (sigmoid) function to the data. This is equivalent to minimizing the cross entropy loss. Regularization can be added in the exponent.
P(Y = 1) = 1 / (1 + e^(−(β₀ + βx)))
The threshold a classifies predictions as either 1 or 0
Assumptions
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity
Odds - the output probability can be transformed using Odds(Y = 1) = P(Y = 1) / (1 − P(Y = 1)), where a probability of 1/3 corresponds to 1:2 odds
Coefficients are linearly related to odds, such that a one unit increase in x₁ affects the odds by e^β₁
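A sketch of the sigmoid, the implied probability, and the odds ratio for a one-unit increase in x (the coefficients and input are made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def odds(p):
        return p / (1.0 - p)

    beta0, beta1, x = -1.0, 0.8, 2.0         # illustrative coefficients and input
    p1 = sigmoid(beta0 + beta1 * x)          # P(Y = 1 | x)
    p2 = sigmoid(beta0 + beta1 * (x + 1))    # P(Y = 1 | x + 1)
    print(odds(p2) / odds(p1), np.exp(beta1))   # both equal e^beta1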
Decision Trees

Classification and Regression Tree (CART)
CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (small cp → deep tree).
CART for classification minimizes the sum of region impurity, where p̂ᵢ is the probability of a sample being in category i. Possible measures, each with a max impurity of 0.5:
– Gini Impurity = 1 − Σ(p̂ᵢ)²
– Cross Entropy = −Σ(p̂ᵢ) log₂(p̂ᵢ)
At each leaf node, CART predicts the most frequent category, assuming false negative and false positive costs are the same. The splitting process handles multicollinearity and outliers. Trees are prone to high variance, so tune through CV.
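A sketch of the two impurity measures for a candidate node, using NumPy (the label vector is illustrative):

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def cross_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    node = np.array([0, 0, 0, 1, 1])                 # a node with a 3:2 class mix
    print(gini(node), cross_entropy(node))           # 0.48 and ~0.971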
Random Forest

Trains an ensemble of trees that vote for the final prediction
Bootstrapping - sampling with replacement (will contain duplicates), until the sample is as large as the training set
Bagging - training independent models on different subsets of the data, which reduces variance. Each tree is trained on ∼63% of the data, so the out-of-bag 37% can estimate prediction error without resorting to CV.
Deep trees may overfit, but adding more trees does not cause overfitting. Model bias is always equal to one of its individual trees.
Variable Importance - ranks variables by their ability to minimize error when split upon, averaged across all trees
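The ∼63% / 37% split follows directly from bootstrapping; a quick simulation sketch with NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    bootstrap = rng.integers(0, n, size=n)         # sample row indices with replacement, same size as training set
    oob_fraction = 1 - np.unique(bootstrap).size / n
    print(oob_fraction)                            # ~0.37, i.e. roughly e^(-1) of rows are out-of-bag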

Support Vector Machines

Separates data between two classes by maximizing the margin between the hyperplane and the nearest data points of any class. Relies on the following:
Support Vector Classifiers - account for outliers through the regularization parameter C, which penalizes misclassifications in the margin by a factor of C > 0
Kernel Functions - solve nonlinear problems by computing the similarity between points a, b and mapping the data to a higher dimension. Common functions:
– Polynomial (ab + r)^d
– Radial e^(−γ(a − b)²), where smaller γ → smoother boundaries
Hinge Loss - max(0, 1 − yᵢ(wᵀxᵢ − b)), where w is the margin width, b is the offset bias, and classes are labeled ±1. Acts as the cost function for SVM. Note, even a correct prediction inside the margin gives loss > 0.
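A minimal sketch of the radial kernel and the hinge loss as defined above (the weights, bias, and points are illustrative):

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        """Radial basis kernel: exp(-gamma * ||a - b||^2)."""
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def hinge_loss(w, b, X, y):
        """Mean of max(0, 1 - y_i (w.x_i - b)) with labels y in {-1, +1}."""
        margins = y * (X @ w - b)
        return np.mean(np.maximum(0.0, 1.0 - margins))

    X = np.array([[2.0, 1.0], [0.3, 0.2], [-1.0, -2.0]])
    y = np.array([1, 1, -1])
    w, b = np.array([1.0, 1.0]), 0.0
    # the second point is correctly classified but sits inside the margin, so it still adds loss
    print(rbf_kernel(X[0], X[2], gamma=0.5), hinge_loss(w, b, X, y))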
Multiclass Prediction
To classify data with 3+ classes C, a common method is to binarize the problem through:
– One vs. Rest - train a classifier for each class cᵢ by setting cᵢ's samples as 1 and all others as 0, and predict the class with the highest confidence score
– One vs. One - train C(C − 1)/2 models, one for each pair of classes, and predict the class with the highest number of positive predictions
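A sketch of the One vs. Rest mechanics: binarize the labels per class, then predict by the highest per-class confidence score (the score matrix stands in for the outputs of C binary classifiers):

    import numpy as np

    def one_vs_rest_labels(y, classes):
        """One {0,1} target vector per class: 1 for that class, 0 for the rest."""
        return {c: (y == c).astype(int) for c in classes}

    classes = np.array(['a', 'b', 'c'])
    y = np.array(['a', 'c', 'b', 'a'])
    print(one_vs_rest_labels(y, classes))

    scores = np.array([[0.2, 0.7, 0.1],       # per-sample confidence from the 3 binary classifiers
                       [0.6, 0.3, 0.9]])
    print(classes[np.argmax(scores, axis=1)])   # predict the class with the highest score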
k-Nearest Neighbors
Non-parametric method that calculates ŷ using the average value or most common class of its k-nearest points. For high-dimensional data, information is lost through equidistant vectors, so dimension reduction is often applied prior to k-NN.
Minkowski Distance = (Σ|aᵢ − bᵢ|^p)^(1/p)
– p = 1 gives Manhattan distance Σ|aᵢ − bᵢ|
– p = 2 gives Euclidean distance √(Σ(aᵢ − bᵢ)²)
Hamming Distance - count of the differences between two vectors, often used to compare categorical variables
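A small NumPy sketch of Minkowski distance and a majority-vote k-NN prediction (the toy points and k are illustrative):

    import numpy as np

    def minkowski(a, b, p=2):
        """(sum |a_i - b_i|^p)^(1/p): p=1 is Manhattan, p=2 is Euclidean."""
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    def knn_predict(X_train, y_train, x, k=3):
        """Most common class among the k nearest training points."""
        dists = np.array([minkowski(xi, x) for xi in X_train])
        nearest = y_train[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        return values[np.argmax(counts)]

    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([4.5, 5.0]), k=3))   # 1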
Clustering
Unsupervised, non-parametric methods that group similar data points together based on distance

k-Means
Randomly place k centroids across normalized data, and assign observations to the nearest centroid. Recalculate centroids as the mean of assignments and repeat until convergence. Using the median or medoid (an actual data point) may be more robust to noise and outliers. k-modes is used for categorical data.
k-means++ - improves selection of initial clusters
1. Pick the first center randomly
2. Compute the distance between points and the nearest center
3. Choose a new center using a weighted probability distribution proportional to distance
4. Repeat until k centers are chosen
Evaluating the number of clusters and performance:
Silhouette Value - measures how similar a data point is to its own cluster compared to other clusters, and ranges from 1 (best) to −1 (worst)
Davies-Bouldin Index - ratio of within-cluster scatter to between-cluster separation, where lower values are better

Hierarchical Clustering
Clusters data into groups using a predominant hierarchy
Agglomerative Approach
1. Each observation starts in its own cluster
2. Iteratively combine the most similar cluster pairs
3. Continue until all points are in the same cluster
Divisive Approach - all points start in one cluster and splits are performed recursively down the hierarchy
Linkage Metrics - measure dissimilarity between clusters and combine them using the minimum linkage value over all pairwise points in different clusters by comparing:
– Single - the distance between the closest pair of points
– Complete - the distance between the farthest pair of points
– Ward's - the increase in within-cluster SSE if two clusters were to be combined
Dendrogram - plots the full hierarchy of clusters, where the height of a node indicates the dissimilarity between its children

Dimension Reduction
High-dimensional data can lead to the curse of dimensionality, which increases the risk of overfitting and decreases the value added. The number of samples for each feature combination quickly becomes sparse, reducing model performance.

Principal Component Analysis
Projects data onto orthogonal vectors that maximize variance. Remember, given an n × n matrix A, a nonzero vector x, and a scalar λ, if Ax = λx then x and λ are an eigenvector and eigenvalue of A. In PCA, the eigenvectors are uncorrelated and represent principal components.
1. Start with the covariance matrix of standardized data
2. Calculate eigenvalues and eigenvectors using SVD or eigendecomposition
3. Rank the principal components by their proportion of variance explained = λᵢ / Σλ
Data should be linearly related, and for a p-dimensional dataset, there will be p principal components.
Note, PCA explains the variance in X, not necessarily Y.
Sparse PCA - constrains the number of non-zero values in each component, reducing susceptibility to noise and improving interpretability

Linear Discriminant Analysis
Supervised method that maximizes separation between classes and minimizes variance within classes for a labeled dataset
1. Compute the mean and variance of each independent variable for every class Cᵢ
2. Calculate the within-class (σ²_w) and between-class (σ²_b) variance
3. Find the matrix W = (σ²_w)⁻¹(σ²_b) that maximizes Fisher's signal-to-noise ratio
4. Rank the discriminant components by their signal-to-noise ratio λ
Note, the number of components is at most (number of classes) − 1
Assumptions
– Independent variables are normally distributed
– Homoscedasticity - constant variance of error
– Low multicollinearity

Factor Analysis
Describes data using a linear combination of k latent factors. Given a normalized matrix X, it follows the form X = Lf + ε, with factor loadings L and hidden factors f.
Scree Plot - graphs the eigenvalues of factors (or principal components) and is used to determine the number of factors to retain. The 'elbow' where values level off is often used as the cutoff.