CS 229 Supervised Learning Cheatsheet (Updated for 2025/2026)
Linear Regression
1. Hypothesis:
h_\theta(x) = \theta^T x
2. MSE Loss:
J(\theta) = \frac{1}{2m}\sum_{i=1}^m \big(y^{(i)} - \theta^T x^{(i)}\big)^2
3. Normal Equation (MLE):
\hat{\theta} = (X^T X)^{-1} X^T y
4. Regularized Normal Equation (ridge; see the NumPy sketch after this section):
\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y
5. Probabilistic model (Gaussian noise):
p(y \mid x; \theta) = \mathcal{N}(y \mid \theta^T x, \sigma^2)
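A quick NumPy sketch of formulas 3-4 (the function name normal_equation and the example data are hypothetical, not part of the original notes): it solves the plain and ridge-regularized normal equations, using np.linalg.solve rather than forming the explicit inverse, which is the numerically safer choice.

import numpy as np

def normal_equation(X, y, lam=0.0):
    # Closed-form fit: theta = (X^T X + lam * I)^{-1} X^T y.
    # lam = 0 recovers formula 3; lam > 0 gives the regularized version in formula 4.
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)  # avoids computing the matrix inverse explicitly

# Hypothetical usage on synthetic data
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
theta_hat = normal_equation(X, y, lam=1e-2)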
Logistic Regression
6. Sigmoid:
\sigma(z) = \frac{1}{1 + e^{-z}}
7. Hypothesis:
h_\theta(x) = \sigma(\theta^T x)
8. Log-likelihood:
\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big)
9. Gradient:
\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big)\, x^{(i)}
10. Hessian (used in the Newton's method sketch after this section):
H = -\sum_{i=1}^m h_\theta(x^{(i)})\big(1 - h_\theta(x^{(i)})\big)\, x^{(i)} {x^{(i)}}^T
11. MAP with Gaussian prior:
\ell_{\mathrm{MAP}}(\theta) = \ell(\theta) - \frac{\lambda}{2}\|\theta\|^2
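A minimal Newton's method sketch for formulas 8-11 (the names sigmoid and newton_logistic are hypothetical): each iteration computes the gradient (formula 9) and Hessian (formula 10) and updates theta <- theta - H^{-1} grad, which ascends the log-likelihood because H is negative definite; setting lam > 0 optimizes the MAP objective of formula 11 instead.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, lam=0.0, n_iter=10):
    # y is a 0/1 label vector; lam > 0 adds the Gaussian prior of formula 11.
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        h = sigmoid(X @ theta)                         # h_theta(x^(i)) for every example
        grad = X.T @ (y - h) - lam * theta             # formula 9 (minus lam*theta for MAP)
        W = h * (1.0 - h)                              # per-example weights
        H = -(X * W[:, None]).T @ X - lam * np.eye(d)  # formula 10 (minus lam*I for MAP)
        theta = theta - np.linalg.solve(H, grad)       # Newton step on the (penalized) log-likelihood
    return theta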
Generalized Linear Models (GLMs)
12. Canonical form:
p(y; \eta) = b(y)\exp\big(\eta^T T(y) - a(\eta)\big)
13. Link function: