Exam (elaborations)

Machine Learning Cheatsheet + 27 Exam Questions (No Answers)

Rating: 3.0 (4 reviews)
Sold: 19
Pages: 4
Grade: 7-8
Uploaded on: 17-01-2025
Written in: 2024/2025

Machine Learning Cheatsheet + 27 questions that were asked in the exam. No answers are included.

Document information

Uploaded on: January 17, 2025
Number of pages: 4
Written in: 2024/2025
Type: Exam (elaborations)
Contains: Only questions

Content preview

Introduction to ML

Decision trees use logic-based "if-then" rules. Logistic regression uses weighted features. Regression: predicts a number. (Binary) classification: categorizes data (spam or not spam). Multi-class classification: picks one of several classes (sports, finance, or politics). Multi-label classification: assigns several labels at once (jazz and pop). Sequence labeling: assigns a label to each element in a sequence. Sequence to sequence: maps one sequence to another (Latin to English). Train: to teach the model. Validation: to tune and improve the model during development. Test: for final evaluation. Cross-validation: used when the dataset is small or splits may be unrepresentative (break the training data into 10 parts, train on 9 and test on 1). Data points are randomly assigned to folds to avoid bias. Stratification: all classes are proportionally represented in the training and validation splits. Time series: past data for training and future data for validation. Time-series cross-validation: expand the training set while ensuring validation is always forward-looking.
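
As a rough illustration of the 10-fold idea described above, the following Python sketch randomly assigns data points to folds and yields train/validation pairs. The helper name and the toy data are made up for illustration, not taken from the exam material.

    import random

    def k_fold_splits(data, k=10, seed=0):
        """Randomly assign data points to k folds, then yield (train, validation) pairs."""
        indices = list(range(len(data)))
        random.Random(seed).shuffle(indices)       # random assignment avoids ordering bias
        folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
        for i in range(k):
            validation = [data[j] for j in folds[i]]
            train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, validation

    # Train on 9 parts, validate on the remaining 1, ten times over
    for train, validation in k_fold_splits(list(range(50)), k=10):
        pass  # fit on `train`, score on `validation`, then average the scores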

MAE: average absolute difference between predictions and true values. MSE: squares the differences. Precision: minimizing false positives, TP / (TP + FP). Recall: minimizing false negatives, TP / (TP + FN). Accuracy: proportion of correct predictions, (TP + TN) / Total. Error rate: proportion of mistakes, (FP + FN) / Total. F-score: harmonic mean of precision and recall, 2 * (P * R) / (P + R). β = 1 means precision and recall are equally important; β > 1 means recall is more important; β < 1 means precision is more important.
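
The evaluation formulas above, written out as a small Python calculation; the confusion-matrix counts are invented purely to show the arithmetic.

    # Hypothetical confusion-matrix counts
    TP, FP, FN, TN = 40, 10, 5, 45
    total = TP + FP + FN + TN

    precision = TP / (TP + FP)                  # fraction of predicted positives that were correct
    recall = TP / (TP + FN)                     # fraction of actual positives that were found
    accuracy = (TP + TN) / total
    error_rate = (FP + FN) / total
    f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean (beta = 1)

    beta = 2                                    # beta > 1 weights recall more heavily
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    print(precision, recall, accuracy, error_rate, f1, f_beta)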

Decision trees & Forests

Use the feature that best divides the data by class. A good split has less impurity. Misclassification impurity: measures the mistakes made when labeling data after a split; it can be less effective with more than two classes (calculated as 1 - proportion of the majority class). Gini impurity: measures the likelihood of a random element being incorrectly classified. Entropy: measures the uniformity of a distribution (all classes equal → high entropy → high uncertainty). Lower impurity = better question. Trees are built incrementally, one split at a time. Goal: create a structure that minimizes mistakes and simplifies decision-making. Branch node: holds a question. Leaf node: holds the class label. Base case: leaf node. Recursive case: a function that calls itself until some base case is reached. Handling numerical data: convert numerical features into binary questions (is the size >= the threshold?). The depth of the tree determines classification speed (balanced trees are faster: classification time grows logarithmically with the number of leaf nodes). Advantages of DT: easy to interpret and visualize, especially with smaller trees. Disadvantages of DT: large trees can become hard to interpret and are prone to overfitting (to prevent overfitting, control the depth of the tree, use pruning, or set minimum sample sizes for splits). Pruning: removes unnecessary nodes to simplify the tree. Random forest: a collection of decision trees, each trained on a different subset of the data, often using majority voting for classification. Advantages: reduces overfitting and improves generalization. Disadvantages: less interpretable.
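
A minimal sketch of how a single split could be chosen by impurity, with the three impurity measures written out. The feature values, labels, and threshold loop are made up and not part of the original cheatsheet.

    import math
    from collections import Counter

    def impurities(labels):
        """Misclassification, Gini, and entropy impurity of one list of class labels."""
        counts = Counter(labels)
        props = [c / len(labels) for c in counts.values()]
        misclass = 1 - max(props)                        # 1 - proportion of the majority class
        gini = 1 - sum(p * p for p in props)             # chance a random element is mislabeled
        entropy = -sum(p * math.log2(p) for p in props)  # high when classes are uniform
        return misclass, gini, entropy

    # A numerical feature turned into binary questions: "is the size >= threshold?"
    sizes = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
    labels = ["A", "A", "A", "B", "B", "B"]

    best = None
    for threshold in sizes:
        left = [l for s, l in zip(sizes, labels) if s < threshold]
        right = [l for s, l in zip(sizes, labels) if s >= threshold]
        if not left or not right:
            continue
        # weighted Gini impurity of the split: a lower value means a better question
        score = sum(len(part) / len(labels) * impurities(part)[1] for part in (left, right))
        if best is None or score < best[0]:
            best = (score, threshold)

    print(best)  # (0.0, 3.0): splitting at size >= 3.0 separates the classes perfectly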

Gradient descent

Linear regression: the target is computed using weights (coefficients w and b) applied to the features. Gradient descent: an optimization algorithm that adjusts model parameters (like weights and intercept) step by step to minimize the error between predictions and actual values (it can work in high-dimensional spaces). Goal: find the minimum of a function. Stop when the change in w becomes very small. SSE: sum of squared errors, used to evaluate how well the model fits the data. Slope: describes the steepness along a single dimension. Derivative: gives the slope at a specific point. Gradient: a collection of slopes, one for each dimension. Learning rate: controls the size of the steps (small = slow progress, large = risk of going too far). A typical approach is to start with a large learning rate, then decrease it as the model becomes more refined; a smaller learning rate near the minimum helps fine-tune the model. Update rule: adjust the model's weights using the slope and the learning rate, and keep repeating until the weights stop changing much. Disadvantages for large data: computationally expensive, requires more memory and resources → solution: use a subset of the data, or SGD (Stochastic Gradient Descent), where only one example is used for each update (erratic movement, as updates are based on random subsets of the data, but it still ends up near the minimum). Advantages: faster updates, better generalization, less overfitting. Batch gradient descent: moves steadily towards the minimum (smooth path). Momentum: helps smooth out the noisy updates in SGD by combining the current gradient with past updates, with the degree of smoothing controlled by the parameter β. Local minima: places where the model gets stuck, unable to reach the global minimum. SGD can help the model escape local minima. In higher-dimensional settings (like neural networks), local minima are less of a problem because the structure of the error function is more complex and multidimensional. Autodiff: automatically computes derivatives by applying calculus rules to the model's computation graph.
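
A small sketch of the update rule and stopping condition for stochastic gradient descent on one-feature linear regression. The data, learning rate, and stopping threshold are arbitrary illustrative choices.

    import random

    random.seed(0)
    # Toy data roughly following y = 2x + 1, with x between 0 and 1
    data = [(x / 20, 2 * (x / 20) + 1 + random.uniform(-0.05, 0.05)) for x in range(20)]

    w, b = 0.0, 0.0
    lr = 0.1                       # learning rate: controls the step size
    for epoch in range(500):
        random.shuffle(data)       # SGD: visit examples in random order
        old_w = w
        for x, y in data:          # one update per example
            error = (w * x + b) - y
            # slope of the squared error (w*x + b - y)^2 with respect to w and b
            w -= lr * 2 * error * x
            b -= lr * 2 * error
        if abs(w - old_w) < 1e-6:  # stop when the change in w becomes very small
            break

    print(w, b)  # should end up near w = 2, b = 1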

Gradient boosted trees

Ensemble: multiple models work together to make predictions (each model contributes its own prediction). When we combine their predictions, the mistakes tend to cancel each other out. Bagging (Bootstrap Aggregating): technique used in random forests (taking random samples of the data to train each decision tree). Residual: difference between predictions and actual values. MAE and MSE can be used to measure how well the models are performing. Ensembles have less spread and lower error; their errors need to be uncorrelated. For classification, instead of averaging, we use voting (majority vote). Gradient boosting: builds a model step by step, starting with a simple prediction and then adding small trees that focus on correcting the errors of the previous ones (each tree is fitted to the negative gradient (residual) of the loss function, which allows GB to handle different loss functions for various tasks). Negative gradient: shows the direction and size of the adjustment needed to reduce the model's prediction error as quickly as possible. Squared loss: exaggerates the influence of large errors (outliers). Absolute loss: less sensitive to large errors and outliers. Huber loss: uses squared loss for smaller residuals and absolute loss for larger residuals. GB in regression: we use decision trees and rely on residuals and gradients to update the model.
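
A compact sketch of gradient boosting for regression with squared loss, where each "small tree" is a hand-rolled depth-1 stump fitted to the current residuals. The data, learning rate, and number of rounds are all made up for illustration.

    xs = [1, 2, 3, 4, 5, 6, 7, 8]
    ys = [1.2, 1.9, 3.1, 3.9, 6.2, 6.8, 8.1, 8.4]

    def fit_stump(xs, residuals):
        """Find the threshold split that best predicts the residuals (one mean per side)."""
        best = None
        for t in xs:
            left = [r for x, r in zip(xs, residuals) if x <= t]
            right = [r for x, r in zip(xs, residuals) if x > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, t, lm, rm)
        _, t, lm, rm = best
        return lambda x: lm if x <= t else rm

    base = sum(ys) / len(ys)      # start with a simple prediction: the mean
    pred = [base] * len(ys)
    trees, lr = [], 0.3           # the learning rate shrinks each tree's contribution
    for _ in range(20):
        residuals = [y - p for y, p in zip(ys, pred)]  # negative gradient of the squared loss (up to a constant)
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + lr * stump(x) for x, p in zip(xs, pred)]

    def predict(x):
        """Combine the simple starting prediction with every correction tree."""
        return base + lr * sum(t(x) for t in trees)

    print([round(predict(x), 2) for x in xs])  # should track ys closely after enough rounds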

GB in classification: the problem becomes more complex, so we use sequential addition of trees (instead of a single tree for all classes, the model builds separate trees for each class, assigning scores rather than labels, and combines them). Softmax: converts raw scores into probabilities, ensuring they are between 0 and 1 and sum to 1 across all classes (used in the output layer for multi-class problems). The true class is represented as a "one-hot" distribution (all probabilities are 0 except for the correct class, which is 1). Cross entropy: measures how well the predicted probabilities match the true one-hot labels, aiming to minimize the difference, using concepts from entropy and KL divergence.
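
The softmax and cross-entropy calculations above, written out for a single example; the raw scores and the one-hot label are invented numbers.

    import math

    scores = [2.0, 1.0, 0.1]                 # raw scores for three classes
    one_hot = [1, 0, 0]                      # true class as a one-hot distribution

    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]    # softmax: each between 0 and 1, summing to 1

    # Cross entropy: -sum of true_prob * log(predicted_prob); with a one-hot label
    # this reduces to -log(probability assigned to the correct class)
    cross_entropy = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
    print(probs, cross_entropy)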

Linear classifiers

Perceptron: computes a weighted sum of the input features (plus a bias); if the sum >= 0, it outputs +1, otherwise it outputs -1. Linear classifier: draws a straight line (boundary) to separate data into different groups. Bias: helps the perceptron decide when there is no information or when all features are zero (if misclassified, increase or decrease the bias). Weights: if misclassified, adjust the weights by adding or subtracting the feature values (x). Batch learning: models that use the whole dataset at once to train (decision trees). Online learning: models like the perceptron that update one example at a time, useful when data is continuously generated (like social media posts). The order of examples affects how the model learns, so it is important to randomize the data order. Zero-one loss: counts classification mistakes as 0 (correct) or 1 (incorrect); it gives no slope when used for gradient descent, making it unhelpful for learning.
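
A minimal perceptron sketch matching the update rule above (adjust weights and bias only on a mistake, after randomizing the example order); the two-feature toy data is made up.

    import random

    random.seed(0)
    # Toy data: label +1 when x1 + x2 > 1, otherwise -1
    data = []
    for _ in range(50):
        x = (random.random(), random.random())
        data.append((x, 1 if x[0] + x[1] > 1 else -1))

    w, b = [0.0, 0.0], 0.0
    for _ in range(20):                            # online learning: one example at a time
        random.shuffle(data)                       # the order matters, so randomize it
        for x, y in data:
            score = w[0] * x[0] + w[1] * x[1] + b  # weighted sum of features plus bias
            pred = 1 if score >= 0 else -1
            if pred != y:                          # misclassified: adjust weights and bias
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y

    print(w, b)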

Logistic regression: unlike the perceptron, uses probabilities to estimate class likelihoods and minimizes errors through a loss function, typically trained using gradient descent. It works with the logit, the logarithm of the odds ratio, which shows the relative likelihood of the positive class versus the negative class and maps probabilities (0 to 1) to real numbers (minus infinity to plus infinity). Worked example: if the logit w * x + b = 3.0, then e^3 ≈ 20.09, so e^(-3) ≈ 1 / 20.09 ≈ 0.0498, and the probability is 1 / (1 + 0.0498) ≈ 0.95. Inverse logit (sigmoid): transforms real numbers (negative infinity to positive infinity) into probabilities (0 to 1), suitable for binary classification. The perceptron only updates its weights when there is an error; logistic regression updates its weights based on the difference between the predicted probabilities and the true labels. To prevent overfitting, add a regularization term (L2 regularization) that penalizes large weights by adding a penalty to the loss function, controlled by a hyperparameter (alpha); larger alpha values encourage simpler models with smaller weights, while smaller values allow the model to fit the data more closely. SVM: uses hinge loss as its loss function to create a decision boundary that maximizes the margin between classes. (Logistic regression predicts probabilities (output between 0 and 1); linear regression predicts continuous numerical values.)
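
A short sketch of the sigmoid, the worked logit example, and a logistic-regression style update in which every example moves the weights by the probability error. The toy data, learning rate, and alpha value are made-up illustrative choices; the L2 penalty appears as the alpha term that shrinks the weights.

    import math, random

    def sigmoid(z):
        """Inverse logit: maps any real number to a probability between 0 and 1."""
        return 1 / (1 + math.exp(-z))

    print(sigmoid(3.0))  # the worked example: a logit of 3.0 gives a probability of about 0.95

    random.seed(0)
    data = []                                   # toy data: label 1 when x1 + x2 > 1, else 0
    for _ in range(50):
        x = (random.random(), random.random())
        data.append((x, 1 if x[0] + x[1] > 1 else 0))

    w, b = [0.0, 0.0], 0.0
    lr, alpha = 0.5, 0.01                       # learning rate and L2 penalty strength
    for _ in range(200):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)   # predicted probability
            err = p - y                         # difference between prediction and true label
            # gradient step with an L2 penalty that shrinks large weights
            w = [wi - lr * (err * xi + alpha * wi) for wi, xi in zip(w, x)]
            b -= lr * err

    print(w, b)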
