Summary Cheatsheet Final Exam Machine Learning

Detailed cheatsheet with all important notes for the final exam of the course Machine Learning. I passed the exam by studying only this cheatsheet.

Document information

Uploaded on: February 16, 2020
Number of pages: 2
Written in: 2019/2020
Type: Summary

Content preview

Introduction. Machine Learning: learning to solve problems from examples; come up with an algorithm by learning from data and then apply it to new cases. Binary classification: either x vs. y (positive vs. negative) or x vs. non-x (spam vs. non-spam). ROC curve: plots the true positive rate (sensitivity) against the false positive rate (1 – specificity). Cross validation: break the training data into 10 parts, train on 9 and test on the remaining 1, rotating the held-out part. LOO: cross validation with K = N folds, i.e. leaving one example out at a time. Good for KNN, otherwise expensive to run.
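A minimal k-fold split along these lines (not from the cheatsheet itself; `train` and `evaluate` are hypothetical placeholders for any model's fit and score routines):

```python
# K-fold cross validation sketch: train on k-1 folds, test on the held-out fold, rotate.
def k_fold_cv(examples, labels, k, train, evaluate):
    n = len(examples)
    fold_size = n // k
    scores = []
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        test_x, test_y = examples[start:stop], labels[start:stop]
        train_x = examples[:start] + examples[stop:]
        train_y = labels[:start] + labels[stop:]
        model = train(train_x, train_y)                   # fit on the other k-1 folds
        scores.append(evaluate(model, test_x, test_y))    # score on the held-out fold
    return sum(scores) / k                                # average score over the folds

# Leave-one-out (LOO) is the special case k = len(examples).
```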
Confidence: a confidence of 95% means that if you reran the experiment 100 times, roughly 95 of those runs would do at least as well. Debugging: collect data, choose features, choose a model family, choose training data, train the model, evaluate on test data. Canonical Learning Problems: regression, binary classification, multiclass classification, multi-label classification, ranking, sequence labelling, sequence-to-sequence labelling, autonomous behaviour. MSE: average squared difference between true and predicted value. MAE: average absolute difference between true and predicted value. FP: not spam, marked as spam. FN: not marked as spam, is spam. TP: marked as spam, is spam. TN: not marked as spam, is not spam. Accuracy: proportion of correct predictions, (TP + TN) / (P + N) = 1 – error rate. Error: proportion of mistakes. Precision: of everything marked as x, how much was actually x? P = TP / (TP + FP). Recall: of all x out there, how much did we find? R = TP / (TP + FN). F-score: harmonic mean of precision and recall, F1 = 2 * ((P * R) / (P + R)). Macro Average: compute precision and recall per class, then average over classes. Micro Average: pool the counts over all classes (each correct prediction a TP, each missed label an FN, each incorrect prediction an FP) and compute the metrics once.
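The confusion-matrix metrics above can be computed roughly like this (an illustrative sketch, not course code):

```python
# Binary evaluation metrics from true vs. predicted labels.
def binary_metrics(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    accuracy = (tp + tn) / len(y_true)                        # = 1 - error rate
    precision = tp / (tp + fp) if tp + fp else 0.0            # of everything marked x, how much was x?
    recall = tp / (tp + fn) if tp + fn else 0.0               # of all x out there, how much did we find?
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))       # approx (0.6, 0.67, 0.67, 0.67)
```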
Perceptron. Computes a weighted sum of the input features (plus a bias); if the sum >= 0 it outputs +1, otherwise it outputs -1. Linear Classifier: the simplest linear model, finds simple boundaries separating +1 from -1. Discriminant: f(x) = w · x + b. Bias: decides which class the node should be pushed to and does not depend on the input value. When w · x = 0, the bias decides which class to predict → it makes the default decision → it biases the classifier towards the positive or the negative class. In the beginning everything is random; after iterating, the weights and biases are gradually shifted so that the next result is closer to the desired output. Error-driven: the perceptron is online and looks at one example at a time. If it is doing well it does not update its parameters (only when an error occurs). Finding (w, b): go through all examples and try the current (w, b); if correct → continue, otherwise → adjust (w, b). Streaming data: data which does not stop coming (recordings from sensors, social media posts, news articles). Online: online learners like the perceptron are good for streaming data. An online algorithm only remembers the current example; it can imitate batch learning by iterating over the data several times in order to extract more information from it. Evaluating online learning: predict the current example → record correct or not → update the model (if necessary) → move to the next example. Always check the error rate, and never evaluate/test on examples which were used as training data. Early stopping: stop training when the error on the validation data stops dropping. When the training error goes down but the validation error goes up → overfitting. Sparsity: a sparse representation omits zero values.
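A sketch of this error-driven loop; the notes only say "adjust (w, b)", so the standard perceptron update (w ← w + y·x, b ← b + y) is assumed here:

```python
# Perceptron training: predict with the current (w, b), update only on errors.
def train_perceptron(examples, labels, epochs=10):
    n_features = len(examples[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):                          # imitate batch learning: several passes
        for x, y in zip(examples, labels):           # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            prediction = 1 if activation >= 0 else -1
            if prediction != y:                      # error-driven: only update on a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b + y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```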
Gradient Descent. A model uses inputs to predict outputs; gradient descent finds the model parameters with the lowest error and is not limited to linear models. Optimization algorithm: how the model learns: model + optimization. Optimization means finding a minimum or maximum of a function. However, optimizing the zero/one loss is hard. An option is to concoct an S-shaped function which is smooth and potentially easier to optimize, but it is not convex. Convex function: looks like a happy face (a valley), easy to minimize. Concave function: looks like a sad face (a hill). Surrogate Loss Functions: convex, non-negative stand-ins for the zero/one loss: hinge loss, logistic loss, exponential loss, squared loss. SSE: Sum of Squared Errors, used for measuring error. Finding w: start with a random value for w → check the slope of the function → descend the slope → adjust w to decrease f(w). First Derivative: if we define f(w) = w², the first derivative is f'(w) = 2w. Slope: describes the steepness in a single dimension; the gradient is the collection of slopes, one for each dimension. To compute: take the first derivative → for a function f the first derivative can be written f' → then f'(a) is the slope of f at point a. Basic Gradient Descent for f(w) = w²: initialize w to some value (e.g. 10) → repeatedly update w ← w − η·f'(w), where the learning rate η controls the speed of descent → stop when w no longer changes. If the learning rate is too big → we get further away from the solution instead of closer.
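Following that recipe for f(w) = w² (the learning rate and stopping tolerance below are illustrative choices, not values from the notes):

```python
# Basic gradient descent on f(w) = w**2, whose derivative is f'(w) = 2*w.
def gradient_descent(f_prime, w=10.0, learning_rate=0.1, tolerance=1e-8):
    while True:
        w_new = w - learning_rate * f_prime(w)   # descend the slope
        if abs(w_new - w) < tolerance:           # stop when w no longer changes
            return w_new
        w = w_new

print(gradient_descent(lambda w: 2 * w))         # converges towards the minimum at w = 0
# With a learning rate that is too big (e.g. 1.5), |w| grows each step instead of shrinking.
```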
Stochastic Gradient Descent: randomized gradient descent, works better with large datasets. Momentum: a modification to SGD which smooths the gradient estimates with memory; large momentum = difficult to change direction; the learning rate itself is not modified. Finding Derivatives: in the general case use symbolic or automatic differentiation → gradients for complicated functions composed of differentiable operations → automatic application of the chain rule (TensorFlow, PyTorch). Local Minima: can get your optimizer trapped; a potential problem for non-linear models (such as neural networks), but not really a problem in high-dimensional data, and in most cases we don't care about local minima. Simplest way to avoid them → restart from a different, more accessible starting point. While searching for the global minimum the model can encounter many 'valleys', whose bottoms we call local minima. Depending on the model, if a valley is deep enough the process might get stuck there and we end up in a local minimum instead of the global one, which means a higher-than-optimal cost. This is not necessarily a big problem in high-dimensional data: the larger the parameter space, the less likely it is that the error function decreases in no direction at all, so there should be fewer local minima.
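One possible reading of SGD with momentum as described above (`grad`, the per-example gradient function, is a hypothetical placeholder):

```python
import random

# SGD with momentum: the velocity is a memory of past gradients that smooths
# the update direction; the learning rate itself is left untouched.
def sgd_momentum(grad, examples, w, learning_rate=0.01, momentum=0.9, epochs=10):
    velocity = 0.0
    for _ in range(epochs):
        random.shuffle(examples)                 # "stochastic": visit examples in random order
        for example in examples:
            g = grad(w, example)                 # noisy gradient from a single example
            velocity = momentum * velocity + g   # large momentum: hard to change direction
            w = w - learning_rate * velocity
    return w
```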
Decision Trees/Forest. Generalization: the ability to view something new in a related way. Goal of induction: take training data, use it to induce a function f, and evaluate f on test data; induction succeeds if performance on the test data is high. Advantages of DT: transparent, easily understandable, fast (no revision needed). Disadvantages of DT: an intricate tree shape that depends on minor details of the data, overfitting; try limiting the depth. Building a DT: the number of possible trees grows exponentially with the number of features, so the tree needs to be built incrementally. Ask the most important questions first, i.e. the ones which help us classify. Left branch → apply the algorithm to the NO examples, right branch → apply the algorithm to the YES examples. Recursion: a function that calls itself until some base case is reached (otherwise it would continue infinitely); base case = leaf node, recursive call = left/right subtree. (Un)balanced Trees: balanced trees are 'better' → faster, since speed depends on the depth of the tree. Prediction time does not depend on the number of questions but on the number of unique combinations. Discretization: use quantiles as thresholds, or choose thresholds present in the data. Measuring Impurity: used to find the best split condition (the quality of a question); splitting stops when no improvement is possible. Entropy I_H(P): a measure of the uniformity of a distribution. More uniform → more uncertainty (and thus the data is not divided enough); the tree tries to minimize uniformity. Gini Impurity I_G(P): measures how often a random element would be labelled incorrectly if labels were assigned randomly. Random Forest: many DTs with features randomly distributed over the different trees; generalizability increases and variance is lower, but interpretability is worse.
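The two impurity measures, assuming the standard Shannon-entropy and Gini formulas over the class proportions at a node:

```python
from math import log2

def entropy(proportions):
    # More uniform class distribution -> higher entropy -> the node is less pure.
    return sum(-p * log2(p) for p in proportions if p > 0)

def gini(proportions):
    # Chance that a random element is mislabelled if labels are assigned at random.
    return 1.0 - sum(p * p for p in proportions)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0 0.5  (maximally impure node)
print(entropy([1.0]), gini([1.0]))             # 0.0 0.0  (pure node)
```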
Feature Engineering. The process of transforming raw data into features that better represent the underlying problem to the predictive model, resulting in improved model accuracy on unseen data; it gets the most out of your data. Algorithms are generic, features are specific; feature engineering is often a major part of machine learning. Categorical features: some algorithms (decision trees/random forests) can use categorical features such as occupation or nationality directly; otherwise → convert them to numerical values. Feature engineering covers extracting features, transforming features and selecting features; it is domain specific, and domain expertise is needed. Common Feature Sources: text, visual, audio, sensors, surveys. Feature transformations: standardizing (z-scoring), log-transform, polynomial features (combining features). Text Features: word counts, word n-gram counts, character n-gram counts, word vectors. MEG: signal amplitude at a number of locations (channels) on the surface of the head, evolving in time. Feature Ablation Analysis: remove one feature at a time → measure the drop in accuracy → quantifies the contribution of that feature, given all the other features. Feature Learning: unsupervised learning of word vectors (LSA, word2vec, GloVe); neural networks can extract features from 'raw' inputs while learning (speech: audio wave, image: pixels, text: byte sequences). Pairwise interactions: linear classifiers need explicit information about joint occurrence. Always consider the expressiveness of your model when engineering features.
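Two of the feature transformations mentioned above, sketched in plain Python (illustrative only):

```python
from collections import Counter

def z_score(values):
    # Standardize a numeric feature column to mean 0 and unit variance.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def char_ngrams(text, n=3):
    # Character n-gram counts, a simple text feature.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(z_score([1.0, 2.0, 3.0, 4.0]))
print(char_ngrams("spam spam").most_common(3))
```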
