Questions And Answers Fully Solved
When to use machine learning - answers When the problem is too complex for
hardcoding rules. When the problem deals with an unstudied phenomenon. When the
problem calls for automating some decision. When the problem changes frequently.
Reasons not to use ML - answers Can't get the right data, or enough data. The problem
does not require learning from data. The problem can be solved in other ways. Using
ML would be unethical.
How can we tell if we need more data? (method) - answers Learning curves. Plot the
training and validation accuracy against the number of training samples. Once the
curves converge, we have enough data.
Normalization - answers Map values onto a range such as [0, 1]
Standardization - answers Rescale feature values to follow a standard normal
distribution
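The two scalings above can be sketched with hypothetical helper functions (the names `normalize` and `standardize` are illustrative, not from the original):

```python
import math

def normalize(values):
    """Min-max normalization: map values onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: rescale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]
```

For example, `normalize([0, 5, 10])` gives `[0.0, 0.5, 1.0]`, while the standardized version of the same list has mean 0 and unit variance.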
T/F Feature scaling helps with most learning algorithms - answers True. Especially for
a linear SVM, you should rescale features.
Visualizing data - answers Before getting too far into data cleansing or training a model,
visualize the training data. It may yield insights into which features to use and what type
of model to choose.
Reinforcement Learning - answers Agent can interact with its environment to perform
actions and get rewards
Multiclass classification - answers Some learning algorithms, such as decision trees,
k-nearest neighbors, and neural networks, naturally support multiclass classification.
For others, such as SVM, we can transform multiclass classification into binary
classification, either one-vs-rest or one-vs-one.
One-vs-one - answers Train c(c-1)/2 binary classifiers, one to classify class i versus
class j. Predict with all of them and return the class with the most votes.
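The one-vs-one voting step can be sketched as follows; `binary_classifiers` is a hypothetical mapping from a class pair (i, j) to a trained classifier that returns either i or j:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_classifiers):
    """Predict with all c(c-1)/2 pairwise classifiers and return the
    class that collects the most votes."""
    votes = Counter()
    for (i, j) in combinations(classes, 2):
        votes[binary_classifiers[(i, j)](x)] += 1
    return votes.most_common(1)[0][0]
```

With c = 3 classes there are 3 pairwise classifiers, matching c(c-1)/2 = 3.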
What makes a nice loss function - answers Continuous, differentiable, strictly convex,
smooth.
T/F Neural networks generally have a convex loss function - answers False
Convex - answers A shape is convex if we can draw a line segment between any two
points in the shape without leaving the shape; if some segment leaves the shape, it is
not convex.
Parameters vs. hyperparameters - answers Model parameters are determined from the
training data through some optimization procedure. Hyperparameters are set by the
machine learning engineer.
If you don't have enough data to leave some aside for validation/testing - answers Use
k-fold cross-validation.
Bias - answers Error due to incorrect assumptions in the model. High bias means
underfitting.
Variance - answers Sensitivity to small variations in the training data. Variance is high
when the model is strongly influenced by a few data points; this is overfitting.
Regularization - answers Reduces model complexity, decreases the degrees of
freedom of the model. Helps with overfitting.
Linear regression - answers Given the feature vector x, predict the target y as
accurately as possible: ŷ = w·x + b, where w is the learned weight vector, x is the input
(feature) vector, and b is the bias term.
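A minimal sketch of fitting w and b by gradient descent on mean squared error, for a single scalar feature (the function name and hyperparameters are illustrative):

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ≈ w*x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE: d/dw = (2/n) * sum((w*x + b - y) * x), etc.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On data generated from y = 2x + 1 this recovers w ≈ 2 and b ≈ 1.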
Mean Squared Error (OLS) - answers Prediction minus actual, squared, then averaged
over all examples.
Mean Absolute Error - answers More robust to outliers, less mathematically convenient
than OLS
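Both metrics are one-liners; a quick sketch (hypothetical helper names) makes the outlier behavior concrete:

```python
def mse(preds, actuals):
    """Mean of the squared errors; squaring amplifies outliers."""
    return sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)

def mae(preds, actuals):
    """Mean of the absolute errors; more robust to outliers."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)
```

For errors of 0 and 2, MSE is 2.0 while MAE is 1.0; a single large error grows MSE quadratically but MAE only linearly.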
Polynomial regression - answers Can add x^2, x^3, etc. as features to capture nonlinear
relationships.
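The feature expansion itself is trivial; a sketch for a single scalar feature (hypothetical helper name):

```python
def polynomial_features(x, degree):
    """Expand a scalar feature x into [x, x**2, ..., x**degree],
    then fit an ordinary linear model on the expanded features."""
    return [x ** d for d in range(1, degree + 1)]
```

For example, `polynomial_features(2, 3)` gives `[2, 4, 8]`; the model stays linear in the weights even though it is nonlinear in x.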
L1 regularization - answers Encourages sparsity in the weights. Also known as Lasso
L2 regularization - answers Encourages shrinking the weights towards 0. Known as
ridge regression.
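Both penalties are just extra terms added to the data loss; a sketch with a hypothetical combined function:

```python
def regularized_mse(preds, actuals, weights, l1=0.0, l2=0.0):
    """MSE plus an optional lasso (L1) penalty on |w| and an optional
    ridge (L2) penalty on w**2."""
    n = len(preds)
    data_loss = sum((p - a) ** 2 for p, a in zip(preds, actuals)) / n
    penalty = (l1 * sum(abs(w) for w in weights)
               + l2 * sum(w * w for w in weights))
    return data_loss + penalty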
Logistic Regression - answers Does binary classification by predicting the probability of
the class label. Uses the sigmoid function. Logits are the raw outputs of the model, and
applying the sigmoid converts them to probabilities.
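The logit-to-probability step can be sketched as (hypothetical helper names):

```python
import math

def sigmoid(z):
    """Squash a raw logit z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Logistic regression: probability of the positive class for
    feature vector x, weights w, bias b."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(logit)
```

A logit of 0 maps to probability 0.5, the decision boundary; large positive logits approach 1 and large negative logits approach 0.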
Binary cross-entropy - answers The loss function of choice for logistic regression.
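A minimal sketch of the binary cross-entropy computation (hypothetical function name; the clipping constant guards against log(0)):

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)
```

A confident correct prediction incurs a small loss; an uncertain one (p = 0.5) costs log 2 per example.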
Softmax - answers Uses the predicted logit score for each class to create a probability
distribution over the labels.
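The softmax computation can be sketched as follows (subtracting the max logit is a standard trick for numerical stability):

```python
import math

def softmax(logits):
    """Turn raw logit scores into a probability distribution."""
    m = max(logits)  # subtract max so exp() never overflows
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs sum to 1, and the largest logit gets the largest probability; equal logits share probability equally.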
SVM - answers Represented as the hyperplane w·x - b = 0. The goal in SVM is to find
the w and b values that create a hyperplane that best separates the positive from the
negative examples.
Hard-margin SVM - answers Assumes the data is linearly separable. The problem is
maximizing the margin, the space between the closest examples of each class, called
the support vectors.
Soft margin - answers Uses hinge loss. Minimizes margin violations.
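The hinge loss used by the soft-margin formulation can be sketched directly (hypothetical helper; labels are in {-1, +1} and scores are w·x - b):

```python
def hinge_loss(scores, labels):
    """Average hinge loss max(0, 1 - y * score). Zero for examples on
    the correct side of the margin; grows linearly with violations."""
    return sum(max(0.0, 1.0 - y * s)
               for s, y in zip(scores, labels)) / len(scores)
```

A correctly classified point outside the margin (score 2, label +1) contributes 0; a point on the boundary contributes 1; a misclassified point at score -1 contributes 2.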
Kernel trick - answers Lets us apply a higher-dimensional transformation implicitly,
only inside the cost function for optimization, without ever computing the transformed
features.
SVM - answers Can perform both linear and nonlinear classification. Can be used for
regression. Well suited to small datasets. Trains slowly.
Decision Tree - answers Nonparametric model suitable for classification or regression.
Decision Tree algorithms differ in ... - answers Stopping criterion, ways to find the best
split, how they are regularized.
When to stop? - answers All examples classified; can't find a feature to split on; a split
does not significantly improve impurity or entropy. Also pruning.
Voting - answers Predict the majority label for classification, or the mean of the
regressors for regression, when we have different types of models.
Bagging - answers Bootstrap aggregating. Train many models of the same type, each
on a randomly sampled subset of the data. Use the statistical mode as the aggregation
function for classification.
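The two halves of bagging, bootstrap sampling and mode aggregation, can be sketched as (hypothetical helper names):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in range(len(data))]

def bagging_predict(predictions):
    """Aggregate the ensemble's class votes with the statistical mode."""
    return Counter(predictions).most_common(1)[0][0]
```

Each model in the ensemble is trained on its own bootstrap sample, and at prediction time `bagging_predict` takes the majority vote of their outputs.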
Random Forests - answers Bagging with Decision Trees. Use a random set of features
to decide on the best split. Bagging doesn't necessarily imply using different subsets of
features.
Boosting - answers Train many weak learners, each one correcting the errors of the
previous one.
Stacking - answers Alternative to voting classifiers/regressors. Train a model (a
meta-model) to do the aggregation instead of averaging or majority voting.
Evaluation metrics for regression - answers Mean Squared Error, Mean Absolute Error,
etc.
Confusion matrix - answers Shows counts of true positives, false positives, false
negatives, and true negatives.
Accuracy - answers (TP + TN) / (TP + TN + FP + FN)
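The accuracy formula as code, straight from the four confusion-matrix counts (hypothetical function name):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct:
    (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with 40 true positives, 50 true negatives, 5 false positives, and 5 false negatives, accuracy is 90/100 = 0.9.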