100% VERIFIED ANSWERS
k-Nearest Neighbor (KNN) CORRECT ANSWER KNN is a classifier that assigns a class to
a data point based on the k data points nearest to it. To find the
class of a new point, you pick the k closest points (neighbors) to the new one. The new
point's class is the most common class among the k neighbors.
The calculation for KNN is straightforward; the main parameters are
how distance is calculated (typically straight-line, i.e., Euclidean) and what the optimal value of k
should be.
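A minimal sketch of the idea in Python (not from the source; the function name knn_predict and the toy data are hypothetical), using straight-line (Euclidean) distance and a majority vote among the k nearest neighbors:

from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    # Straight-line (Euclidean) distance from new_point to every training point
    distances = [(math.dist(p, new_point), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest neighbors
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    # The new point's class is the most common class among those neighbors
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=3))  # prints "A"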
KNN vs SVM CORRECT ANSWER KNN handles problems with more than 2 classes
naturally, while SVM is inherently a two-class method. However, SVM is faster at classifying
new points, since KNN must compare each new point against the stored training data.
Is scaling important in KNN? Why or why not? CORRECT ANSWER Scaling is very
important in KNN since KNN is a distance-based algorithm. Without scaling, a feature
measured on a larger numeric range would have a much larger impact than the others
in determining the distance between data points.
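As a hedged illustration (hypothetical salary/experience numbers; scikit-learn's StandardScaler is one common way to scale), this sketch shows how an unscaled feature dominates the distance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: salary and years of experience
X = np.array([[50_000.0, 2.0], [52_000.0, 9.0], [90_000.0, 3.0]])

# Unscaled: the distance is driven almost entirely by the salary column
print(np.linalg.norm(X[0] - X[1]))  # ~2000; the 7-year experience gap barely registers

# Scaled: both features contribute comparably to the distance
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))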
Is it a good idea for you to use predictions from your training set for model validation?
CORRECT ANSWER No. Predictions made from a training data set are often too
optimistic, since it is likely that your model is picking up random effects present in the
training data. Training data should only be used for training the model, not for
judging how well the model performs.
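A quick sketch of this optimism using scikit-learn (synthetic data, so the exact numbers are illustrative, not from the source):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data, so the model has random effects to latch onto
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# The training score is typically optimistic; the held-out score is the honest one
print("train accuracy:", model.score(X_train, y_train))
print("held-out accuracy:", model.score(X_test, y_test))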
What are the two types of patterns that exist in data? CORRECT ANSWER Real
effects: real relationships between attributes and the response variable
Random effects: patterns that are random but look like real effects
Why does fitting a model on different data sets remove the impacts of random effects?
CORRECT ANSWER Real effects are the same in all data sets. If there is truly a
relationship between two variables, then that effect will always be present even if you
change data sets.
Random effects are different in all data sets. When you change your data set, any
random effects your model picked up in training won't help it when it sees new data with
different random effects.
Are model scores derived from validation data sets typically higher or lower than ones
derived from training data sets? Why or why not? CORRECT ANSWER They are
almost always going to be lower than scores derived from training sets. The predictions
made on a training set contain both real effects and random effects from that data.
When that same model is run on a new validation set, only the model's ability to pick up
real effects should remain.
Training and Validation Sets: What are the objectives of each and which should be
larger or smaller? CORRECT ANSWER Training sets should be larger; they are used
to train and fit the model. Validation sets should be smaller; they are used
for estimating the model's effectiveness.
Training and Validation Sets: Choosing the best model from a group? CORRECT
ANSWER When choosing the best model among a group, you would use the model
score from the validation set to compare results. However, you would not use the score
from the validation set to measure the model's overall accuracy. You would need to run
the model against a third test data set to evaluate its performance.
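A sketch of this workflow in Python (scikit-learn, synthetic data; the candidate models here are just KNN with different k values, chosen for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)

# 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Use the validation score only to pick the best candidate...
models = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train) for k in (1, 5, 15)}
best_k = max(models, key=lambda k: models[k].score(X_val, y_val))

# ...and measure the chosen model's performance on the untouched test set
print("chosen k:", best_k)
print("test accuracy:", models[best_k].score(X_test, y_test))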
Why can't you use the validation score to measure a model's performance when
choosing the best model from a group? CORRECT ANSWER It is likely that the model
that performed the best during validation did so because it happened to be better at
picking up the random effects in your validation set than other models. As a result, the
validation score it produced is probably too optimistic.
Model scores are always a sum of fit to real patterns and fit to random patterns. If
several models are pretty close to each other in how well they pick up real patterns, the
deciding factor often becomes how well they fit random patterns in the validation data.
Training, validation, and test sets CORRECT ANSWER Training: used to train and fit
the models
Validation: used to compare and choose the best model
Test: used to estimate the performance of the chosen model
Note: Validation sets are only used when we are comparing multiple models. If only one
model was built, then we do not need a validation step and just need a Training and
Test set.
Rules of Thumb for Splitting Data CORRECT ANSWER Working with one model:
70%-90% for training and 10%-30% for testing
Comparing models:
50%-70% to training and split the rest equally between validation and testing
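One way to get these splits in practice (a sketch using scikit-learn's train_test_split; the exact percentages are just picked from the ranges above):

from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(200).reshape(-1, 2), np.arange(100)  # hypothetical 100-row data set

# One model: 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Comparing models: 60% train, with the remaining 40% split evenly (20% val / 20% test)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20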
Methods for Splitting Data CORRECT ANSWER 1. Random. Randomly assign points
to different sets
2. Rotation. Take turns selecting points to go to each set.
Random Splitting vs Rotation Splitting CORRECT ANSWER Randomness can give
one set more early or late data, while rotation separates the data evenly. However, rotation
may introduce bias (e.g., with daily data, a 5-data-point rotation could put all Mondays in one set).
You can use a combined approach that takes parts of both.
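A small sketch of both splitting methods on an ordered list of points (hypothetical data; the rotation interval of 4 is arbitrary):

import random

data = list(range(20))  # 20 data points in time order

# 1. Random splitting: shuffle, then carve off a fraction
shuffled = data[:]
random.Random(0).shuffle(shuffled)
rand_train, rand_test = shuffled[:15], shuffled[15:]

# 2. Rotation splitting: take turns, e.g. every 4th point goes to the test set
rot_test = data[3::4]
rot_train = [x for x in data if x not in set(rot_test)]

print(sorted(rand_test))  # scattered by chance; could cluster early or late
print(rot_test)           # evenly spaced: [3, 7, 11, 15, 19] -- this regularity is the bias risk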
What is cross validation and what problem does it solve? CORRECT ANSWER Cross
validation rotates samples between the training set and the validation set. This solves the
problem of important data points never appearing in the training set because they landed
only in the validation/test set. By rotating the splits, every data point is incorporated
into the model's training at least once.
k-Fold Cross Validation CORRECT ANSWER 1. We split the data into two groups: a test
set and a combined training + validation set
2. We then split the combined training + validation set into k groups
3. Taking the k groups, we take turns assigning one of the groups to be the validation
set and the remaining groups to be the training set. We do this k times so that each
group takes a turn as the validation set, and every data point is used to train exactly
k-1 of the k models
4. The average score across all k configurations becomes the overall validation score for
the model
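These steps map directly onto scikit-learn's KFold (a sketch with synthetic data; k=5 and the KNN model are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Step 1: hold out a test set; the rest is the combined training + validation set
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 2-3: split the combined set into k groups and rotate which one validates
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tv):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_tv[train_idx], y_tv[train_idx])
    scores.append(model.score(X_tv[val_idx], y_tv[val_idx]))

# Step 4: the average across the k folds is the model's validation score
print("validation score:", np.mean(scores))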
After doing k-Fold Cross Validation, how do you decide which model performed best?
CORRECT ANSWER You take the average model score for each model you did k-Fold
Cross Validation on and choose the model that had the highest average. Once you
decide which model is the best, you build the final model using the combined training +
validation set and get a final model score using the test set.
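A sketch of that selection step (cross_val_score averages the k fold scores; the candidate k values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Average k-fold score for each candidate model
avg = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tv, y_tv, cv=5).mean()
       for k in (1, 5, 15)}
best_k = max(avg, key=avg.get)

# Rebuild the winner on the full training + validation set, then score it on the test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tv, y_tv)
print("best k:", best_k, "final test score:", final.score(X_test, y_test))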
Can you average coefficients across k splits to get your final model from k-Fold Cross
Validation? CORRECT ANSWER No. The coefficients from different folds come from models
fit on different subsets of the data, so their average does not correspond to a model that
was actually fit to any data set. You get your final model coefficients by building the
model on the combined training + validation set after you have selected which
model type performs best in validation.
Benefits of k-Fold Cross Validation CORRECT ANSWER Better use of data
Better estimate of model quality
Choose model more effectively
Prevents one model from benefitting more than another model from randomness in the
validation set.
Models being trained don't miss out on any important data points that may only be
present in the validation set.