Final exam QUESTIONS ANSWERS & RATIONALES.
ISYE 6501 - Midterm 1
A large value of K will lead to
a large variance in predictions
Setting a large value of k will ...
lead to a large model bias.
What are real effects?
Real relationships between attributes and responses. They are the same in all data sets.
What are random effects?
They are random but look like real effects. They are different in all data sets.
Why can't we measure a model's effectiveness on data it was trained on?
The model's performance on its training data is usually too optimistic. The model is fit to both real and
random patterns in the data, so it becomes overly specialized to the specific randomness in the training
set, which doesn't exist in other data.
If we use the same data to fit a model as we do to estimate how good it is, what is likely to happen?
The model will appear to be better than it really is.
The model will be fit to both real and random patterns in the data. The model's effectiveness on this
data set will include both types of patterns, but its true effectiveness on other data sets (with different
random patterns) will only include the real patterns.
When comparing models, if we use the same data to pick the best model as we do to estimate how
good the best one is, what is likely to happen?
The model will appear to be better than it really is.
The model with the highest measured performance is likely to be both good and lucky in its fit to
random patterns.
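A small simulation can make the "good and lucky" effect concrete. The sketch below assumes a hypothetical setup that is not from the notes: 20 models that all share the same true accuracy of 0.70, each scored on 100 simulated validation points. It simplifies by drawing each model's validation results independently. The names and numbers are illustrative only.

```python
import random

TRUE_ACCURACY = 0.70  # assumption: every model is truly equally good

def measured_accuracy(rng, n_points):
    """Fraction correct on n_points; each point is right with prob. TRUE_ACCURACY."""
    correct = sum(1 for _ in range(n_points) if rng.random() < TRUE_ACCURACY)
    return correct / n_points

rng = random.Random(0)
# Score 20 hypothetical models on 100 validation points each.
validation_scores = [measured_accuracy(rng, 100) for _ in range(20)]
best_model = max(range(20), key=lambda i: validation_scores[i])
best_validation_score = validation_scores[best_model]
# Re-score the chosen model on fresh test data that played no part in the choice.
test_score = measured_accuracy(rng, 100)

print("average validation score:", sum(validation_scores) / 20)
print("best validation score:   ", best_validation_score)
print("test score of best model:", test_score)
```

Because the best score is the maximum of 20 noisy measurements, it will typically sit above 0.70 even though no model is truly better. The fresh test score carries no such selection boost.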
What is a training set used for?
It is used to fit the models.
What is a validation set used for?
It is used to choose the best model.
Why would we use two sets?
We use two different sets because the first set, the training set, may have unique random effects
that the classifier was fit to. By measuring effectiveness on a separate validation set, we don't count
the benefits of fitting those random effects.
What effects does randomness have on training/validation performance?
Sometimes the randomness will make the performance look worse than it really is, and sometimes the
randomness will make the performance look better than it really is.
How are high-performing models affected by randomness?
They are often boosted by above-average random effects, which make them look better than they really are.
What is a test data set used for?
To estimate the performance of the chosen model.
When do we need a validation set?
When we are choosing between multiple models.
What are the data splits when working with one model?
70-90% training, 10-30% test
What are the data splits when comparing models?
50-70% training, split the rest between validation and test
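As a rough illustration of a random split when comparing models, the sketch below uses a 60/20/20 split, which falls inside the ranges above. The function name `split_data` and its parameters are hypothetical, chosen for this example.

```python
import random

def split_data(points, train_frac=0.6, validation_frac=0.2, seed=0):
    """Randomly split points into training, validation, and test lists."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)      # random assignment of points
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * validation_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains
    return training, validation, test

training, validation, test = split_data(range(100))
print(len(training), len(validation), len(test))
```

With a single model, the same function could be called with `validation_frac=0` and a larger `train_frac`, leaving the remainder as the test set.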
What are two methods of splitting data?
random and rotation
What is the rotation method of splitting data?
You take turns selecting points.
5 data point rotation sequence: (Training - Validation - Training - Test - Training)
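The rotation sequence can be sketched in a few lines. This is a minimal illustration that assumes the points are already in a meaningful order (for example, by time). The name `rotation_split` is hypothetical.

```python
from itertools import cycle

# Rotation order from the notes: Training, Validation, Training, Test, Training.
ROTATION = ["training", "validation", "training", "test", "training"]

def rotation_split(points):
    """Assign points to sets by taking turns in the ROTATION order."""
    sets = {"training": [], "validation": [], "test": []}
    for point, name in zip(points, cycle(ROTATION)):
        sets[name].append(point)
    return sets

sets = rotation_split(list(range(10)))
print(sets)
```

Every run of five consecutive points contributes three to training and one each to validation and test, so each stretch of the data is represented in all three sets.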
What is the advantage of rotation over randomness?
We make sure each part of the data is equally separated.
What is the disadvantage of using rotation?
We have to make sure we aren't creating some other type of bias when we assign points.
What is k-fold cross validation?
Split the training/validation data into k parts; for each part in turn, we train on the other k-1 parts and validate on that part.
What metric do you use for k-fold cross validation when comparing models?
The average of all k evaluations.
What do we use when important data only appears in the validation or test sets?
cross-validation
What do we do after we've performed cross-validation?
We train the model again using all the data.
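The whole procedure (split into k parts, train on k-1, validate on the remaining part, average the k evaluations, then retrain on all the data) might look like the sketch below. The toy "model" that predicts the mean of its training values, and the helper names `fit_mean` and `mean_squared_error`, are assumptions made for illustration. Real data would normally be shuffled before being cut into folds.

```python
def k_fold_cross_validation(data, k, fit, score):
    """Return the average validation score over k folds."""
    folds = [data[i::k] for i in range(k)]   # k roughly equal parts
    scores = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(training)                # train on the other k-1 parts
        scores.append(score(model, validation))
    return sum(scores) / k                   # average of all k evaluations

# Hypothetical toy model: predict the mean of the training values.
def fit_mean(values):
    return sum(values) / len(values)

def mean_squared_error(model, values):
    return sum((v - model) ** 2 for v in values) / len(values)

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
cv_error = k_fold_cross_validation(data, k=3, fit=fit_mean, score=mean_squared_error)
final_model = fit_mean(data)  # after cross-validation, refit on all the data
print(cv_error, final_model)
```

The cross-validation error is used only to judge or compare models. The model that is kept is the one refit on the full data set.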
What are the benefits of k-fold cross validation?
Better use of data, a better estimate of model quality, and more effective model selection.
What can clustering be used for?
grouping data points (e.g., market segmentation) and discovering groups in data points (e.g.,
personalized medicine)
Which should we use most of the data for: training, validation, or test?
training
In k-fold cross-validation, how many times is each part of the data used for training, and for
validation?
k-1 times for training, and 1 time for validation
What is rectangular distance useful for?
Calculating driving distance when the city is laid out in a grid.
What is the value of p for Euclidean distance?
2
What is the general equation for p-norm distance?
distance(x, y) = ( sum over i of | x_i - y_i |^p )^(1/p)
Straight-line distance corresponds to which distance metric?
2-norm
How do you find the distance of an infinity norm?
You find the largest | x_i - y_i |
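The distance metrics above can be compared with one small function. This sketch computes the p-norm distance, (sum of |x_i - y_i|^p)^(1/p), and treats the infinity norm as the largest |x_i - y_i|. The function name `p_norm_distance` is chosen for this example.

```python
import math

def p_norm_distance(x, y, p):
    """Distance between points x and y under the p-norm (p >= 1 or math.inf)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)                    # infinity norm: largest |x_i - y_i|
    return sum(d ** p for d in diffs) ** (1 / p)

a, b = (0, 0), (3, 4)
print(p_norm_distance(a, b, 1))         # rectangular (1-norm)
print(p_norm_distance(a, b, 2))         # Euclidean, straight-line (2-norm)
print(p_norm_distance(a, b, math.inf))  # infinity norm
```

For the points (0, 0) and (3, 4), the three metrics give 7, 5, and 4 respectively: the grid route, the straight line, and the largest single-coordinate gap.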
What is a centroid?
The center of a cluster.
What are the steps of k means?
0. Pick k cluster centers within the range of the data.
1. Assign each data point to nearest cluster center
2. Recalculate cluster centers (centroids)
3. Repeat 1 and 2 until no changes
How do we find the cluster centers?
We take the mean of all the data points in cluster.
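The steps above can be sketched as a short function. This is a minimal illustration, not a production implementation: it assumes the caller supplies the starting centers, uses Euclidean distance via `math.dist` (Python 3.8+), and keeps a center unchanged if its cluster ends up empty.

```python
import math

def k_means(points, initial_centers, max_iterations=100):
    """Basic k-means; returns (centers, assignments)."""
    centers = [tuple(c) for c in initial_centers]            # step 0
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign each point to its nearest cluster center.
        new_assignments = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignments == assignments:                    # step 3: no changes
            break
        assignments = new_assignments
        # Step 2: recalculate each center as the mean of its points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignments

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignments = k_means(points, initial_centers=[(0, 0), (0, 1)])
print(centers, assignments)
```

Even with both starting centers placed in the same corner, the assign-then-recalculate loop separates the two groups in this small example. On harder data, the result can depend on the starting centers.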
Why is k-means an expectation-maximization algorithm?
Finding the mean of all the points in a cluster is similar to finding an expectation.
Assigning data points to cluster centers is the maximization step. Really we are minimizing, but we could
think of it as maximizing the negative of the distance to a cluster center.
What are some of the consequences of outliers in k-means?
An outlier will drag its cluster center artificially to one side.
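A tiny numeric example shows how far one outlier can move a center. The points below are made up for illustration.

```python
def centroid(points):
    """Mean of the points, coordinate by coordinate."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

cluster = [(1, 1), (2, 2), (3, 3)]
print(centroid(cluster))               # center of the three nearby points
print(centroid(cluster + [(30, 30)]))  # same cluster plus one outlier
```

The center moves from (2, 2) to (9, 9), a location that is not close to any of the original three points.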
Because k-means is a heuristic and thus fast, what can we do?
Run it several times with different initial cluster centers and choose the best result. We can also try
different values of k.
How does bias/variance change as k changes in KNN?
The higher the k, the higher the bias; the lower the k, the higher the variance. When k = 1, the model is at its
most complex and is thus likely to overfit the data.
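The two extremes of k can be seen in a minimal k-nearest-neighbors sketch. The training points and labels are invented for illustration, and `knn_predict` is a hypothetical helper rather than a library function.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """training: list of (point, label). Majority label among the k nearest points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
            ((5, 5), "B"), ((6, 5), "B")]

# k = 1: every training point is its own nearest neighbor.
k1 = [knn_predict(training, p, 1) for p, _ in training]
# k = all points: every prediction is the overall majority label.
k_all = [knn_predict(training, p, len(training)) for p, _ in training]
print(k1)
print(k_all)
```

With k = 1 the model reproduces its training labels exactly, which is the overfitting extreme. With k equal to the number of points it ignores the query entirely and always answers "A", which is the high-bias extreme.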
How do we find the best value of k in k means?
Elbow method: for each value of k, we calculate the total distance from each data point to its cluster center and
plot it against k in two dimensions. We look for the kink in the graph.
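To see where a kink comes from, the sketch below computes the total distance for k = 1 to 4 on made-up one-dimensional data with three obvious groups. It uses a deliberately simple 1-D k-means with evenly spaced starting centers, which is an assumption made to keep the example deterministic.

```python
def total_distance_1d(values, k, max_iterations=100):
    """Run a simple 1-D k-means and return the total distance to the centers."""
    data = sorted(values)
    n = len(data)
    # Assumption: start from k evenly spaced points of the sorted data.
    if k == 1:
        centers = [data[n // 2]]
    else:
        centers = [data[round(i * (n - 1) / (k - 1))] for i in range(k)]
    assignments = None
    for _ in range(max_iterations):
        new = [min(range(k), key=lambda c: abs(x - centers[c])) for x in data]
        if new == assignments:
            break
        assignments = new
        for c in range(k):
            members = [x for x, a in zip(data, assignments) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return sum(abs(x - centers[a]) for x, a in zip(data, assignments))

values = [0, 1, 2, 10, 11, 12, 20, 21, 22]
totals = {k: total_distance_1d(values, k) for k in range(1, 5)}
for k, total in totals.items():
    print(k, round(total, 2))
```

The total distance falls steeply up to k = 3 and barely improves at k = 4, so the kink sits at k = 3, matching the three groups in the data.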
When clustering for prediction how do we choose the prediction?
When we see a new point, we just choose whichever cluster center is closest.
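Prediction with clusters reduces to a nearest-center lookup. In this sketch the two centers are hypothetical values standing in for the output of a finished k-means run, and distance is Euclidean.

```python
import math

def predict_cluster(point, centers):
    """Index of the cluster center closest to the new point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

centers = [(0.0, 0.5), (10.0, 10.5)]  # hypothetical centers from a finished k-means run
print(predict_cluster((1, 2), centers))
print(predict_cluster((9, 9), centers))
```

The first new point is assigned to cluster 0 and the second to cluster 1, since those are the nearest centers.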
What is the difference between classification and clustering?
With classification models, we know each data point's attributes and we already know the right
classification for the data points (supervised). In clustering (unsupervised) we know the attributes but
we don't know what group any of these data points are in.
What is the difference between supervised learning and unsupervised learning?