CORRECT ANSWERS
We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is
necessarily better than the other one because a smaller value of k is always better for clustering.
✅✅CORRECT ANSWER: False
The following chart shows the within-cluster sum of square errors versus the number of clusters in a
k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering?
✅✅CORRECT ANSWER: 4
To answer a question like this, pick the value on the x-axis where the "elbow" of the curve sits, i.e., the point after which the decrease in the within-cluster sum of squared errors (WSS) slows down noticeably.
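For reference, this is roughly how the WSS-versus-k curve behind such a chart can be produced. A minimal sketch assuming scikit-learn and a placeholder feature matrix X; the random data and the 1-10 range of k are made up for illustration:
```python
# Minimal elbow-method sketch (assumes scikit-learn and a numeric feature matrix X).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data; replace with your own features

wss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(model.inertia_)  # inertia_ is the within-cluster sum of squared errors (WSS)

# Plot k versus WSS and pick the k where the curve bends (the "elbow"), e.g. k = 4 in the chart above.
for k, err in zip(range(1, 11), wss):
    print(k, round(err, 1))
```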
In the k-means clustering technique, the desired number of clusters (k) is a number that is
determined in the middle of the algorithm by calculating the model error. ✅✅CORRECT ANSWER: False
Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm. ✅✅CORRECT ANSWER: False
With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted? ✅✅CORRECT ANSWER: Average of the neighbors
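A minimal sketch of that prediction rule, assuming already-normalized numeric features; the toy arrays X_train, y_train, and x_new are made up for illustration:
```python
# Sketch: predicting a numerical target with k-NN by averaging the k nearest neighbors.
import numpy as np

X_train = np.array([[0.2, 0.4], [0.3, 0.5], [0.9, 0.8], [0.1, 0.3]])
y_train = np.array([10.0, 12.0, 30.0, 9.0])
x_new = np.array([0.25, 0.45])
k = 3

dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every training record
nearest = np.argsort(dists)[:k]                        # indices of the k closest records
prediction = y_train[nearest].mean()                   # numerical target = average of the neighbors
print(prediction)
```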
Which statement is INCORRECT about the k-means clustering algorithm? ✅✅CORRECT ANSWER: The algorithm starts with initial centroids that are determined by a distance function
What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to 1 decimal.
Record 1: Age = 25, Credit Score = 550, Children = 3, Savings = 4.6
Record 2: Age = 30, Credit Score = 540, Children = 2, Savings = 7.2
✅✅CORRECT ANSWER: 11.5
Euclidean distance = sqrt((25 - 30)^2 + (550 - 540)^2 + (3 - 2)^2 + (4.6 - 7.2)^2)
= sqrt((-5)^2 + (10)^2 + (1)^2 + (-2.6)^2)
= sqrt(25 + 100 + 1 + 6.76)
= sqrt(132.76)
≈ 11.5
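The same computation as a small sketch in plain Python; the field names are just labels for the values above:
```python
# Sketch of the distance computation above (no normalization).
import math

record1 = {"Age": 25, "CreditScore": 550, "Children": 3, "Savings": 4.6}
record2 = {"Age": 30, "CreditScore": 540, "Children": 2, "Savings": 7.2}

dist = math.sqrt(sum((record1[f] - record2[f]) ** 2 for f in record1))
print(round(dist, 1))  # 11.5
```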
The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex
data patterns. ✅✅CORRECT ANSWER: False
Which statement is INCORRECT about clustering? ✅✅CORRECT ANSWER: Clustering is useful for predicting association rules
Before computing the distance between two data records, we should normalize the numerical
variables to prevent variables with large scales from having an undue effect. ✅✅CORRECT ANSWER: True
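A minimal sketch of min-max normalization before a distance calculation, assuming scikit-learn; the three example records are made up for illustration:
```python
# Sketch: min-max normalization before computing distances (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 550, 3, 4.6],
              [30, 540, 2, 7.2],
              [45, 700, 0, 1.3]])   # Age, Credit Score, Children, Savings

X_scaled = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
# After scaling, Credit Score no longer dominates the Euclidean distance just because
# it is measured on a much larger scale than Children or Savings.
print(np.round(X_scaled, 2))
```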
Which statement is INCORRECT about choosing the number of clusters in the k-means clustering
method? ✅✅CORRECT ANSWER: Maximizing the within-cluster sum of squared errors (WSS) is the goal when selecting k
Which statement is INCORRECT about k-NN predictive models? ✅✅CORRECT ANSWER: Larger values of k increase the risk of overfitting
k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or
numerical targets. ✅✅CORRECT ANSWER: True
What statement is correct about the k-nearest neighbor (k-NN) method? ✅✅CORRECT ANSWER: The value of k can control model overfitting and underfitting
In k-nearest neighbor models, increasing the value of k leads to overfitting. ✅✅CORRECT ANSWER: False
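A minimal sketch of how k controls over- and underfitting, assuming scikit-learn; the synthetic dataset and the tested k values are arbitrary choices for illustration:
```python
# Sketch: comparing training and test accuracy of k-NN for different k.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Small k: near-perfect training accuracy but weaker test accuracy (overfitting).
    # Very large k: both scores drop as the model becomes too smooth (underfitting).
    print(k, round(model.score(X_tr, y_tr), 2), round(model.score(X_te, y_te), 2))
```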
You are asked to use a large data set of customers to predict how many days after their first purchase they will make their second purchase. You can do this by developing a classification model.
✅✅CORRECT ANSWER: False (the target is numerical, so this calls for a regression model, not classification)
Categorical variables can NOT be used as predictors in the linear regression model. ✅✅CORRECT ANSWER: False
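A minimal sketch of using a categorical predictor in linear regression via dummy (one-hot) variables, assuming pandas and scikit-learn; the column names and values are made up for illustration:
```python
# Sketch: a categorical predictor in linear regression via dummy (one-hot) variables.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "income": [40, 55, 62, 48],
    "region": ["east", "west", "west", "east"],  # categorical predictor
    "spend":  [12, 20, 22, 15],                  # numerical target
})

X = pd.get_dummies(df[["income", "region"]], drop_first=True)  # region becomes a 0/1 dummy column
model = LinearRegression().fit(X, df["spend"])
print(model.coef_, model.intercept_)
```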