DSCI 4520 EXAM 1 SECTION 2: QUESTIONS WITH COMPLETE SOLUTIONS
Which statement is INCORRECT about choosing the number of clusters in the k-means
clustering method?
A. Maximizing the within-cluster sums of squared errors (WSS) is the goal when
selecting k
B. Sometimes business considerations impose constraints on the value of k
C. The ability to build useful profiles from the cluster centroids helps us select a suitable
value of k
D. Similar analyses can be used to inform our decision about a suitable value of k -
Answer-Maximizing the within-cluster sums of squared errors (WSS) is the goal when
selecting k
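Note: WSS measures how tightly points sit around their cluster centroid, so lower is better, and it reaches zero when every record is its own cluster. The sketch below illustrates this on a small made-up data set using a basic Lloyd's k-means written from scratch. The points, the helper names (sq_dist, mean_point, wss_for_k), and the seed are all assumptions for illustration, not part of the course material.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def wss_for_k(points, k, iters=25, seed=0):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical data: three loose groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]

wss = {k: wss_for_k(points, k) for k in range(1, len(points) + 1)}
for k, value in wss.items():
    print(k, round(value, 3))
```

Because WSS can always be driven to zero by setting k equal to the number of records, it cannot be the only selection criterion. In practice one looks for the point where adding clusters stops reducing WSS by much, together with the business and profiling considerations listed in the options.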
k-nearest neighbor (k-NN) is a supervised method that can be used for predicting
categorical or numerical targets.
True
False - Answer-True
In k-nearest neighbor models, increasing the value of k leads to overfitting.
True
False - Answer-False
With the k-NN model for a numerical target, after we have determined the k nearest
neighbors of a new data record, how is the target value predicted?
A. Majority vote determines the predicted class
B. Average of the neighbors
C. Through a logistic regression between the neighbors
D. Through a linear combination of neighbors - Answer-Average of the neighbors
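Note: for a numerical target, the prediction is the mean of the target values of the k closest training records. A minimal sketch, assuming Euclidean distance and a tiny invented training set (the function name knn_predict and the data are illustrative only):

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict a numerical target as the average over the k nearest neighbors."""
    # Rank training records by Euclidean distance to the query point.
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    neighbors = order[:k]
    return sum(train_y[i] for i in neighbors) / k

# Hypothetical one-feature training data.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

# The three nearest records to 2.2 are x = 2, 3 and 1,
# so the prediction is (20 + 30 + 10) / 3.
prediction = knn_predict(train_X, train_y, (2.2,), k=3)
print(prediction)
```

For a categorical target, the same neighbor search is used but the average is replaced by a majority vote.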
Which statement is correct about the k-nearest neighbor (k-NN) method?
A. Underfitted k-NN models can be fixed by adding a dummy variable for accuracy
B. Logistic regression is a special case of k-NN
C. The value of k can control model overfitting and underfitting
D. Overfitted k-NN models can be fixed by decreasing k - Answer-The value of k can
control model overfitting and underfitting
Which statement is INCORRECT about k-NN predictive models?
A. Larger values of k increase the risk of over-fitting
B. When k=n (number of data records) the k-NN and the universal average methods are
the same
C. k-NN is sensitive to irrelevant features
D. Finding the optimum value of k can be computationally expensive - Answer-Larger
values of k increase the risk of over-fitting
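Note: the two extremes of k show why small k, not large k, carries the overfitting risk. With k=1 the model reproduces every training target exactly (it memorizes the data), and with k=n every prediction collapses to the overall average. The sketch below checks both cases on invented data. The function name knn_average and the numbers are assumptions for illustration.

```python
import math

def knn_average(train_X, train_y, query, k):
    """k-NN prediction for a numerical target: mean of the k nearest targets."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return sum(train_y[i] for i in order[:k]) / k

# Hypothetical training data with distinct feature values.
X = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y = [3.0, 9.0, 4.0, 12.0, 7.0]

# k = n: every record is a neighbor, so the prediction is the universal average.
global_mean = sum(y) / len(y)
pred_k_n = knn_average(X, y, (2.5,), len(X))

# k = 1: each training record is its own nearest neighbor, so the model
# returns the training targets unchanged (zero training error, overfitted).
preds_k1 = [knn_average(X, y, x, 1) for x in X]

print(pred_k_n, global_mean)
print(preds_k1)
```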
When we are building a linear regression model, against what model do we compare it
to evaluate its significance?
Naïve (average) model
Logistic model
Classification model
Random model - Answer-Naïve (average) model
In a linear regression model, the t-test for each predictor's coefficient indicates whether the
estimated value is significantly different from zero.
True
False - Answer-True
In the development of a linear regression model, what is the naive (baseline) model that
we compare the performance of the linear model with?
Simple linear model
Average model
Multiple linear model
Random guess - Answer-Average model
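Note: the comparison with the average model is what R-squared expresses. SST is the squared error of always predicting the mean of the target, SSE is the squared error of the regression, and R-squared = 1 - SSE/SST is the share of the baseline error that the regression removes. A minimal sketch for one predictor, using invented data and closed-form least squares:

```python
# Hypothetical data: one predictor x and a numerical target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form ordinary least squares for a single predictor.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Baseline (average model): always predict y_bar.
sst = sum((yi - y_bar) ** 2 for yi in y)

# Linear model: predict intercept + slope * x.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / sst
print(round(slope, 3), round(intercept, 3), round(r_squared, 4))
```

An R-squared near 0 means the regression does little better than the average model, and a value near 1 means it removes most of the baseline error.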
In the following scatter plot matrix, Price is the target variable. What predictor shows the
strongest negative correlation with Price?
CC
HP
Age_08_04
Weight - Answer-Age_08_04
The following report shows Excel output for a linear regression model. What can the
p-value of the F-statistic tell us?
A. If this p-value is less than our significance level then the coefficients are significant
B. If this p-value is larger than our significance level then the coefficients are significant
C. If this p-value is larger than our significance level then the model as a whole is
significant
D. If this p-value is less than our significance level then the model as a whole is
significant - Answer-If this p-value is less than our significance level then the model as a
whole is significant
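Note: the F-statistic compares the variation explained by the model with the unexplained variation, F = (SSR / p) / (SSE / (n - p - 1)), where p is the number of predictors. The sketch below computes it by hand for a one-predictor model on invented data, alongside the t-statistic of the slope. With a single predictor the two tests agree, since F equals t squared. The data and variable names are assumptions. Turning F into a p-value needs the F distribution (for example scipy.stats.f.sf, not used here to keep the sketch dependency-free).

```python
import math

# Hypothetical data: one predictor x and a numerical target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n, p = len(x), 1  # p = number of predictors

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)           # total variation
sse = sum((yi - (intercept + slope * xi)) ** 2     # unexplained variation
          for xi, yi in zip(x, y))
ssr = sst - sse                                    # explained variation

mse = sse / (n - p - 1)                 # residual mean square
f_stat = (ssr / p) / mse                # test of the model as a whole
t_slope = slope / math.sqrt(mse / sxx)  # test of the slope coefficient

print(round(f_stat, 2), round(t_slope, 2))
```

A large F (small p-value) indicates that the model as a whole outperforms the average model. The t-tests then indicate which individual coefficients differ significantly from zero.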
We have developed two different linear regression models on the same data set. Which
model shows a better goodness-of-fit?
Not enough information
Models are the same
Model B
Model A - Answer-Model A