DATA MINING EXAM QUESTIONS AND ANSWERS 2025
A k-nearest neighbour model ___ - Answer-Is one of the models in IBM SPSS
Modeler and SAS EM
May be referred to as Case Based Reasoning
Evaluates on k neighbours
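As a rough illustration of "evaluates on k neighbours", here is a minimal sketch of a k-nearest neighbour classifier in plain Python. The function name knn_predict, the stored cases, and the choice of Euclidean distance with a majority vote are assumptions made for this example; they are not taken from IBM SPSS Modeler or SAS EM.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored cases."""
    # `train` is a list of (feature_tuple, label) pairs; distance is Euclidean.
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored cases: (features, class label)
cases = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"),
]

print(knn_predict(cases, (1.1, 1.0), k=3))
```

The model stores the training cases and defers all work to prediction time, which is why the approach is sometimes described as case-based reasoning.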
One method to possibly reduce the dimensionality of a supervised model is ___ -
Answer-To use principal components with eigenvalues > 1
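The eigenvalue > 1 rule can be sketched with NumPy as follows. The data are randomly generated for illustration only (four hypothetical predictors, two of them nearly identical), and the eigenvalues are taken from the correlation matrix, which is an assumption about how the principal components are computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 4 predictors; x2 is nearly a copy of x1.
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
x4 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3, x4])

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Keep only the components whose eigenvalue exceeds 1.
keep = eigenvalues > 1.0
print(eigenvalues.round(3), int(keep.sum()))
```

The eigenvalues of a correlation matrix sum to the number of predictors and are never negative, so a component with an eigenvalue above 1 explains more variance than a single standardized predictor would on its own.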
One should ___ for clustering - Answer-c & d
Clustering algorithms seek to create clusters such that the ___ is large compared to
the ___ - Answer-Between-cluster variation, within-cluster variation
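A small sketch of the two quantities, using sums of squares on one-dimensional data. The function name variation_split and the two example clusters are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def variation_split(clusters):
    """Return (between, within) sums of squares for 1-D clusters."""
    all_points = [x for cluster in clusters for x in cluster]
    grand_mean = mean(all_points)
    # Between-cluster: how far each cluster centre is from the overall mean.
    between = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in clusters)
    # Within-cluster: how far each point is from its own cluster centre.
    within = sum((x - mean(c)) ** 2 for c in clusters for x in c)
    return between, within

# Two hypothetical, well-separated clusters
tight = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
between, within = variation_split(tight)
print(between, within)  # 150.0 4.0
```

Here the between-cluster variation (150) dwarfs the within-cluster variation (4), which is the pattern a good clustering aims for.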
Clustering requires standardization. One way to do standardization is using the min-
max equation (value of interest - minimum)/Range. Given a particular predictor has a
mean of 10, minimum value of 8 and a maximum value of 12, provide the range and
use the min-max equation to standardize the following two values.
Standardize the value 9:
Standardize the value 11: - Answer-Range = 12 - 8 = 4
9 standardizes to (9-8)/(12-8) = 1/4 = 0.25
11 standardizes to (11-8)/(12-8) = 3/4 = 0.75
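The min-max calculation above can be checked with a few lines of Python. The function name min_max is chosen for this example. Note that the mean of 10 given in the question is not needed by the formula.

```python
def min_max(value, minimum, maximum):
    """Rescale `value` to the 0-1 interval using (value - minimum) / range."""
    value_range = maximum - minimum
    return (value - minimum) / value_range

print(min_max(9, 8, 12))   # 0.25
print(min_max(11, 8, 12))  # 0.75
```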
clustering is in the __ category of data mining - Answer-pg 12 of textbook
When referring to Kohonen/SOM clustering models, SOM is - Answer-Self-organizing map
the most common clustering method is ___ - Answer-k-means
What is a major difference between the data mining tasks of clustering and
classification - Answer-classification is a supervised data mining task, whereas
clustering is an unsupervised data mining task
A leaf node of a decision tree - Answer-is not further split into additional branches
one common splitting method for decision trees is - Answer-Gini
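As a sketch of how a Gini-based split might be scored, the snippet below computes Gini impurity for a node and the size-weighted impurity of a candidate split. The function names and the yes/no labels are invented for illustration; real tools may differ in detail.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a two-way split, weighted by the size of each branch."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node with 5 "yes" and 5 "no" records, and one candidate split
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(gini(parent))                       # 0.5
print(round(split_gini(left, right), 3))  # 0.32
```

A split is attractive when the weighted impurity of the branches is lower than the impurity of the parent node, as it is here.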
Creating classification decision tree models will - Answer-generally use a single
categorical target variable
An advantage of using a decision tree model would be - Answer-that a decision tree
generates rules that can be easily explained and implemented
which is true of creation of a decision tree? - Answer-Each variable is evaluated at
each node to determine the splitting variable
The same variable may be used for splitting at different locations in the decision tree
Entropy reduction (information gain) is a common splitting method
A stopping criterion in creating a decision tree is when the tree has pure leaves
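Entropy reduction (information gain) and the idea of a pure leaf can be sketched as follows. The function names and the example labels are assumptions made for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no" records
parent = ["yes"] * 5 + ["no"] * 5
pure_split = [["yes"] * 5, ["no"] * 5]            # both leaves are pure
useless_split = [["yes", "no"] * 2 + ["yes"],     # classes still mixed
                 ["yes", "no"] * 2 + ["no"]]

print(information_gain(parent, pure_split))
print(round(information_gain(parent, useless_split), 3))
```

At each node, every candidate variable would be scored this way and the one with the largest gain chosen. A pure leaf has zero entropy, so no further split can improve it.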
a common name used for association analysis is ___ - Answer-A & d
Care must be used with association analysis because - Answer-It is easy to generate
many rules, many of which may be useless
Two data formats for input into association analysis software are referred to as __ or
__ - Answer-Tabular, transactional
The confidence for the association rule: if milk, then orange juice, and the confidence
for the association rule: if orange juice, then milk, ____ - Answer-see ppt slide 12
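The slide's answer is not reproduced here, but the way confidence is computed for each direction of a rule can be sketched. The baskets below are made-up transactional data, and the function name confidence is an assumption for this example.

```python
def confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = count(A and B) / count(A)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# Hypothetical market baskets in transactional form
baskets = [
    {"milk", "orange juice"},
    {"milk", "orange juice"},
    {"milk", "bread"},
    {"milk"},
    {"orange juice", "bread"},
]

print(confidence(baskets, "milk", "orange juice"))            # 0.5
print(round(confidence(baskets, "orange juice", "milk"), 3))  # 0.667
```

Both rules share the same numerator (baskets containing both items), but the denominators differ, so in this made-up data the two confidences are not equal.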
Which statement is true of association analysis? - Answer-see ppt slide 3
(not: association analysis requires a target variable)
Bayesian classifiers work only with ___ predictors - Answer-Categorical
Multiplot uses two different variables and graphs them together. This is good for
business uses because you are able to see the relationship and the trends of two
variables on top of each other.
Link analysis is a way to see how variables are related to each other. It looks to see
what they have in common, and what they don't have in common. This is important
for business use because it helps you quickly identify why relationships do or do not
exist.
When is it appropriate to use a histogram? - Answer-To display "how many" of each
value occur in a data set
When is it appropriate to use a bar graph? - Answer-For categorical data
When is it appropriate to use a scatterplot? - Answer-For demonstrating a
relationship between two numerical variables
When is it appropriate to use a line graph? - Answer-For data involving time series
The beta coefficients of a logistic regression model... - Answer-Imply an effect on the
predicted probability that may differ for a 1-unit change of an independent variable
(say, from 3 to 4 versus from 5 to 6), while holding other model features constant
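To illustrate the logistic regression item above, the sketch below uses a single-predictor model with a hypothetical intercept of -4 and beta of 1. The coefficient itself is fixed, yet the change in predicted probability for a 1-unit step depends on where the step starts.

```python
import math

def predicted_probability(x, intercept=-4.0, beta=1.0):
    """Logistic model: p = 1 / (1 + exp(-(intercept + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * x)))

def log_odds(p):
    return math.log(p / (1.0 - p))

# Same 1-unit increase, two different starting points
change_3_to_4 = predicted_probability(4) - predicted_probability(3)
change_5_to_6 = predicted_probability(6) - predicted_probability(5)

print(round(change_3_to_4, 3))  # 0.231
print(round(change_5_to_6, 3))  # 0.15

# On the log-odds scale, the change equals beta in both cases.
print(round(log_odds(predicted_probability(4)) - log_odds(predicted_probability(3)), 6))
```

The effect is constant on the log-odds scale and varies only on the probability scale.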