Data Mining Exam Questions with
Verified Answers
T/F: Regression analysis is a poor way to show the relationship between the
dependent variable and independent variable(s) - Answer-False: Regression
analysis is one of the best ways to show the relationship between the two types of
variables.
Out of the six core ideas in data mining, which are associated with unsupervised
learning algorithms?
a.) Association rules, classification, data reduction, data exploration
b.) Data reduction, prediction, data visualization, association rules
c.) Association rules, data visualization, data exploration, data reduction
d.) Prediction, data reduction, data exploration, classification - Answer-C:
Unsupervised learning algorithms are those used where there is no outcome variable
to predict or classify.
T/F: Training data refers to that portion of the data used to assess how well the
model fits. - Answer-False: Training data refers to that portion of the data used to fit
a model. Validation data refers to that portion of the data used to assess how well
the model fits.
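The split described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name partition, the 60/40 ratio, and the fixed seed are all assumptions chosen for the example.

```python
import random

def partition(records, train_fraction=0.6, seed=42):
    """Shuffle the records and split them into training and validation partitions."""
    shuffled = list(records)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                   # stand-in for 100 data rows
train, valid = partition(records)            # fit on train, assess on valid
print(len(train), len(valid))
```

The model is fit only on train; valid is held back so that the assessment uses records the model has not seen.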
True/False: The first step in trying to reduce the number of predictors should always
be to use domain knowledge - Answer-TRUE: This is the first step because it is very
important to understand what the various predictors are measuring and why. By
using domain knowledge, the user can ensure he or she has condensed the data to
a manageable level. This will make finding the solution much easier.
Which of the following is NOT a step to be taken in a typical data mining effort?
A) Develop an understanding of the purpose of the data mining project
B) Obtain the dataset to be used in analysis
C) Use algorithms to perform the task
D) Determine the data mining task
E) Use regression model techniques to manipulate the data - Answer-E) Use
regression model techniques to manipulate the data. Regression is only one of the
possible techniques that can be used. Step 6 is "Choose the data
mining techniques to be used." This includes regression, neural nets, hierarchical
clustering, etc. Even though regression is included, it is not the only possible
technique.
Multiple regression involves which variables?
a) One dependent and one independent
b) Two or more dependent variables
c) Two or more independent variables
d) None of the above - Answer-C: Regression analysis involving two or more
independent variables is called multiple regression.
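As a rough illustration of regression with two independent variables, the sketch below fits y = B0 + B1x1 + B2x2 by solving the least-squares normal equations. The helper names solve and fit_multiple_regression and the toy data are assumptions made for the example. In practice a statistics library would normally do this work.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef

def fit_multiple_regression(xs, y):
    """Return [B0, B1, ..., Bk] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in xs]        # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Toy data generated without noise from y = 1 + 2*x1 + 3*x2
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
coefficients = fit_multiple_regression(xs, y)
print(coefficients)
```

Because the toy data contain no error term, the fitted coefficients should recover the generating values up to floating-point rounding.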
True or False: Two solutions to handling missing data are omission and imputation -
Answer-True: Most algorithms will not process records with missing values. One can
use omission to drop the records with missing values if there are only a few, or one
can use imputation to replace the missing values with reasonable substitutes.
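Both approaches can be sketched on a tiny dataset. The column names, the values, and the choice of the column mean as the "reasonable substitute" are assumptions for illustration. Other substitutes, such as the median, are equally possible.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def omit_missing(rows):
    """Omission: keep only the records with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mean(rows, column):
    """Imputation: replace missing values in one column with that column's mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

complete = omit_missing(rows)
imputed = impute_mean(impute_mean(rows, "age"), "income")
print(len(complete), len(imputed))
```

Omission shrinks the dataset, while imputation keeps every record at the cost of introducing estimated values.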
Which partition is used to develop multiple models and is also usually the largest
partition?
(A.) Validation
(B.) Test
(C.) Normalizing
(D.) Training - Answer-(D.) Training; The training partition is typically the largest
and is used to develop multiple models. It contains the data we use to build the models.
True or False: A good predictive model is one that fits the data closely. - Answer-
False: A good predictive model predicts new cases accurately, whereas an
explanatory model fits data closely.
Categorical is one type of variable. Which of the following is not a categorical
variable?
A. Hair color
B. Gender
C. Integer
D. Political affiliation - Answer-C: There are two types of variables, categorical and
numeric. Categorical variables can be ordered (low, medium, high) or unordered (male
or female). Numeric variables are continuous or integer-valued.
True/False: The equation that describes how y is related to x and the error term is
called the regression model. - Answer-True: The simple linear regression model is
y = B0 + B1x + E. B0 and B1 are called the parameters of the model, and E is a random
variable called the error term.
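To make the roles of B0, B1, and E concrete, the sketch below estimates the two parameters by ordinary least squares and computes the residuals, which are the sample estimates of the error term. The function name fit_simple_regression and the five data points are assumptions for illustration.

```python
def fit_simple_regression(x, y):
    """Estimate B0 (intercept) and B1 (slope) of y = B0 + B1*x + E by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    return b0, b1

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)
# Residuals: the part of each y that the fitted line does not explain
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1)
```

With an intercept in the model, the least-squares residuals sum to zero up to rounding, which is a quick sanity check on the fit.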
True or False: Data Mining is a scientific approach to managerial decision making in
which raw data are processed and manipulated to produce meaningful information? -
Answer-True: In order to make those decisions you must extract data from large data
sets. With data analysis you can detect meaningful patterns and rules, ultimately
finding meaningful correlations, patterns, and trends.
The variable being predicted is called the ? , while variables being used to predict
the value are called the ? .
A: Independent variable, denoted by y / Dependent variables, denoted by x