DATA MINING EXAM REVIEW
QUESTIONS WITH CORRECT ANSWERS
Phase 2: Data Collection - Answer--Initial Data Collection
-Data are often collected for purposes unrelated to the current business problem.
This is very common in most companies.
-Proceeds with activities aimed at:
-Understanding the data: relevance, cost, and reliability
-Identifying data quality problems
Phases 1 & 2 - Answer-The initial formulation may not be complete, optimal, or
feasible, so multiple iterations may be necessary before an acceptable solution
formulation appears. The goal is a data mining formulation that can later be solved
with the available data.
Phase 3: Data preparation - Answer-Can take over 90% of the time!
-Covers all activities to construct the final dataset (the data that will be fed into the
modeling tool(s)) from the initial raw data
Phase 4: Modeling - Answer--Selecting modeling techniques and calibrating their
parameters
-Typically, there are several techniques for the same data mining problem type.
-Generate the test design, and test the model's quality and validity.
Phase 5: Evaluation - Answer--Review process
-Evaluate performance
-Choose the right evaluation metric
Phase 6: Deployment - Answer-Determine how the results need to be utilized
-Who needs to use them?
-How often do they need to be utilized?
Data Preparation Steps on Rattle: - Answer-Step 1: Load and partition the data
Step 2: Identify the correct type of each feature
Step 3: Deal with missing values
Step 4: Transform features into the correct form
Step 5: Identify the correct input and target features
Validation set: - Answer-used to tune parameters in models. Not all modeling
algorithms need a validation set
Test Set: - Answer-To assess the likely future performance of a model (test data
does not participate in the training or parameter tuning steps)
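As a rough illustration of how the validation and test sets are carved out of the data, the sketch below shuffles the rows and splits them three ways. The function name `partition`, the 70/15/15 proportions, and the seed are assumptions chosen for the example, not something the review specifies.

```python
import random

def partition(rows, train=0.70, validation=0.15, seed=42):
    """Shuffle rows and split them into train / validation / test lists.

    The proportions are illustrative assumptions; whatever is left after
    the train and validation slices becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]                    # used to fit the model
    valid_set = shuffled[n_train:n_train + n_valid]   # used to tune parameters
    test_set = shuffled[n_train + n_valid:]           # held out until the final assessment
    return train_set, valid_set, test_set

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

The test rows never appear in the other two lists, which is what lets them stand in for unseen future data.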
Nominal: - Answer-has two or more categories, but there is no intrinsic ordering to
the categories
Ordinal - Answer-similar to nominal, but there is a clear ordering of the categories
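One way to see the nominal/ordinal difference is in how each might be encoded for a model. In this sketch the level names and both helper functions are hypothetical: an ordinal value can be mapped to its rank, while a nominal value gets one indicator per category so that no order is implied.

```python
SIZE_LEVELS = ["small", "medium", "large"]   # hypothetical ordinal levels, lowest first
COLOR_CATEGORIES = ["red", "green", "blue"]  # hypothetical nominal categories

def encode_ordinal(value, levels):
    """Map an ordinal value to its rank; the order of `levels` carries meaning."""
    return levels.index(value)

def encode_nominal(value, categories):
    """One-hot encode a nominal value; no ordering between categories is implied."""
    return [1 if value == c else 0 for c in categories]

print(encode_ordinal("medium", SIZE_LEVELS))
print(encode_nominal("green", COLOR_CATEGORIES))
```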
Reasons for missing values - Answer--Information is not collected
-Attributes may not be applicable to all cases
Handle missing values - Answer--Delete features with missing values
-Delete observations with missing values
-Impute
-Replace or treat as category
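The deletion and treat-as-category options listed above can be sketched on a small table of records. The sample rows, the feature names, and the three helper functions are made up for illustration, and `None` stands in for a missing value.

```python
rows = [
    {"id": 1, "age": 25, "city": "Austin"},
    {"id": 2, "age": None, "city": "Boston"},
    {"id": 3, "age": 40, "city": None},
]

def drop_incomplete_rows(rows):
    """Delete observations that have at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_features_with_missing(rows):
    """Delete every feature (column) that is missing in at least one row."""
    keep = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]

def missing_as_category(rows, feature, label="Missing"):
    """Treat a missing categorical value as its own category."""
    return [{**r, feature: label if r[feature] is None else r[feature]} for r in rows]

print(drop_incomplete_rows(rows))
print(drop_features_with_missing(rows))
print(missing_as_category(rows, "city"))
```

Deleting rows or features is simple but throws information away, which is why imputation is listed as an alternative.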
Imputation - Answer-replacing missing data with substitute values estimated
from the data set
-mean/ median/ mode imputation (Rattle)
-regression imputation
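A minimal sketch of mean/median/mode imputation, using only Python's `statistics` module. The `impute` function and the sample data are assumptions for illustration, and `None` marks a missing value; regression imputation is not shown.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(["a", None, "b", "a"], "mode"))  # mode also works for categorical data
```

The fill value is estimated from the observed entries only, so the missing entries themselves never influence it.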
Normalization - Answer-change the range or distribution of data
Recenter - Answer-Move the distribution such that the mean of the feature is 0
Rescale (usually preferred) - Answer-Scale the feature such that the range is 0 to 1:
(Xi - Xmin) / (Xmax - Xmin)
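The recenter and rescale transformations can be written out directly from their definitions. This is a sketch with hypothetical function names and sample values; the guard for a constant feature is an added assumption, since the rescale formula would otherwise divide by zero.

```python
def recenter(values):
    """Shift the feature so that its mean becomes 0."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def rescale(values):
    """Min-max scale the feature to the range 0 to 1: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]

incomes = [10.0, 20.0, 30.0, 50.0]
print(recenter(incomes))  # values now sum to 0
print(rescale(incomes))   # smallest maps to 0.0, largest to 1.0
```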
Numeric to categorical - Answer-Discretization (sometimes necessary, depends on
the model): recode data into intervals
Quantiles - Answer-Equal-frequency binning: each bin holds (approximately) the same number of observations
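Quantile-based discretization can be sketched by ranking the values and cutting the ranks into equal-sized groups. The function name and the data are hypothetical, and this rank-based version may place tied values in different bins, which a real tool might handle differently.

```python
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so bins hold near-equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

values = [5, 1, 9, 3, 7, 2, 8, 4]
bins = equal_frequency_bins(values, n_bins=4)
print(bins)
print(Counter(bins))  # each of the 4 bins receives 2 of the 8 values
```

Unlike equal-width intervals, the bin boundaries here follow the data, so a skewed feature still fills every bin.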
Kurtosis - Answer--a measure of "tailedness"
-a useful measure of whether there is a problem with outliers in a data set. Larger
kurtosis indicates a more serious outlier problem
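To make the link between kurtosis and outliers concrete, the sketch below computes the moment-based (population) kurtosis, the fourth central moment divided by the squared variance, with the option of subtracting 3 to get excess kurtosis. Statistical tools differ in which variant and which bias correction they report, so treat this formula choice and the sample data as assumptions.

```python
def kurtosis(values, excess=True):
    """Moment-based kurtosis: m4 / m2**2, minus 3 when excess=True."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment (variance)
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    k = m4 / m2 ** 2
    return k - 3.0 if excess else k

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = clean[:-1] + [100]  # replace the largest value with an extreme one

print(kurtosis(clean))
print(kurtosis(with_outlier))  # noticeably larger than for the clean data
```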
Numeric Attributes, Single Feature Visualization - Answer-Box plot, histogram, cumulative plot
Numeric Attributes, Pairs of Features Visualization - Answer-Scatter Plot
Categorical Attributes, Single Feature Visualization - Answer-Bar plot, dot plot