Data Mining Exam #1 Questions with
Latest Update
Co-occurrence grouping - Answer-Also known as frequent itemset mining, association
rule discovery, and market-basket analysis. Finds associations between entities
based on the transactions involving them.
Examples: product display, product recommendation (e.g. Amazon), etc.
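A minimal sketch of the idea behind co-occurrence grouping: count how often each pair of items appears in the same transaction and report that as support (the fraction of transactions containing both). The basket data and the function name `pair_support` are made up for illustration; real association-rule miners such as Apriori add pruning and rule generation on top of this.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return {(item_a, item_b): support} for every pair that co-occurs."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical baskets, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
support = pair_support(baskets)
for pair, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, value)
```

Pairs with high support are candidates for joint display or recommendation.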
Data reduction - Answer-To replace a large data set with a smaller set of data that
contains much of the important information in the large data set. Usually involves
loss of information; trade-off.
Goal of classification: - Answer-find a decision boundary (represented by a model)
that separates one class from the other.
Use of training data - Answer-to find a model that optimizes a pre-defined
objective
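To make the two cards above concrete, here is a rough sketch in which the "model" is a single cut-off on one feature (the decision boundary) and the pre-defined objective is the number of training errors. The data, the feature name `income`, and the function `fit_threshold` are assumptions for illustration only; real classifiers use richer models and objectives.

```python
def fit_threshold(xs, ys):
    """Pick the cut-off t that minimizes training errors for the rule
    'predict class 1 when x > t'."""
    points = sorted(set(xs))
    # Candidate boundaries: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best_t, best_errors = None, len(xs) + 1
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


# Hypothetical training data: input X (income) and target Y (bought or not).
income = [20, 25, 30, 45, 50, 60]
bought = [0, 0, 0, 1, 1, 1]
threshold, errors = fit_threshold(income, bought)
print(threshold, errors)
```

The training data is used only to choose the threshold; how well that threshold works on new cases is a separate question.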
Supervised learning - Answer-training data includes both the input (X) and the target
variable (Y)
Unsupervised learning - Answer-the model is NOT provided with the target variable
(Y) during training
Supervised learning examples - Answer--Classification
-Regression
-Data Reduction
Unsupervised learning examples - Answer--Clustering
-Co-occurrence Grouping
-Data Reduction
Why CRISP-DM? - Answer-CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
The data mining process must be consistent, reliable and repeatable.
CRISP-DM provides a uniform framework for guidelines and for documenting experience.
CRISP-DM process - Answer-Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Phase 1: Business Understanding - Answer--Understanding the project objectives
and requirements from a business perspective
-Casting the business problem as one or more DM problems and creating a
preliminary plan to achieve the objectives
Phase 2: Data Understanding - Answer--Initial data collection
-Data are often collected for purposes unrelated to the current business problem.
This is very common in most companies.
-Proceeds with activities aimed at:
-Understanding the data: relevance, cost and reliability
-Identifying data quality problems
Phases 1 & 2 - Answer-The initial formulation may not be complete, optimal, or
feasible, so multiple iterations may be necessary for an acceptable solution
formulation to appear. The goal is a successful data mining formulation that can
be solved later with the available data.
Phase 3: Data Preparation - Answer-Can take over 90% of the time!
-Covers all activities to construct the final dataset (the data that will be fed into the
modeling tool(s)) from the initial raw data
Phase 4: Modeling - Answer--Selecting modeling techniques and calibrating their
parameters
-Typically, there are several techniques for the same data mining problem type.
-Generate the test design, and test the model's quality and validity.
Phase 5: Evaluation - Answer--Review process
-Evaluate performance
-Choose the right evaluation metric
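Why the choice of metric matters can be illustrated with a small sketch. The data below is hypothetical: 5 positives in 100 cases, scored against a model that always predicts the negative class. The function `evaluate` and the convention of returning 0.0 when a ratio is undefined are assumptions made for this example.

```python
def evaluate(actual, predicted):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed convention: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical imbalanced data set and a model that never predicts positive.
actual = [1] * 5 + [0] * 95
always_negative = [0] * 100
metrics = evaluate(actual, always_negative)
print(metrics)
```

Accuracy looks high here even though the model finds none of the positive cases, which is why recall or precision may be the better metric when the classes are imbalanced.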
Phase 6: Deployment - Answer-Determine how the results need to be utilized
-Who needs to use them?
-How often do they need to be utilized?
Data Preparation Steps in Rattle: - Answer-Step 1: Load and partition the data
Step 2: Identify the correct type of each feature
Step 3: Deal with missing values
Step 4: Transform features into the correct form
Step 5: Identify the correct input and target variables
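Rattle performs these steps through its GUI. As a rough illustration of what Steps 3 and 4 amount to, the sketch below fills missing numeric values with the column mean and converts a categorical feature into 0/1 indicator columns. This is plain Python, not Rattle's own code; the column names, the values, and the choice of mean imputation are assumptions for the example.

```python
from statistics import mean


def impute_mean(values):
    """Step 3 (one possible strategy): replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Step 4 (one possible transform): turn a categorical column into 0/1 indicators."""
    levels = sorted(set(values))
    return {level: [int(v == level) for v in values] for level in levels}


# Hypothetical columns, invented for illustration.
age = [34, None, 52, 40]
region = ["north", "south", "north", "east"]

age_filled = impute_mean(age)
region_dummies = one_hot(region)
print(age_filled)
print(region_dummies)
```

Other strategies (dropping rows, median or mode imputation, rescaling) follow the same pattern of replacing a raw column with a cleaned one before modeling.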
Validation set: - Answer-used to tune parameters in models. Not all modeling
algorithms need a validation set
Test Set: - Answer-To assess the likely future performance of a model (test data
does not participate in the training or parameter tuning steps)
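The roles of the three partitions can be sketched as a simple random split. The 70/15/15 ratio, the seed, and the function name `partition` are assumptions for illustration; the key property is that the test rows are set aside and never used for training or parameter tuning.

```python
import random


def partition(rows, train=0.70, validation=0.15, seed=42):
    """Shuffle the rows and split them into train, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                  # used to fit the model
        shuffled[n_train:n_train + n_val],   # used to tune parameters
        shuffled[n_train + n_val:],          # held out to estimate future performance
    )


rows = list(range(100))  # stand-in for 100 data records
train_set, validation_set, test_set = partition(rows)
print(len(train_set), len(validation_set), len(test_set))
```

A model would be fit on `train_set`, its parameters compared on `validation_set`, and the final choice scored once on `test_set`.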