Data Mining Test 1 Questions with
Complete Solutions
Pattern Evaluation - Answer-To identify the truly interesting patterns representing
knowledge based on interestingness measures.
Knowledge Presentation - Answer-The step in which visualization and knowledge
representation techniques are used to present mined knowledge to users.
5-Number summary - Answer-Consists of the following: Minimum, Quartile 1 (Q1),
Median, Quartile 3 (Q3) and Max.
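As a rough illustration of the five values, here is a minimal Python sketch. The function name five_number_summary and the sample data are made up for this example, and the "inclusive" quartile method is an assumption: textbooks and tools use several quartile conventions, so Q1 and Q3 may differ slightly elsewhere.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    # method="inclusive" is one convention among several (an assumption here).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical sample data.
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# prints (1, 3.0, 5.0, 7.0, 9)
```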
Data sets with one, two, or three modes are respectively called: - Answer-Uni-modal,
Bi-modal, and Tri-modal.
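One way to check how many modes a data set has is sketched below in Python. The helper describe_modality is a hypothetical name; it relies on statistics.multimode, which returns every value tied for the highest frequency. Note that if all values are distinct, every value counts as a mode under this approach.

```python
import statistics


def describe_modality(values):
    """Return the list of modes and a label for how many there are."""
    modes = statistics.multimode(values)  # all values with the top frequency
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    # More than three modes is simply called multimodal.
    return modes, names.get(len(modes), "multimodal")


print(describe_modality([1, 2, 2, 3]))     # one mode: 2
print(describe_modality([1, 1, 2, 2, 3]))  # two modes: 1 and 2
```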
How is the Interquartile Range calculated? - Answer-Quartile 3 minus Quartile 1
(IQR = Q3 - Q1).
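The formula can be sketched in a few lines of Python. The function name interquartile_range and the data are illustrative only, and the "inclusive" quartile method is an assumption; a different quartile convention can give a slightly different IQR.

```python
import statistics


def interquartile_range(values):
    """Return IQR = Q3 - Q1 for a list of numbers."""
    # quantiles() sorts the data itself; the middle cut point is the median.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data: Q1 = 3.0 and Q3 = 7.0, so the IQR is 4.0.
print(interquartile_range([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```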
What are the primary factors that comprise data quality? - Answer-Accuracy,
completeness, consistency, timeliness, believability, and interpretability.
Data quality - Accuracy - Answer-Whether the recorded values are correct. Inaccurate
data can be caused by faulty instruments during data recording, human or computer
error, or disguised missing data (values that users intentionally enter incorrectly).
Data quality - Completeness - Answer-Whether the required data are present. Data can
be missing because they were unavailable, because they were not considered useful at
the time and so were never recorded, or because of equipment malfunctions, etc.
Data quality - Timeliness - Answer-Whether data are recorded and updated on time.
Data that arrive late or at inconsistent intervals reduce data quality. For example,
if sales representatives submit their sales records at different intervals, the data
used to determine bonuses for the top-performing representatives will be inaccurate.
Data Quality - Believability - Answer-Reflects how much the data are trusted by
users.
Data Quality - Interpretability - Answer-Reflects how easily the data are understood.
Machine learning - Answer-Investigates how computers can learn or improve their
performance based on data.
Supervised learning - Answer-Basically a synonym for classification. The supervision
in the learning comes from the labeled examples in the training data set. For
example, in the postal code recognition problem, a set of handwritten postal code
images and their corresponding machine-readable translations are used as the
training examples, which supervise the learning of the classification model.
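To make the role of labeled examples concrete, here is a small Python sketch of a 1-nearest-neighbor classifier. This technique is chosen only for illustration and is not named in the card; the function name, the two-dimensional points, and the labels "A" and "B" are all hypothetical stand-ins for real features such as image pixels.

```python
import math


def nearest_neighbor_predict(training_examples, query):
    """Predict the label of `query` from labeled (features, label) pairs."""
    # The labels in the training data are the "supervision": the prediction
    # is the label of the closest training example (Euclidean distance).
    _features, label = min(
        training_examples, key=lambda example: math.dist(example[0], query)
    )
    return label


# Hypothetical labeled training set.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.2), "B"),
]
print(nearest_neighbor_predict(training, (0.9, 1.1)))  # prints A
print(nearest_neighbor_predict(training, (5.1, 4.9)))  # prints B
```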
Unsupervised learning - Answer-Essentially a synonym for clustering. The learning
process is unsupervised since the input examples are not class labeled. Typically,
we may use clustering to discover classes within the data. For example, an
unsupervised learning method can take, as input, a set of images of handwritten
digits. Suppose that it finds 10 clusters of data. These clusters may correspond to
the 10 distinct digits of 0 to 9, respectively.
Semi-supervised learning - Answer-A class of machine learning techniques that make
use of both labeled and unlabeled examples when learning a model. In one approach,
labeled examples are used to learn class models and unlabeled examples are used
to refine the boundaries between classes. For a two-class problem, we can think of
the set of examples belonging to one class as the positive examples and those
belonging to the other class as the negative examples.
Active learning - Answer-A machine learning approach that lets users play an active
role in the learning process. An active learning approach can ask a user (e.g., a
domain expert) to label an example, which may be from a set of unlabeled examples
or synthesized by the learning program.
Outlier - Answer-A data object that does not comply with the general behavior or
model of the data.
Data discrimination - Answer-A comparison of the general features of the target class
data objects against the general features of objects from one or multiple contrasting
classes. The target and contrasting classes can be specified by a user, and the
corresponding data objects can be retrieved through database queries.
Data cube - Answer-A multidimensional data structure in which each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure such as count.
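A count-based cube can be sketched with a plain Python Counter, as below. The function name build_count_cube and the sales records (city, item, year) are hypothetical; a real data cube would typically precompute many such aggregations, so this is only a minimal picture of "dimensions as attributes, cells as counts".

```python
from collections import Counter


def build_count_cube(records, dimensions):
    """Map each cell (one value per dimension) to the count of records in it."""
    cube = Counter()
    for record in records:
        cell = tuple(record[d] for d in dimensions)
        cube[cell] += 1
    return cube


# Hypothetical sales records.
records = [
    {"city": "Chicago", "item": "TV", "year": 2023},
    {"city": "Chicago", "item": "TV", "year": 2024},
    {"city": "Chicago", "item": "Phone", "year": 2024},
    {"city": "Toronto", "item": "TV", "year": 2024},
]

by_city_item = build_count_cube(records, ["city", "item"])
print(by_city_item[("Chicago", "TV")])  # prints 2

# Aggregating over fewer dimensions gives a coarser view of the same data.
by_city = build_count_cube(records, ["city"])
print(by_city[("Chicago",)])  # prints 3
```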
Cluster Analysis - Answer-Analyzes data objects without consulting class labels.
Clustering can be used to generate class labels for a group of data. Clusters of
objects are formed so that objects within a cluster have high similarity in comparison
to one another, but are rather dissimilar to objects in other clusters.
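As one possible illustration, the Python sketch below groups unlabeled points with a basic k-means loop. k-means is only one of many clustering methods and is not named in the card; the function name, the sample points, and the choice to use the first k points as starting centroids (made so the result is repeatable) are all assumptions for this example.

```python
import math


def k_means(points, k, iterations=10):
    """Cluster points into k groups; return (centroids, cluster index per point)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    assignments = [
        min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
    ]
    return centroids, assignments


# Hypothetical unlabeled data: two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = k_means(points, k=2)
print(labels)  # prints [0, 1, 0, 1, 0, 1]
```

The returned indices act as generated class labels: no labels were supplied, yet each point ends up tagged with the group it is most similar to.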
Outlier Analysis - Answer-Rather than discarding outliers as noise, outlier analysis
uses them to observe interesting behaviors. A typical application is fraud detection.
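One common rule of thumb, not stated in the card itself, flags values lying more than 1.5 x IQR below Q1 or above Q3. The Python sketch below applies it to hypothetical transaction amounts; the function name find_outliers, the 1.5 multiplier, and the "inclusive" quartile method are assumptions for this example.

```python
import statistics


def find_outliers(values, multiplier=1.5):
    """Return the values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(find_outliers(amounts))  # prints [100]
```

In a fraud detection setting, the flagged values would be candidates for closer inspection rather than records to delete.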
What are the 6 methods to handle Missing Values? - Answer-1. Ignore the tuple.
2. Fill in the missing value manually.
3. Use a global constant to fill in the missing value.
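4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value.
5. Use the attribute mean or median for all samples belonging to the same class as
the given tuple.
6. Use the most probable value to fill in the missing value (e.g., as predicted by
regression, Bayesian inference, or decision tree induction).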
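Methods 3, 4, and 5 can be sketched in Python as follows. This is a minimal illustration, assuming missing values are represented as None; the function names, the "Unknown" constant, and the income/risk records are hypothetical.

```python
import statistics


def fill_with_constant(values, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Method 4: replace missing values with the attribute mean."""
    mean = statistics.mean([v for v in values if v is not None])
    return [mean if v is None else v for v in values]


def fill_with_class_mean(rows, attribute, class_attribute):
    """Method 5: replace missing values with the mean of the same class."""
    observed = {}
    for row in rows:
        if row[attribute] is not None:
            observed.setdefault(row[class_attribute], []).append(row[attribute])
    class_means = {c: statistics.mean(vs) for c, vs in observed.items()}
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attribute] is None:
            # Stays None if the class has no observed values at all.
            new_row[attribute] = class_means.get(new_row[class_attribute])
        filled.append(new_row)
    return filled


# Hypothetical customer records with a class label "risk".
customers = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(customers, "income", "risk")
print(filled[1]["income"], filled[4]["income"])  # prints 50 100
```

Using the class mean rather than the overall mean keeps a filled-in value closer to what is typical for similar tuples.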