CSC 533 DATA MINING FINAL EXAM
QUESTIONS AND ANSWERS
Concept characterization - Answer-can be implemented using data cube (OLAP-
based) approaches and the attribute-oriented induction approach. These are
attribute- or dimension-based generalization approaches. The attribute-oriented
induction approach consists of the following techniques: data focusing, data
generalization by attribute removal or attribute generalization, count and aggregate
value accumulation, attribute generalization control, and generalization data
visualization.
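The attribute-oriented induction steps named above can be sketched on a toy relation. This is a minimal illustration, not a full implementation: the `students` rows, the `CITY_TO_COUNTRY` hierarchy, and the `gpa_level` thresholds are all hypothetical, and generalization control (deciding how far to climb a hierarchy) is not modeled.

```python
from collections import Counter

# Hypothetical task-relevant data; data focusing is assumed to have selected these tuples.
students = [
    {"name": "Ann", "city": "Seattle", "gpa": 3.8},
    {"name": "Bob", "city": "Portland", "gpa": 3.1},
    {"name": "Cam", "city": "Seattle", "gpa": 3.6},
    {"name": "Dee", "city": "Vancouver", "gpa": 2.9},
]

# Hypothetical concept hierarchy used for attribute generalization.
CITY_TO_COUNTRY = {"Seattle": "USA", "Portland": "USA", "Vancouver": "Canada"}


def gpa_level(gpa):
    """Map a numeric GPA to a higher-level concept label (illustrative thresholds)."""
    if gpa >= 3.5:
        return "excellent"
    if gpa >= 3.0:
        return "good"
    return "fair"


def generalize(rows):
    """Return generalized tuples together with their accumulated counts."""
    counts = Counter()
    for row in rows:
        # Attribute removal: 'name' has many distinct values and no hierarchy, so it is dropped.
        # Attribute generalization: city -> country, gpa -> level.
        key = (CITY_TO_COUNTRY[row["city"]], gpa_level(row["gpa"]))
        counts[key] += 1  # count accumulation when identical generalized tuples merge
    return dict(counts)


generalized = generalize(students)
```

Each key of `generalized` is one generalized tuple, and its value is the number of original tuples merged into it.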
Concept comparison - Answer-can be performed using the attribute-oriented
induction or data cube approaches in a manner similar to concept characterization.
Generalized tuples from the target and contrasting classes can be quantitatively
compared and contrasted.
Data quality - Answer-is defined in terms of accuracy, completeness, consistency,
timeliness, believability, and interpretability. These qualities are assessed based on
the intended use of the data.
Data cleaning - Answer-routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Data cleaning is
usually performed as an iterative two-step process consisting of discrepancy
detection and data transformation.
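Two of the cleaning routines mentioned above, filling in missing values and smoothing noise, might look like the following sketch. It assumes missing values are encoded as `None`, fills them with the attribute mean (one of several possible strategies), and smooths by equal-frequency bin means; the function names are illustrative.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed


filled = fill_missing_with_mean([10, None, 20, None, 30])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
```

With these inputs, the two `None` entries are replaced by 20, and the nine sorted prices collapse to the three bin means 9, 22, and 29.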
Data integration - Answer-combines data from multiple sources to form a coherent
data store. The resolution of semantic heterogeneity, metadata, correlation analysis,
tuple duplication detection, and data conflict detection contribute to smooth data
integration.
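As one illustration of correlation analysis during integration, the sketch below computes the Pearson correlation coefficient between two numeric attributes. A coefficient near +1 or -1 suggests one attribute may be redundant. The `celsius` and `fahrenheit` attributes are hypothetical, and the cutoff for "redundant" is left to the analyst.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical attributes from two sources that record the same quantity.
celsius = [0, 10, 20, 30]
fahrenheit = [32, 50, 68, 86]
r = pearson(celsius, fahrenheit)
```

Here `r` is 1 up to floating-point rounding, because one attribute is a linear function of the other.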
Data reduction - Answer-techniques obtain a reduced representation of the data
while minimizing the loss of information content. These include methods of
dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction - Answer-reduces the number of random variables or
attributes under consideration. Methods include wavelet transforms, principal
components analysis, attribute subset selection, and attribute creation.
Numerosity reduction - Answer-methods use parametric or nonparametric models to
obtain smaller representations of the original data. Parametric models store only the
model parameters instead of the actual data. Examples include regression and
log-linear models. Nonparametric methods include histograms, clustering, sampling,
and data cube aggregation.
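The parametric and nonparametric ideas can be contrasted in a short sketch. The first function fits a least-squares line so that only two parameters need to be stored. The second draws a simple random sample without replacement. The data, function names, and seed are illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (parametric reduction).
    Only the two parameters are kept, not the data points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def simple_random_sample(rows, size, seed=0):
    """Simple random sample without replacement (nonparametric reduction).
    The seed is fixed only to make the sketch repeatable."""
    return random.Random(seed).sample(rows, size)


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
sample = simple_random_sample(list(zip(xs, ys)), size=3)
```

For this perfectly linear data the fit recovers a slope of 2 and an intercept of 1, so five points are replaced by two numbers. The sample keeps three of the original five tuples.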
Data compression - Answer-methods apply transformations to obtain a reduced or
"compressed" representation of the original data. The data reduction is lossless if the
original data can be reconstructed from the compressed data without any loss of
information; otherwise, it is lossy.
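The lossless/lossy distinction can be checked directly. The sketch below uses Python's standard `zlib` module for a lossless round trip and simple rounding as a stand-in for a lossy reduction. The sample data are made up for illustration.

```python
import zlib

# Lossless: the original bytes are reconstructed exactly after decompression.
original = b"data mining " * 50
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossy (illustrative stand-in): rounding discards digits that cannot be recovered.
readings = [3.14159, 2.71828]
rounded = [round(x, 1) for x in readings]
```

After the round trip, `restored` equals `original` and `compressed` is shorter than `original`. The values in `rounded` no longer match `readings`, so the original cannot be rebuilt from them.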
Data transformation - Answer-routines convert the data into appropriate forms for
mining. For example, in normalization, attribute data are scaled so as to fall within a
small range such as 0.0 to 1.0. Other examples are data discretization and concept
hierarchy generation.
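The normalization mentioned above is commonly done with min-max scaling. The following is a minimal sketch under that assumption. The function name, the sample incomes, and the handling of a constant attribute are illustrative choices.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
```

The smallest income maps to 0.0, the largest to 1.0, and 73600 maps to about 0.716.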
Data discretization - Answer-transforms numeric data by mapping values to interval
or concept labels. Such methods can be used to automatically generate concept
hierarchies for the data, which allows for mining at multiple levels of granularity.
Discretization techniques include binning, histogram analysis, cluster analysis,
decision tree analysis, and correlation analysis. For nominal data, concept
hierarchies may be generated based on schema definitions as well as the number of
distinct values per attribute.
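Equal-width binning, the simplest of the discretization techniques listed above, can be sketched as follows. Interval indices are then mapped to concept labels. The ages and the `AGE_LABELS` names are hypothetical.

```python
def equal_width_bins(values, k):
    """Return a bin index in 0..k-1 for each value, using k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute falls entirely into bin 0.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [13, 22, 35, 47, 61, 73]
AGE_LABELS = ["youth", "middle_aged", "senior"]  # hypothetical concept labels
bins = equal_width_bins(ages, 3)
labels = [AGE_LABELS[b] for b in bins]
```

With three bins over the range 13 to 73, each interval is 20 wide, so the six ages fall two per bin.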
Data mining - Answer-is the process of discovering interesting patterns from massive
amounts of data. As a knowledge discovery process, it typically involves data
cleaning, data integration, data selection, data transformation, pattern discovery,
pattern evaluation, and knowledge presentation.
Metadata - Answer-are data defining the warehouse objects. A metadata repository
provides details regarding the warehouse structure, data history, algorithms used for
summarization, mappings from the source data to the warehouse form, system
performance, and business terms and issues.
Multidimensional data model - Answer-is typically used for the design of corporate
data warehouses and departmental data marts. Such a model can adopt a star schema,
snowflake schema, or fact constellation schema. The core is the data cube, which
consists of a large set of facts (or measures) and a number of dimensions.
Dimensions - Answer-are the entities or perspectives with respect to which an
organization wants to keep records and are hierarchical in nature.
Data cube - Answer-consists of a lattice of cuboids, each corresponding to a different
degree of summarization of the given multidimensional data.
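The lattice of cuboids can be enumerated as one group-by per subset of the dimensions. The sketch below assumes a tiny hypothetical fact table with three dimensions and a `sales` measure. It ignores concept hierarchies, in which case n dimensions yield 2^n cuboids.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: three dimensions and one measure.
facts = [
    {"time": "Q1", "item": "phone", "location": "Chicago", "sales": 10},
    {"time": "Q1", "item": "laptop", "location": "Chicago", "sales": 20},
    {"time": "Q2", "item": "phone", "location": "Toronto", "sales": 5},
]
DIMENSIONS = ("time", "item", "location")


def cuboid(rows, dims, measure="sales"):
    """Group rows by the given dimensions and sum the measure."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def lattice(rows, dimensions):
    """Compute every cuboid, from the apex (no dimensions) to the base (all dimensions)."""
    return {
        dims: cuboid(rows, dims)
        for size in range(len(dimensions) + 1)
        for dims in combinations(dimensions, size)
    }


cube = lattice(facts, DIMENSIONS)
```

`cube[()]` is the apex cuboid holding the grand total, and `cube[DIMENSIONS]` is the base cuboid with no summarization.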
Concept hierarchies - Answer-organize the values of attributes or dimensions into
gradual abstraction levels.
Online Analytical Processing (OLAP) - Answer-can be performed in data
warehouses/marts using the multidimensional data model. Operations include roll-up,
drill-(down, across, through), slice and dice, and pivot (rotate). OLAP manipulates
information to create business intelligence in support of strategic decision making.
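Two of the operations named above, roll-up and slice, can be illustrated on a small in-memory fact list. This is a sketch under assumed data: the `sales` tuples and the `CITY_TO_COUNTRY` hierarchy are hypothetical, and real OLAP servers perform these operations over stored cuboids rather than Python lists.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}

# Hypothetical facts as (quarter, city, sales amount).
sales = [
    ("Q1", "Chicago", 10),
    ("Q1", "New York", 15),
    ("Q1", "Toronto", 5),
    ("Q2", "Chicago", 20),
]


def roll_up_location(rows):
    """Roll up: climb the location hierarchy from city to country and re-aggregate."""
    totals = defaultdict(int)
    for quarter, city, amount in rows:
        totals[(quarter, CITY_TO_COUNTRY[city])] += amount
    return dict(totals)


def slice_quarter(rows, quarter):
    """Slice: select on one dimension (time), giving a subcube for that quarter."""
    return [row for row in rows if row[0] == quarter]


by_country = roll_up_location(sales)
q2_only = slice_quarter(sales, "Q2")
```

Drill-down is the reverse of the roll-up shown here: it moves from the country totals back to the per-city facts.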
Data warehouses - Answer-are used for information processing (querying and
reporting), analytical processing (which allows users to navigate through summarized
and detailed data by OLAP operations), and data mining, which supports knowledge
discovery.
Multidimensional data mining - Answer-is also known as exploratory multidimensional
data mining, online analytical mining, or OLAM.