CSC 533 Data Mining Final Exam
Questions and Answers
metadata - Answer-data defining the warehouse objects. A metadata repository provides
details regarding the warehouse structure, data history, algorithms used for
summarization, mappings from the source data to the warehouse form, system
performance, and business terms and issues.
multidimensional data model - Answer-typically used for the design of corporate data
warehouses and departmental data marts. Such a model can adopt a star schema,
snowflake schema, or fact constellation schema. The core is the data cube, which
consists of a large set of facts or measures and a number of dimensions.
dimensions - Answer-are the entities or perspectives with respect to which an
organization wants to keep records and are hierarchical in nature
data cube - Answer-consists of a lattice of cuboids each corresponding to a different
degree of summarization of the given multidimensional data
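To make the lattice concrete, here is a minimal sketch that builds every cuboid of a tiny cube. The fact table, the dimension names (city, item, year), and the sales measure are hypothetical and chosen only for illustration. With n dimensions and no concept hierarchies, the lattice has 2^n cuboids.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure (sales).
DIMENSIONS = ("city", "item", "year")
FACTS = [
    {"city": "Chicago", "item": "TV", "year": 2023, "sales": 100},
    {"city": "Chicago", "item": "Phone", "year": 2023, "sales": 50},
    {"city": "Toronto", "item": "TV", "year": 2024, "sales": 70},
]

def cuboid(facts, dims):
    """Sum the sales measure grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row["sales"]
    return dict(totals)

def lattice(facts, dimensions):
    """Build every cuboid, from the base cuboid (all dimensions)
    to the apex cuboid (no dimensions, i.e. the grand total)."""
    return {
        dims: cuboid(facts, dims)
        for k in range(len(dimensions), -1, -1)
        for dims in combinations(dimensions, k)
    }

cube = lattice(FACTS, DIMENSIONS)
print(len(cube))        # number of cuboids in the lattice
print(cube[()])         # apex cuboid
print(cube[("city",)])  # 1-D cuboid summarized by city
```

Each key of `cube` names one cuboid, so moving from `("city", "item", "year")` toward `()` corresponds to a higher degree of summarization.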
concept hierarchies - Answer-organize the values of attributes or dimensions into
gradual abstraction levels
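As a small illustration, a concept hierarchy can be sketched as a mapping from each value to its parent at the next abstraction level. The location hierarchy below (city, then country, then all) and its values are assumptions made up for the example.

```python
# Hypothetical location hierarchy: city -> country -> all.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
}
LEVELS = ("city", "country", "all")

def generalize(city, level):
    """Return the value of `city` at the requested abstraction level."""
    if level == "city":
        return city
    if level == "country":
        return CITY_TO_COUNTRY[city]
    if level == "all":
        return "all"
    raise ValueError(f"unknown level: {level}")

print([generalize("Toronto", level) for level in LEVELS])
```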
Online Analytical Processing (OLAP) - Answer-can be performed in data
warehouses/marts using the multidimensional data model. Operations include roll-up,
drill-(down, across, through), slice and dice, and pivot (rotate). OLAP manipulates
information to create business intelligence in support of strategic decision making.
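The operations named above can be sketched on a list of fact rows. This is a simplified illustration in plain Python, not how an OLAP server implements them; the rows, dimension names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows with three dimensions and a sales measure.
SALES = [
    {"city": "Chicago", "quarter": "Q1", "item": "TV", "sales": 100},
    {"city": "Chicago", "quarter": "Q2", "item": "TV", "sales": 80},
    {"city": "Toronto", "quarter": "Q1", "item": "Phone", "sales": 60},
    {"city": "Toronto", "quarter": "Q2", "item": "TV", "sales": 40},
]

def roll_up(rows, keep):
    """Roll up by dimension reduction: aggregate over every dimension
    not listed in `keep`. Drill-down is the reverse (keep more dimensions)."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in keep)] += r["sales"]
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: select on a single value of one dimension."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: select on sets of values for two or more dimensions."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

by_city = roll_up(SALES, ("city",))
q1_rows = slice_(SALES, "quarter", "Q1")
chicago_rows = dice(SALES, city={"Chicago"}, quarter={"Q1", "Q2"})
print(by_city)
```

Pivot only changes how the same cells are laid out for display (for example, swapping which dimension is shown as rows and which as columns), so it is omitted here.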
data warehouses - Answer-used for information processing (querying and reporting),
analytical processing (which allows users to navigate through summarized and
detailed data by OLAP operations), and data mining, which supports knowledge
discovery.
multidimensional data mining - Answer-also known as exploratory multidimensional
data mining, online analytical mining, or OLAM
OLAP server types - Answer-relational OLAP, multidimensional OLAP, or a hybrid
OLAP implementation. A ROLAP server uses an extended relational DBMS that
maps OLAP operations on multidimensional data to standard relational operations. A
MOLAP server maps multidimensional data views directly to array structures. A
HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for
historic data while maintaining frequently accessed data in a separate MOLAP store
Full Materialization - Answer-refers to the computation of all the cuboids in the lattice
defining a data cube. It typically requires an excessive amount of storage space,
particularly as the number of dimensions and size of associated concept hierarchies
grow. This problem is known as the curse of dimensionality. Alternatively, partial
materialization is the selective computation of a subset of the cuboids or subcubes in
the lattice. For example, an iceberg cube is a data cube that stores only those cube
cells that have an aggregate value (e.g., count) above some minimum support
threshold.
Bitmap indexing - Answer-each attribute has its own bitmap index table. Bitmap
indexing reduces join, aggregation, and comparison operations to bit arithmetic.
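A minimal sketch of the idea, using Python integers as bit vectors (bit i is set when row i has the value). The column contents and helper names are hypothetical; real systems typically store compressed bitmaps.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is 1 if row i has that value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def row_ids(bitmap):
    """Decode a bitmap back into the list of matching row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical columns of a four-row table.
city = ["Chicago", "Toronto", "Chicago", "Toronto"]
item = ["TV", "TV", "Phone", "Phone"]

city_index = build_bitmap_index(city)
item_index = build_bitmap_index(item)

# "city = Chicago AND item = TV" becomes a single bitwise AND.
match = city_index["Chicago"] & item_index["TV"]
print(row_ids(match))          # matching row ids
print(bin(match).count("1"))   # COUNT(*) is a population count
```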
Join indexing - Answer-registers the joinable rows of two or more relations from a
relational database, reducing the overall cost of OLAP join operations
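As a rough sketch, a join index between a dimension table and a fact table can be precomputed as a mapping from a dimension value to the ids of the fact rows that join with it. The tables, the key name `location_key`, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical dimension table: location_key -> city.
LOCATION = {1: "Chicago", 2: "Toronto"}

# Hypothetical fact table; list position serves as the row id.
SALES = [
    {"location_key": 1, "sales": 100},
    {"location_key": 2, "sales": 70},
    {"location_key": 1, "sales": 50},
]

def build_join_index(facts, dimension, foreign_key):
    """Record, for each dimension value, which fact rows join with it."""
    index = defaultdict(list)
    for row_id, row in enumerate(facts):
        index[dimension[row[foreign_key]]].append(row_id)
    return dict(index)

join_index = build_join_index(SALES, LOCATION, "location_key")

# The join is looked up rather than recomputed at query time.
chicago_total = sum(SALES[i]["sales"] for i in join_index["Chicago"])
print(join_index, chicago_total)
```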
Bitmapped join indexing - Answer-combines the bitmap and join index methods and can
be used to further speed up OLAP query processing.
Data generalization - Answer-is a process that abstracts a large set of task-relevant
data in a database from a relatively low conceptual level to higher conceptual levels.
Data generalization approaches include data cube-based data aggregation and
attribute-oriented induction
Concept description - Answer-is the most basic form of descriptive data mining. It
describes a given set of task-relevant data in a concise and summarative manner,
presenting interesting general properties of the data. Concept (or class) description
consists of characterization and comparison (or discrimination). The former
summarizes and describes a data collection, called the target class, whereas the
latter summarizes and distinguishes one data collection, called the target class, from
other data collection(s), collectively called the contrasting class(es).
Concept characterization - Answer-can be implemented using data cube (OLAP-
based) approaches and the attribute-oriented induction approach. These are
attribute- or dimension-based generalization approaches. The attribute-oriented
induction approach consists of the following techniques: data focusing, data
generalization by attribute removal or attribute generalization, count and aggregate
value accumulation, attribute generalization control, and generalization data
visualization.
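The steps of attribute-oriented induction can be sketched on a toy relation. The student records, the major hierarchy, and the GPA bands below are all hypothetical; the sketch covers attribute removal, attribute generalization, and count accumulation, and leaves out generalization control thresholds and visualization.

```python
from collections import Counter

# Hypothetical task-relevant data (the result of data focusing).
STUDENTS = [
    {"name": "Ann", "major": "CS", "gpa": 3.9},
    {"name": "Bob", "major": "Physics", "gpa": 3.6},
    {"name": "Cy", "major": "Math", "gpa": 3.8},
    {"name": "Dee", "major": "History", "gpa": 3.1},
]

# Hypothetical concept hierarchy for the major attribute.
MAJOR_HIERARCHY = {
    "CS": "Science",
    "Physics": "Science",
    "Math": "Science",
    "History": "Arts",
}

def gpa_band(gpa):
    """Generalize a numeric GPA into a coarse band (cut point is an assumption)."""
    return "excellent" if gpa >= 3.5 else "good"

def induce(rows):
    """Generalize each tuple, then merge identical tuples while accumulating counts."""
    counts = Counter()
    for row in rows:
        # Attribute removal: name has many distinct values and no
        # hierarchy to generalize it with, so it is dropped.
        generalized = (MAJOR_HIERARCHY[row["major"]], gpa_band(row["gpa"]))
        counts[generalized] += 1
    return counts

relation = induce(STUDENTS)
for tuple_, count in relation.items():
    print(tuple_, count)
```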
Concept comparison - Answer-can be performed using the attribute-oriented
induction or data cube approaches in a manner similar to concept characterization.
Generalized tuples from the target and contrasting classes can be quantitatively
compared and contrasted.
Data quality - Answer-is defined in terms of accuracy, completeness, consistency,
timeliness, believability, and interpretability. These qualities are assessed based on
the intended use of the data.
Data cleaning - Answer-routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Data cleaning is
usually performed as an iterative two-step process consisting of discrepancy
detection and data transformation
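One pass of the two-step process might look like the sketch below: first detect discrepancies (missing or out-of-range values), then transform the data to correct them. The age column, the valid range, and the choice to substitute the median of the valid values are assumptions for illustration; other fill strategies are equally possible.

```python
from statistics import median

def detect_discrepancies(values, low, high):
    """Step 1: return the positions of missing or out-of-range values."""
    return [
        i for i, v in enumerate(values)
        if v is None or not (low <= v <= high)
    ]

def transform(values, bad_positions):
    """Step 2: replace flagged values with the median of the remaining ones."""
    bad = set(bad_positions)
    valid = [v for i, v in enumerate(values) if i not in bad]
    fill = median(valid)
    return [fill if i in bad else v for i, v in enumerate(values)]

# Hypothetical age column with one missing value and one implausible value.
ages = [25, None, 31, 250, 28]

flagged = detect_discrepancies(ages, low=0, high=120)
cleaned = transform(ages, flagged)
print(flagged, cleaned)

# The process is iterative: re-run detection until nothing is flagged.
assert detect_discrepancies(cleaned, low=0, high=120) == []
```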
Data integration - Answer-combines data from multiple sources to form a coherent
data store. The resolution of semantic heterogeneity, metadata, correlation analysis,