DATA MINING PIPELINE EXAM QUESTIONS WITH VERIFIED ANSWERS
What are some examples of attribute selection? - Answer-Ø Forward selection
• Keep adding (most informative) attributes
Ø Backward elimination
• Keep removing (least informative) attributes
Ø Feature engineering
• Domain knowledge, decision tree induction, ...
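A minimal sketch of greedy forward selection, assuming a caller-supplied scoring function. The names forward_selection, toy_score and the gain table are hypothetical and only illustrate the "keep adding the most informative attribute" loop; a real pipeline would score subsets with a model or an information measure.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that improves the score the most."""
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        # Pick the candidate whose addition yields the highest score.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute is informative enough
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Hypothetical "informativeness" points per attribute (assumption, not real data).
gain = {"age": 30, "income": 25, "city": 10, "zip": 10, "id": 0}


def toy_score(subset):
    total = sum(gain[a] for a in subset)
    if "city" in subset and "zip" in subset:
        total -= 10  # zip is redundant once city is known
    return total


chosen = forward_selection(list(gain), toy_score)
print(chosen)
```

Backward elimination follows the same pattern in reverse: start from all attributes and repeatedly drop the one whose removal hurts the score the least.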
What are some examples of numerosity reduction? - Answer-Ø Parametric methods
• Assume the data fits a certain model
• Estimate model parameters
• E.g., linear/multi-linear/log-linear regression
Ø Non-parametric methods
• Do not assume a certain model
• Use fewer/smaller data representations
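The two families can be sketched as follows, assuming a tiny made-up dataset. fit_line stores only a slope and an intercept in place of the points (parametric), while the histogram stores bucket counts without assuming any model (non-parametric). All names and values here are illustrative assumptions.

```python
from collections import Counter


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Parametric: ten points are replaced by two model parameters.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)

# Non-parametric: an equal-width histogram keeps only bucket counts.
values = [3, 7, 12, 18, 25, 27]
width = 10
histogram = Counter((v // width) * width for v in values)

print(slope, intercept, dict(histogram))
```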
Describe a data warehouse - Answer-William H. Inmon -- "a subject-oriented,
integrated, time-variant, and nonvolatile collection of data in support of
management's decision-making process."
Describe OLTP & OLAP - Answer-Ø Online Transactional Processing (OLTP)
• Transaction-oriented tasks: bank transfer, purchase, ...
• Daily operations: insert, update, delete
Ø Online Analytical Processing (OLAP)
• Complex queries on historical data
• Data analysis for insights and decision making
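To make the contrast concrete, here is a small sketch using Python's built-in sqlite3 module. The accounts and transfers tables and the transfer helper are assumptions made up for illustration; in practice OLTP and OLAP workloads usually run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE transfers (src TEXT, dst TEXT, amount INTEGER, year INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ana", 100), ("bo", 50)])
conn.commit()


def transfer(src, dst, amount, year):
    """OLTP-style task: a short transaction made of updates and an insert."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("INSERT INTO transfers VALUES (?, ?, ?, ?)", (src, dst, amount, year))


transfer("ana", "bo", 30, 2020)
transfer("bo", "ana", 10, 2021)

# OLAP-style task: an aggregate query over the accumulated history.
yearly = conn.execute(
    "SELECT year, SUM(amount) FROM transfers GROUP BY year ORDER BY year"
).fetchall()
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances, yearly)
```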
Describe facts & dimensions in a data warehouse - Answer-Examples:
Ø Fact: Sales
• Customer, item, time
Ø Dimension: Customer
• Name, address, DOB
Ø Dimension: Time
• Year, month, date
What are some common schemas for data warehousing? - Answer-Ø Star schema
• one fact table, multiple dimension tables
Ø Snowflake schema
• one fact table, multiple levels of dimension tables
Ø Fact constellation schema
• multiple fact tables, shared dimension tables
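A star schema can be sketched with sqlite3 as one fact table whose rows reference dimension tables. The table names, columns, and rows below are hypothetical, chosen to mirror the Sales / Customer / Time example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: numeric measure plus keys into each dimension
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Ana', 'Denver'), (2, 'Bo', 'Austin');
INSERT INTO dim_time VALUES (1, 2020, 1), (2, 2020, 2);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5);
""")

# Join the fact table to a dimension and aggregate the measure.
sales_by_customer = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON f.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(sales_by_customer)
```

A snowflake schema would further split a dimension (for example, moving address into its own table referenced by dim_customer), and a fact constellation would add a second fact table sharing dim_customer or dim_time.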
Describe a data cube - Answer-Ø Multi-dimensional data model
• Dimensions: cube attribute
• E.g., year, product, color
• Facts: numeric measure
• E.g., sales volume/value
What are some data cube operations? - Answer-Ø Roll up: aggregation
• E.g., daily => monthly
Ø Drill down: reverse of roll up
• E.g., North America => USA, Mexico, Canada, ...
Ø Pivot: rotate (visualization)
• E.g., <country, item> => <item, country>
Ø Slicing: select along a single dimension
• E.g., country = "USA"
Ø Dicing: select along multiple dimensions
• E.g., country = "USA", year = "2011 - 2020"
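The operations above can be sketched on a tiny in-memory cube. The rows, the REGION lookup, and the roll_up helper are illustrative assumptions, not part of any OLAP library.

```python
from collections import defaultdict

# Each row: (country, year, item, amount)
sales = [
    ("USA", 2011, "pen", 5),
    ("USA", 2012, "ink", 3),
    ("Canada", 2011, "pen", 2),
    ("Mexico", 2021, "pen", 4),
]
REGION = {"USA": "North America", "Canada": "North America", "Mexico": "North America"}


def roll_up(rows, key):
    """Aggregate the amount measure by the grouping returned from key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Roll up: country level => region level.
by_region = roll_up(sales, key=lambda r: REGION[r[0]])
# Drill down: region level => country level (the finer grouping).
by_country = roll_up(sales, key=lambda r: r[0])
# Pivot: view the cells keyed as <item, country> instead of <country, item>.
pivoted = roll_up(sales, key=lambda r: (r[2], r[0]))
# Slice: fix a single dimension.
slice_usa = [r for r in sales if r[0] == "USA"]
# Dice: restrict several dimensions at once.
dice = [r for r in sales if r[0] == "USA" and 2011 <= r[1] <= 2020]

print(by_region, by_country, pivoted, slice_usa, dice, sep="\n")
```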
What are the different levels of data cube materialization? - Answer-Ø Full materialization
• Pre-compute all cuboids and cells
Ø No materialization
• No precomputation, on-demand
Ø Partial materialization
• (heuristically) pre-compute some cuboids and cells
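As a rough illustration, with n dimensions and no concept hierarchies there are 2^n cuboids (one per subset of dimensions). The sketch below pre-computes all of them for full materialization and only a chosen few for partial materialization; the rows and the choice of which cuboids to keep are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("year", "product", "color")
rows = [
    {"year": 2020, "product": "pen", "color": "red", "sales": 3},
    {"year": 2020, "product": "pen", "color": "blue", "sales": 4},
    {"year": 2021, "product": "ink", "color": "blue", "sales": 5},
]


def cuboid(rows, dims):
    """Group by the given dimensions and sum the sales measure."""
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[d] for d in dims)] += row["sales"]
    return dict(cells)


# Every subset of DIMS, from the apex cuboid () to the base cuboid (all dims).
all_groupings = [c for k in range(len(DIMS) + 1) for c in combinations(DIMS, k)]

# Full materialization: pre-compute every cuboid.
full = {dims: cuboid(rows, dims) for dims in all_groupings}
# Partial materialization: pre-compute only some cuboids (choice is arbitrary here).
partial = {dims: cuboid(rows, dims) for dims in [(), ("year",)]}
# No materialization: call cuboid(rows, dims) on demand at query time.

print(len(full), full[()], partial[("year",)])
```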
Describe ETL Staging - Answer-Ø Extract data from various data sources
Ø Transform data
Ø Load data into the data warehouse
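The three steps can be sketched end to end with the standard library. The two sources (an inline CSV string and a list of dicts standing in for an API response), the cleaning rules, and the sales table are all hypothetical.

```python
import csv
import io
import sqlite3

# Two made-up sources with inconsistent formatting.
CSV_SOURCE = "name,amount\nana, 10\nBO,5\n"
API_SOURCE = [{"name": "Cy", "amount": "7"}]


def extract():
    """Extract: pull raw records from every source."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from API_SOURCE


def transform(record):
    """Transform: normalize names and convert amounts to numbers."""
    return record["name"].strip().title(), float(record["amount"])


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
load(conn, [transform(r) for r in extract()])

loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```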
What are the stages of a data mining pipeline? - Answer-Data understanding
Data preprocessing
Data warehousing
Data modeling
Pattern evaluation
What makes up the central tendency of data? - Answer-Ø Mean
Ø Median
Ø Mode
Ø Midrange
• (Max + Min)/2
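A quick worked example with Python's statistics module, on made-up data. Midrange is computed as the average of the smallest and largest values.

```python
import statistics

data = [2, 4, 4, 5, 10]

mean = statistics.mean(data)            # arithmetic average
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent value
midrange = (max(data) + min(data)) / 2  # average of the two extremes

print(mean, median, mode, midrange)
```

The single large value 10 pulls the mean and midrange upward, while the median and mode stay at 4.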
What is the dispersion of a dataset? - Answer-Ø How much a distribution is stretched
or squeezed
• Range: max - min
• Quartiles: Q1 (25%), Q3 (75%)
• IQR (interquartile range): Q3 - Q1
• Variance
• Standard deviation
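These measures can be computed with the statistics module, as in the sketch below on made-up data. Note that several quartile conventions exist; statistics.quantiles uses its default "exclusive" method here, so other tools may report slightly different Q1 and Q3 values. Population variance and standard deviation are used as an assumption; use statistics.variance and statistics.stdev for sample estimates.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]

value_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # three cut points: Q1, median, Q3
iqr = q3 - q1
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

print(value_range, q1, q3, iqr, variance, std_dev)
```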
What are some approaches to encoding relationships between nominal attributes? -
Answer-Ø Similarity
• s = 1 if x = y; otherwise s = 0
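The rule above translates directly into code. The second function, which averages the per-attribute scores across a whole record, is an assumption added for illustration and is not stated in the answer.

```python
def nominal_similarity(x, y):
    """s = 1 if the two nominal values are equal, otherwise s = 0."""
    return 1 if x == y else 0


def record_similarity(a, b):
    """Assumed extension: fraction of attributes on which two records agree."""
    matches = sum(nominal_similarity(x, y) for x, y in zip(a, b))
    return matches / len(a)


shirt_1 = ("red", "S", "cotton")
shirt_2 = ("red", "M", "cotton")

print(nominal_similarity("red", "red"), nominal_similarity("S", "M"))
print(record_similarity(shirt_1, shirt_2))
```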