ANSWERS (RATED A+)
What are the stages of a data mining pipeline - ANSWERData understanding
Data preprocessing
Data warehousing
Data modeling
Pattern evaluation
Describe the Data Understanding phase - ANSWERØ What types of data?
Ø What do they look like?
Ø Statistics & visualization
Ø Similarity vs. dissimilarity
Ø General patterns vs. anomalies
Describe the Data Preprocessing phase - ANSWERØ Potential issues with data
• E.g., missing data, errors, inconsistency
Ø Preparing data for the mining process
• Data cleaning, integration, transformation, reduction
Ø No good data, no good data mining!
Describe the Data Warehousing phase - ANSWERØ Data warehouse
• vs. operational data
Ø Data cube & OLAP
• Multi-dimensional data management
Ø Data warehouse architecture
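A data cube can be pictured as a measure summed over every combination of dimension values, with OLAP roll-up simply dropping dimensions from the grouping key. The sketch below is a minimal illustration in plain Python; the `facts` rows, the dimension names, and the `roll_up` helper are all made up for the example and are not tied to any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions (region, product, quarter) and one measure (sales).
facts = [
    {"region": "West", "product": "laptop", "quarter": "Q1", "sales": 100},
    {"region": "West", "product": "phone", "quarter": "Q1", "sales": 150},
    {"region": "East", "product": "laptop", "quarter": "Q1", "sales": 80},
    {"region": "East", "product": "laptop", "quarter": "Q2", "sales": 120},
]


def roll_up(rows, dimensions, measure="sales"):
    """Sum the measure for each distinct combination of the given dimensions."""
    cells = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cells[key] += row[measure]
    return dict(cells)


by_region = roll_up(facts, ["region"])                       # coarse view
by_region_product = roll_up(facts, ["region", "product"])    # drill-down view
grand_total = roll_up(facts, [])                             # all dimensions rolled up
```

Grouping by fewer dimensions gives a coarser view (roll-up); adding dimensions back gives a finer one (drill-down).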
Describe the Data Modeling phase - ANSWERØ Frequent pattern analysis
Ø Classification, prediction
Ø Clustering
Ø Anomaly detection
Ø Trend and evolution analysis
Describe the Pattern Evaluation phase - ANSWERØ Finding interesting patterns from
data
• New, valid, generalizable, useful, explainable
Ø Evaluation metrics
• Accuracy, error rate
• False positive/negative rate
• Efficiency, latency, ...
Ø Model selection
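For a binary classifier, the metrics listed above can all be derived from the four confusion-matrix counts. The following is a minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function names and the sample labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluation_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, error rate, and false positive/negative rates."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        # FPR: share of actual negatives predicted positive.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # FNR: share of actual positives predicted negative.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluation_metrics(y_true, y_pred)
```

Efficiency and latency are measured separately (e.g., by timing the model), so they are not covered here.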
What makes up a dataset? - ANSWERØ A collection of data objects
• E.g., employee records, product catalog, online posts
Ø Each described by a number of attributes
• Also referred to as features, dimensions, variables
• E.g., employee: name, gender, age, salary, job title
• E.g., online post: user, time, content, #likes, responses
What are the different attribute types? - ANSWERØ Nominal: categories with no inherent order
• E.g., hair color, job title
Ø Binary: nominal with only two states
• E.g., true/false, test positive/negative
Ø Ordinal: values have a meaningful order, but the gaps between them are not quantified
• E.g., small/medium/large
Ø Numeric: quantitative, measured on a scale
• Interval-scaled: no true zero, e.g., temperature in °C
• Ratio-scaled: true zero, e.g., age, salary
What makes up the central tendency of data? - ANSWERØ Mean
Ø Median
Ø Mode
Ø Midrange
• (Max + Min)/2
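The four measures can be computed with Python's standard `statistics` module, as in this minimal sketch. The `ages` list is made-up sample data, and the midrange is taken as the average of the smallest and largest values.

```python
import statistics

# Made-up sample data for illustration.
ages = [23, 25, 25, 30, 31, 40, 57]

mean = statistics.mean(ages)              # arithmetic average
median = statistics.median(ages)          # middle value of the sorted data
modes = statistics.multimode(ages)        # most frequent value(s); a list, since ties are possible
midrange = (min(ages) + max(ages)) / 2    # average of the two extremes
```

Note how the single large value (57) pulls the midrange and the mean above the median, which is why the median is often preferred for skewed data.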
What is the dispersion of a dataset? - ANSWERØ How much a distribution is stretched
or squeezed
• Range: max - min
• Quartiles: Q1 (25%), Q3 (75%)
• IQR (interquartile range): Q3 - Q1
• Variance
• Standard deviation
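These dispersion measures can be sketched with the standard `statistics` module. Two assumptions are worth flagging: quartile conventions differ between textbooks and tools (the "inclusive" method is used here), and the population forms of variance and standard deviation are used rather than the sample forms. The `values` list is made-up data.

```python
import statistics

# Made-up sample data for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)

# Quartile cut points; other conventions may give slightly different numbers.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Population variance and standard deviation (divide by N, not N - 1).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)
```

For sample statistics, `statistics.variance` and `statistics.stdev` would be used instead.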
What are some approaches to encoding relationships between nominal attributes? -
ANSWERØ Similarity
• s = 1 if x = y; otherwise s = 0
Ø Dissimilarity
• d = 0 if x = y; otherwise d = 1
Ø Customized
• E.g., color: white is more similar to silver than red
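The three approaches can be sketched as small functions. The exact-match rules follow the definitions above; the customized color table and its scores are hypothetical values chosen only so that white ends up closer to silver than to red.

```python
def nominal_similarity(x, y):
    """s = 1 if the two nominal values match, otherwise 0."""
    return 1 if x == y else 0


def nominal_dissimilarity(x, y):
    """d = 0 if the two nominal values match, otherwise 1."""
    return 1 - nominal_similarity(x, y)


# Hypothetical hand-tuned scores; unordered pairs are stored as frozensets.
COLOR_SIMILARITY = {
    frozenset(["white", "silver"]): 0.8,
    frozenset(["white", "red"]): 0.1,
    frozenset(["silver", "red"]): 0.2,
}


def color_similarity(x, y):
    """Customized similarity: identical colors score 1, unknown pairs score 0."""
    if x == y:
        return 1.0
    return COLOR_SIMILARITY.get(frozenset([x, y]), 0.0)
```

A lookup table like this is one simple way to encode domain knowledge that the 0/1 rule cannot express.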
What are some examples of data transformation? - ANSWERØ Smoothing: noise
removal/reduction
Ø Aggregation: e.g., cities => state (n-to-1)
Ø Generalization: e.g., city => state (1-to-1)
Ø Normalization: feature scaling
Ø Discretization: continuous => intervals
Ø Attribute construction from existing ones
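Several of these transformations can be shown on toy data. The sketch below is illustrative only: the city-to-state mapping, the sales records, the age cut-offs, and the moving-average window are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical lookup and records.
CITY_TO_STATE = {"Denver": "CO", "Boulder": "CO", "Austin": "TX"}
sales = [("Denver", 120), ("Boulder", 80), ("Austin", 200)]

# Generalization: each record keeps its identity, but the city is replaced by its state.
generalized = [(CITY_TO_STATE[city], amount) for city, amount in sales]

# Aggregation: records of several cities are combined into one record per state.
totals = defaultdict(int)
for state, amount in generalized:
    totals[state] += amount
aggregated = dict(totals)


def discretize_age(age):
    """Discretization: map a continuous age to an interval label (cut-offs are arbitrary)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


def moving_average(values, window=3):
    """Smoothing: replace each run of `window` values by its mean to damp noise."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def add_density(record):
    """Attribute construction: derive population density from two existing attributes."""
    return {**record, "density": record["population"] / record["area"]}
```

Normalization is omitted here because it is covered by its own question.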
What are some types of normalization? - ANSWERØ Rescaling (Min-max
normalization)
Ø Mean normalization
Ø Standardization (z-score normalization)
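The three schemes differ only in what is subtracted and what is divided by. A minimal sketch follows, assuming the values are not all identical (otherwise the denominators would be zero) and using the population standard deviation for the z-score; the function names and the sample data are illustrative.

```python
import statistics


def min_max(values):
    """Rescaling: (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Mean normalization: (x - mean) / (max - min), centering values on 0."""
    lo, hi = min(values), max(values)
    mean = statistics.mean(values)
    return [(v - mean) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: (x - mean) / std, giving mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]


# Made-up sample data for illustration.
ages = [10, 20, 30, 40, 50]
rescaled = min_max(ages)
centered = mean_normalize(ages)
standardized = z_score(ages)
```

Min-max rescaling is sensitive to outliers because a single extreme value stretches the range; z-scores are usually the more robust choice in that case.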