QUESTIONS & SOLUTIONS
What is KDD, and what are the basics of KDD? - ANSWER: KDD stands for Knowledge
Discovery in Databases. At its core, it is the process of extracting valuable insights,
patterns, and knowledge from large datasets.
What is the KDD pipeline? - ANSWER: Data selection, pre-processing, transformation,
data mining, interpretation/evaluation
Dimensions of Data Mining - ANSWER:
- data to be mined
- knowledge to be mined (data mining functions)
- techniques utilized
- applications adapted
What is a data sample? - ANSWER: A subset of data taken from a larger dataset
What is a dataset? - ANSWER: A collection of related data points or instances
representing all available data.
Different categories of attributes (Categorical) - ANSWER:
Nominal: names of things, categories, or states
Binary: a nominal attribute with only 2 states (0, 1)
Ordinal: values have a meaningful order (ranking)
Different categories of attributes (Numeric) - ANSWER:
Interval: measured on a scale of equal-sized units, with no true zero point
Ratio: has a true zero point, so values can be compared as multiples (orders of
magnitude) of one another
Statistical description of data - ANSWER: Motivation: understand central tendency,
variation, and spread
Data dispersion: median, max, min, quantiles, outliers, variance
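As a rough illustration of the dispersion measures listed above, here is a minimal sketch using Python's standard `statistics` module. The sample values are hypothetical, and the 1.5 * IQR outlier rule is one common convention, assumed here rather than taken from these notes.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 30]

mean = statistics.mean(values)
median = statistics.median(values)
sample_variance = statistics.variance(values)   # divides by n - 1
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range

# Assumed convention: flag points more than 1.5 * IQR beyond the quartiles.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

The single large value pulls the mean well above the median, which is why the median and quartiles are often preferred for skewed data.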
Data Transformation Methods - ANSWER:
- scaling
- logarithmic transformation
- aggregation
- encoding
- binning
- dimensionality reduction
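Four of the transformation methods above (scaling, logarithmic transformation, binning, and encoding) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the data are hypothetical, min-max scaling and one-hot encoding are just one choice for each method, and the helpers assume the input has at least two distinct values.

```python
import math

# Hypothetical raw values, for illustration only.
incomes = [20_000, 35_000, 50_000, 120_000]
colors = ["red", "green", "red", "blue"]

def min_max_scale(xs):
    """Scaling: map values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def log_transform(xs):
    """Logarithmic transformation: compress large positive values."""
    return [math.log10(x) for x in xs]

def equal_width_bins(xs, n_bins):
    """Binning: assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]

def one_hot(labels):
    """Encoding: turn each category into a 0/1 indicator vector."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]

scaled = min_max_scale(incomes)
logged = log_transform(incomes)
binned = equal_width_bins(incomes, 2)
encoded = one_hot(colors)
```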
What is EDA? (Exploratory Data Analysis) - ANSWER: An approach to data analysis that
builds insight into and understanding of the data before formal modeling or hypothesis testing
Motivation of EDA - ANSWER: To explore and summarize the main characteristics,
patterns, and relationships within the data
EDA Methods - ANSWER:
- Descriptive statistics
- Data Visualization
- Correlation Analysis
- Outlier detection
- Missing Data Analysis
- Data Transformation
- Dimensionality Reduction
What is confidence interval estimation? - ANSWER: A statistical technique used to
estimate a range within which a population parameter is likely to lie with a specified
level of confidence.
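A minimal sketch of interval estimation for a population mean follows. It assumes a reasonably large sample so that the normal critical value is an acceptable approximation; for small samples a t critical value would be the usual choice. The function name and the measurements are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumption: uses the normal (z) critical value with the sample
    standard deviation, which is only an approximation for small n.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical measurements, for illustration only.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
low, high = mean_confidence_interval(sample)
```

Raising the confidence level (say, from 0.95 to 0.99) widens the interval, since a larger critical value is needed to cover the parameter more often.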
What is hypothesis testing? - ANSWER: A procedure used to make inferences about a
population from sample data, by checking whether the data provide enough evidence to
reject a null hypothesis
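What is the relationship between confidence interval and hypothesis testing? -
ANSWER: A confidence interval can be used to perform a hypothesis test: a two-sided test
at significance level α rejects the null hypothesis exactly when the hypothesized value
falls outside the (1 − α) confidence interval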
Two sample t-test vs Two sample z-test - ANSWER: Use the t-test when the population
variance is unknown or the sample size is small (n<30); use the z-test when the population
variance is known and the sample size is large (n≥30)
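The difference between the two tests shows up in how the standard error is computed. The sketch below is an illustration only: the function names are hypothetical, and the t statistic uses Welch's unpooled form, which is an assumption since these notes do not say whether variances are pooled.

```python
import math
import statistics

def two_sample_z(a, b, var_a, var_b):
    """z statistic: the population variances var_a and var_b are known."""
    diff = statistics.mean(a) - statistics.mean(b)
    return diff / math.sqrt(var_a / len(a) + var_b / len(b))

def two_sample_t(a, b):
    """Welch's t statistic: variances are estimated from the samples."""
    diff = statistics.mean(a) - statistics.mean(b)
    std_err = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return diff / std_err

# Hypothetical samples, for illustration only.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
t_stat = two_sample_t(group_a, group_b)
```

The statistic is then compared against the t distribution (or the standard normal for the z-test) to obtain a p-value.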
ANOVA (Analysis of Variance) - ANSWER: An inferential statistical test for comparing the
means of three or more groups
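For one-way ANOVA, the test statistic is the ratio of between-group to within-group variability. A minimal sketch, with a hypothetical function name and made-up data:

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA over a list of groups."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of each observation around its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical groups, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
```

A large F value suggests that at least one group mean differs from the others; it does not say which one.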
What is the main procedure of KNN? - ANSWER:
1. Choose the parameter k, the number of nearest neighbors
2. Calculate the distance between the new instance and all the training examples
3. Sort the examples by distance and take the k examples with the smallest distances as
the nearest neighbors
4. Gather the category Y of each nearest neighbor
5. Use the simple majority of the nearest neighbors' categories as the predicted value
of the query instance
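The KNN steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and numeric feature tuples; the function name and training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Step 2: distance from the query to every training example.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: sort by distance and keep the k closest examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 4-5: gather the neighbors' categories and take the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(points, labels, query=(2, 2), k=3)
```

An odd k is a common choice for two-class problems because it avoids tied votes.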
What are decision trees? Execution? - ANSWER: A decision tree uses a flowchart-like tree
structure to make predictions.
Execution:
1. preprocess data
2. split data into training/testing sets
3. train decision tree model on training data
4. evaluate performance on testing data
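To make the four execution steps concrete, here is a deliberately tiny sketch that trains a depth-1 tree (a "decision stump") on one numeric feature. Everything in it is an assumption for illustration: the data are synthetic, the split criterion is Gini impurity, and the train/test split is a fixed every-fourth-row split rather than a random one. A real decision tree would recurse on each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def majority(labels):
    """Most frequent label in the list."""
    return max(set(labels), key=labels.count)

def train_stump(xs, ys):
    """Fit a depth-1 tree: pick the threshold with the lowest weighted Gini."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue  # a split must put examples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, threshold, majority(left), majority(right))
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

# 1. Preprocess: synthetic data, already clean (one numeric feature).
data = [(x, "low" if x < 10 else "high") for x in range(20)]
# 2. Split: every 4th example goes to the test set (fixed split for clarity).
test = data[::4]
train = [row for i, row in enumerate(data) if i % 4 != 0]
# 3. Train on the training data only.
predict = train_stump([x for x, _ in train], [y for _, y in train])
# 4. Evaluate on the held-out test data.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
```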
What is the Naive Bayes classification? Execution? - ANSWER: A classifier based on
Bayes' theorem that assumes all attributes are conditionally independent given the class,
i.e. there is no dependence relation between attributes.
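Because of the independence assumption, the score for a class is just the class prior multiplied by one per-attribute probability for each attribute. The sketch below illustrates this for categorical attributes. The function names and data are hypothetical, and the add-one (Laplace) smoothing is an assumption added so that unseen attribute values do not zero out a class.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and per-class attribute values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)   # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values

def predict_naive_bayes(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # log P(class) + sum of log P(value | class); the sum reflects
        # the conditional independence assumption.
        score = math.log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1  # add-one smoothing
            denominator = count + len(attribute_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes", "no", "yes"]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("sunny", "hot"))
```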