DATA MINING EXAM #2 STUDY GUIDE
QUESTIONS AND ANSWERS
Underfitting - Answer-When there's poor performance on the training data and poor
generalization to other data
What is K in K-fold cross-validation? - Answer-the number of sets (folds) the data is split into
How many sets are needed for Cross Validation - Answer-K
What are the two main techniques for cross-validation? - Answer--K-fold Cross
Validation
-Leave One Out Cross Validation (LOOCV)
Leave One Out Cross Validation(LOOCV) - Answer-a configuration of k-fold cross-
validation where k is set to the number of examples in the dataset
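The cross-validation cards above can be illustrated with a minimal sketch. The helper name `k_fold_splits` is hypothetical (not from any library), and it assumes the examples are split in order without shuffling. Setting k equal to the number of examples gives LOOCV.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_splits(10, 5))   # 5-fold cross-validation on 10 examples
loocv = list(k_fold_splits(10, 10))  # LOOCV: k equals the number of examples
```

Each example lands in the test set of exactly one fold, so every record is used for both training and validation across the K rounds.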
What do you derive from Descriptive Statistics? - Answer-central tendency,
dispersion, skewness
What does Inferential Statistics describe? - Answer--Makes inference about the
population
-uses hypothesis testing and parameter estimation
Null Hypothesis (H0) - Answer-A statement of "no difference."
Alternative Hypothesis (Ha) - Answer-the hypothesis that a proposed result is true for
the population
Three main criteria for categorizing regression - Answer--Number of independent variables
-Shape of the regression line
-Type of dependent variable
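simple - Answer-1 independent variable
multiple - Answer-more than 1 independent variable
ridge - Answer-used when independent variables are highly correlated (L2 penalty)
lasso - Answer-like ridge, but its L1 penalty can shrink coefficients to zero (variable selection)
stepwise - Answer-iterative identification of the best variables
linear - Answer-continuous dependent variable
logistic - Answer-binary dependent variable
nominal - Answer-2+ unordered categories
Poisson - Answer-counts
ordinal - Answer-ordered responses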
Regression - Answer--supervised learning
-Output: continuous quantity
-Aim: Predict a continuous value
Classification goals - Answer--supervised learning
-Output: categorical quantity
-Aim: Compute the category of the data
Clustering - Answer--unsupervised learning
-Output: Assigns data points to clusters
-Aim: group similar items into clusters
Support Vector Machine advantages - Answer--SVM works relatively well when there
is a clear margin of separation between classes.
-SVM is effective in high-dimensional spaces.
Support Vector Machine disadvantages - Answer--SVM algorithm is not suitable for
large data sets, because training time grows quickly with the number of records.
-SVM does not perform very well when the data set is noisy, i.e. target
classes are overlapping.
Hyperplane equation - Answer-w · x + b = 0, where w is the weight (normal) vector, x is the input vector, and b is the bias
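A minimal sketch, assuming the usual textbook form of a separating hyperplane, w · x + b = 0: a point is classified by the sign of w · x + b. The function names and the example values of `w` and `b` are hypothetical, chosen only for illustration.

```python
def decision_value(w, x, b):
    """Return w . x + b; zero means x lies exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Assign class +1 on or above the hyperplane, class -1 below it."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane x1 - x2 = 0 in two dimensions.
w, b = [1.0, -1.0], 0.0
above = classify(w, [2.0, 1.0], b)   # point on the positive side
below = classify(w, [1.0, 2.0], b)   # point on the negative side
```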
Kernel - Answer-a function that takes data as input and transforms it into the
form required for processing, typically so that classes become separable
Types of Kernels - Answer--Gaussian
-Sigmoid
-Polynomial
-Linear
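The four kernel types listed above can be sketched as plain functions of two vectors. This is an illustrative sketch using common textbook forms; the parameter names `gamma`, `degree`, and `coef0` and their default values are assumptions, not values from this guide.

```python
import math


def linear_kernel(x, y):
    """Plain dot product of x and y."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) raised to the given degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared distance); equals 1 when x and y are identical."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    """tanh(gamma * x . y + coef0)."""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)
```

Each function returns a similarity score for a pair of points, which is what lets an SVM separate classes that are not linearly separable in the original features.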
What are the types of Naïve Bayes? - Answer--Bernoulli (binary features; sometimes called binomial)
-Multinomial
-Gaussian
Type I Error - Answer-false positive
Type II error - Answer-false negative
Accuracy - Answer-refers to how close a measured value is to an accepted value; for a
classifier, the percent of correctly classified records (1 - error rate)
Error Rate - Answer-percent of misclassified records out of the total records in the
validation data
Recall - Answer-TP / (TP + FN): the proportion of actual positives that are correctly identified
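The error and metric cards above can be tied together with a small worked example. This is a sketch using the standard confusion-matrix definitions (false positives are Type I errors, false negatives are Type II errors); the function names and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # Type I error (false positive)
        elif a == positive:
            fn += 1          # Type II error (false negative)
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of records classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def error_rate(tp, fp, tn, fn):
    """Share of records misclassified; equals 1 - accuracy."""
    return (fp + fn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up validation labels for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)
```

With these labels there are 3 true positives, 1 false positive, 3 true negatives, and 1 false negative, so accuracy is 0.75, error rate is 0.25, and recall is 0.75.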