DATA MINING: EXAM SET QUESTIONS AND ANSWERS
In a cross-validation plot, if the first value (from left to right) of cross-validation
relative error to fall under the dotted line is the single-node tree, then we may still
want to select a different value of tree size (and therefore a different value of cp)
because: - Answer-We typically do not want all observations to be predicted as
having the same value of the dependent variable.
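A minimal sketch, in R, of picking cp from the cross-validation table instead (assumes a fitted rpart object named tree; the threshold below matches the dotted line that plotcp draws):
cptab <- tree$cptable
# dotted line in plotcp: min(xerror) plus the standard error at that minimum
threshold <- min(cptab[, "xerror"]) + cptab[which.min(cptab[, "xerror"]), "xstd"]
# first row (smallest tree) whose cross-validated error falls under the dotted line
best_row <- min(which(cptab[, "xerror"] < threshold))
best_cp <- cptab[best_row, "CP"]
# if best_row is the single-node tree, consider the next candidate row instead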
Imagine that we used the following loss matrix:
matrix(c(0,1,4,0), nrow=2, ncol=2, byrow=FALSE)
Then: - Answer-We are saying that False Positives are 4 times more costly to us
than False Negatives
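To see why: with byrow=FALSE the values fill column by column, so row 1 is (0, 4) and row 2 is (1, 0); in rpart's convention, entry [i, j] is the cost of predicting class j when the actual class is i, so [1, 2] = 4 penalizes false positives and [2, 1] = 1 penalizes false negatives. A sketch of passing it to rpart (dataset and formula are made up for illustration):
library(rpart)
loss_mat <- matrix(c(0, 1, 4, 0), nrow = 2, ncol = 2, byrow = FALSE)
# row = actual class, column = predicted class:
#      [,1] [,2]
# [1,]    0    4   <- false positive costs 4
# [2,]    1    0   <- false negative costs 1
tree <- rpart(default ~ ., data = train.dat, method = "class",
              parms = list(loss = loss_mat))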
How do random forests ensure diversity in observations (that is, in rows of data)? -
Answer-By resampling using bootstrapping
How do random forests ensure diversity in explanatory variables? - Answer-By
randomly selecting which variables to use in each split
The Out-of-Bag (OOB) error gives us a sense of how the model performs in out-of-
sample data because: - Answer-For each row, it is computed using only trees that
did NOT use the data from that row
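A minimal sketch tying these three ideas together with the randomForest package (formula, data, and parameter values are illustrative):
library(randomForest)
set.seed(1)
rf <- randomForest(sales ~ ., data = train.dat,
                   ntree = 500, # each tree is grown on a bootstrap resample of the rows
                   mtry = 3)    # variables randomly considered at each split
rf  # printed summary includes the OOB error estimate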
When R squared is much larger than OSR squared, this is a sign of: - Answer-
overfitting
what does sample.split do? - Answer-ensures that the dependent variable is similarly
distributed in both the training and test sets
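Typical usage with the caTools package (variable names are illustrative):
library(caTools)
set.seed(123)
split <- sample.split(dat$y, SplitRatio = 0.7) # keeps y's distribution similar in both sets
train.dat <- subset(dat, split == TRUE)
test.dat  <- subset(dat, split == FALSE)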
Two main types of outcome data: - Answer-- continuous (age, price, income):
regression
- categorical (gender, debt default, brand): classification
R squared - Answer-gives a sense of how good our model is (higher is better)
- explained variance / total variance
OSR squared - Answer-how the model that we built on the training set performs on
out-of-sample data (test set)
- 1 - SSE/SST
- SSE: sum of squared differences between what our model predicts (pred_vals) and
what actually happened
~ SSE = sum((test.dat$pred_vals - test.dat$sales)^2)
- SST: sum of squared differences between our benchmark model (the mean of the
dependent variable in the training set) and what actually happened
~ SST = sum((train.mean - test.dat$sales)^2)
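Putting the pieces together, a sketch of the full OSR squared computation (names follow the snippets above):
train.mean <- mean(train.dat$sales)                  # benchmark: training-set mean
SSE <- sum((test.dat$pred_vals - test.dat$sales)^2)  # model errors on the test set
SST <- sum((train.mean - test.dat$sales)^2)          # benchmark errors on the test set
OSR2 <- 1 - SSE/SST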
If the histogram looks very skewed, you can: - Answer-apply a transformation to the
dependent variable
- e.g., take the logarithm using the log() function (base e, not base 10)
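For example (variable names are illustrative):
hist(train.dat$sales)                             # strongly right-skewed?
lin_train <- lm(log(sales) ~ ., data = train.dat) # model log(sales) instead of sales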
checking for multicollinearity - Answer-- only worry about correlations among
explanatory variables
- cor() only handles continuous (numeric) variables
- cor(dat) for the correlation matrix
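For example, restricting the matrix to the numeric columns first:
num_vars <- dat[, sapply(dat, is.numeric)] # cor() needs continuous variables
cor(num_vars)                              # pairwise correlation matrix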
predict function: - Answer-estimate the predicted values on the test set
- test.dat$pred_vals = predict(lin_train, newdata=test.dat)
- type="response" is how to get predicted probabilities from a logistic regression (glm) model
logistic regression goal - Answer-find best estimates of B0, B1, B2...Bk
- Phat(Y=1) = 1 / (1 + e^-(B0 + B1x1 + ... + Bkxk))
Log-odds ratio - Answer-log(Odds(Y=1)) = B0 + B1x1 + ... + Bkxk
- linear model that predicts the log-odds of success
- odds = chance of success / chance of failure = p/(1-p)
- as p approaches 1, odds approach infinity
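A minimal sketch of fitting such a model with glm (variables are made up; the dependent variable is assumed coded 0/1):
log_train <- glm(default ~ income + age, data = train.dat, family = "binomial")
summary(log_train) # coefficients are on the log-odds scale
test.dat$pred_probs <- predict(log_train, newdata = test.dat, type = "response")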
False positive - Answer-predicted Yhat=1 and actual observed value is Y=0
False negative - Answer-predicted Yhat=0 and actual observed value is Y=1
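Both error types can be read off a confusion matrix, e.g. (assuming pred_probs from the sketch above and a 0.5 cutoff):
pred_class <- as.integer(test.dat$pred_probs > 0.5) # Yhat = 1 if probability > 0.5
table(actual = test.dat$default, predicted = pred_class)
# actual 0 / predicted 1 cell = false positives
# actual 1 / predicted 0 cell = false negatives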
Given a Regression Tree and a new observation that falls in a certain region of the
predictor space (let's call it "Region X"), we predict the value of the dependent
variable as being equal to - Answer-the average value of the dependent variable in
Region X.
We typically avoid building an excessively large regression tree because, while it
may perform well on the training set, it is likely not to perform very well on new data
(the test set). This idea is best captured by which of the following concepts? -
Answer-overfitting
As the value of the complexity parameter (cp) increases, which of the following can
NOT occur? - Answer-The size of the tree increases
- the larger the cp, the more improvement each split must provide --> as cp increases,
the same or fewer splits survive pruning, so the tree size stays the same or decreases
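A quick illustration with prune (cp values are arbitrary):
small_tree <- prune(tree, cp = 0.05)  # higher cp: same or fewer splits survive
big_tree   <- prune(tree, cp = 0.001) # lower cp: keeps more of the original splits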
When building a tree using the function "rpart", we specify method="anova" to: -
Answer-Tell R that we want to build a regression tree, as opposed to, for example, a
classification tree (method= "class")
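For example (formula and data are illustrative):
library(rpart)
reg_tree   <- rpart(sales ~ ., data = train.dat, method = "anova")   # regression tree
class_tree <- rpart(default ~ ., data = train.dat, method = "class") # classification tree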
Which of the following statements is true about the differences between the functions
printcp and plotcp? - Answer--printcp presents results by number of splits
-plotcp presents results by size of tree (number of leaves)
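Both take the fitted tree as their argument, e.g.:
printcp(reg_tree) # table with CP, nsplit (number of splits), rel error, xerror, xstd
plotcp(reg_tree)  # x-axis: size of tree = number of leaves (nsplit + 1)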