Exam (elaborations)

DATA MINING: EXAM SET QUESTIONS AND ANSWERS

Pages
6
Uploaded on
26-03-2025
Written in
2024/2025

Institution
DATA MINING
Course
DATA MINING









DATA MINING: EXAM SET QUESTIONS
AND ANSWERS
In a cross-validation plot, if the first tree size (from left to right) whose cross-validation
relative error falls under the dotted line is the single-node tree, then we may still
want to select a different tree size (and therefore a different value of cp)
because: - Answer-We typically do not want all observations to be predicted as
having the same value of the dependent variable.

Imagine that we used the following loss matrix:
matrix(c(0,1,4,0), nrow=2, ncol=2, byrow=FALSE)

Then: - Answer-We are saying that False Positives are 4 times more costly to us
than False Negatives
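
A minimal sketch of how that matrix is laid out. It assumes the rpart convention that rows are actual classes and columns are predicted classes, and that the first class (0) is the negative one; the row and column labels are added here for illustration only.

```r
# byrow=FALSE fills column by column: column 1 gets (0, 1), column 2 gets (4, 0)
loss <- matrix(c(0, 1, 4, 0), nrow = 2, ncol = 2, byrow = FALSE)

# Illustrative labels, assuming rows = actual class and columns = predicted class
dimnames(loss) <- list(actual = c("0", "1"), predicted = c("0", "1"))
print(loss)

fp_cost <- loss["0", "1"]  # actual 0, predicted 1: False Positive
fn_cost <- loss["1", "0"]  # actual 1, predicted 0: False Negative
fp_cost / fn_cost          # ratio of False Positive cost to False Negative cost
```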

How do random forests ensure diversity in observations (that is, in rows of data)? -
Answer-By resampling using bootstrapping

How do random forests ensure diversity in explanatory variables? - Answer-By
randomly selecting which variables are considered as candidates at each split

The Out-of-Bag (OOB) error gives us a sense of how the model performs in out-of-
sample data because: - Answer-For each row, it is computed using only trees that
did NOT use the data from that row
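
The three random forest ideas above can be sketched in base R for a single tree. This is only an illustration of the sampling steps, not a random forest implementation; the number of rows and the variable names are hypothetical.

```r
set.seed(1)
n <- 10                                         # hypothetical number of rows

# Diversity in rows: each tree is grown on a bootstrap sample (with replacement)
in_bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree, so this tree can be used
# to predict them without having seen them during training
oob <- setdiff(seq_len(n), in_bag)

# Diversity in variables: each split only considers a random subset
vars <- c("x1", "x2", "x3", "x4")               # hypothetical variable names
split_candidates <- sample(vars, size = 2)

sort(unique(in_bag))
oob
split_candidates
```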

When R squared is much larger than OSR squared, this is a sign of: - Answer-
overfitting

What does sample.split do? - Answer-splits the data into training and test sets while
ensuring that the dependent variable is similarly distributed in both sets

Two main types of outcome data: - Answer-- continuous (age, price, income):
regression
- categorical (gender, debt default, brand): classification

R squared - Answer-gives a sense of how good our model is; higher is better
- explained variance / total variance

OSR squared - Answer-how the model that we built on the training set performs on
out-of-sample data (test set)
- OSR squared = 1 - SSE/SST
- SSE: sum of squared differences between what our model predicts (pred_vals) and
what actually happened
~ SSE = sum((test.dat$pred_vals - test.dat$sales)^2)
- SST: sum of squared differences between our benchmark model (mean value in the
training set) and what actually happened
~ SST = sum((train.mean - test.dat$sales)^2)
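
A runnable sketch of the OSR squared calculation on synthetic data. The names test.dat, pred_vals, sales, and train.mean follow the notes; the price column, the 70/30 split, and the data itself are assumptions made for illustration.

```r
set.seed(42)
# Synthetic data: sales depends linearly on a hypothetical price variable
dat <- data.frame(price = runif(100, 1, 10))
dat$sales <- 50 - 3 * dat$price + rnorm(100)

train.dat <- dat[1:70, ]
test.dat  <- dat[71:100, ]

# Fit on the training set only, then predict on the test set
lin_train <- lm(sales ~ price, data = train.dat)
test.dat$pred_vals <- predict(lin_train, newdata = test.dat)

# Benchmark model: always predict the training-set mean
train.mean <- mean(train.dat$sales)

SSE <- sum((test.dat$pred_vals - test.dat$sales)^2)
SST <- sum((train.mean - test.dat$sales)^2)
OSR2 <- 1 - SSE / SST
OSR2
```

Comparing OSR2 with summary(lin_train)$r.squared is one way to check for the overfitting pattern mentioned earlier.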

If the histogram looks very skewed, you can: - Answer-apply a transformation to the
dependent variable
- ex. take the natural logarithm using the log( ) function (base e, not 10)

Checking for multicollinearity - Answer-- only worry about correlations among
explanatory variables
- cor( ) only handles numeric (continuous) variables
- cor(dat) for correlation matrix
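
A small sketch of that check on made-up data. The column names are hypothetical, and the numeric columns are selected first because cor( ) stops with an error when given a factor column.

```r
set.seed(7)
dat <- data.frame(x1 = rnorm(50))
dat$x2 <- dat$x1 + rnorm(50, sd = 0.1)   # nearly a copy of x1
dat$x3 <- rnorm(50)                      # unrelated to x1 and x2
dat$region <- factor(sample(c("a", "b"), 50, replace = TRUE))  # categorical

# Keep only the numeric explanatory variables before calling cor()
num_cols <- sapply(dat, is.numeric)
cor_mat <- cor(dat[, num_cols])
round(cor_mat, 2)
```

A correlation between x1 and x2 this close to 1 is the kind of value that would flag multicollinearity.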

predict function: - Answer-estimates the predicted values on the test set
- test.dat$pred_vals = predict(lin_train, newdata=test.dat)
- for logistic regression, type= "response" is how to get predicted probabilities

Logistic regression goal - Answer-find best estimates of B0, B1, B2...Bk
- Phat(Y=1) = 1 / (1 + e^-(B0 + B1x1 + ... + Bkxk))
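
A sketch of the logistic formula on synthetic data, assuming a single explanatory variable x1 and a model object named log_train (both names are illustrative). It compares the probability computed by hand from the estimated coefficients with what predict( ) returns when type= "response".

```r
set.seed(3)
x1 <- rnorm(200)
# Simulate a 0/1 outcome whose true log-odds are -0.5 + 1.5 * x1
y <- rbinom(200, size = 1, prob = 1 / (1 + exp(-(-0.5 + 1.5 * x1))))
dat <- data.frame(x1 = x1, y = y)

log_train <- glm(y ~ x1, data = dat, family = binomial)
b <- coef(log_train)                     # estimates of B0 and B1

# Probability from the logistic formula, by hand
p_formula <- 1 / (1 + exp(-(b[1] + b[2] * dat$x1)))

# Same probability from predict() on the response scale
p_predict <- predict(log_train, newdata = dat, type = "response")

head(cbind(p_formula, p_predict))
```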

Log-odds ratio - Answer-log(Odds(Y=1)) = B0 + B1x1 + ... + Bkxk
- linear model that predicts the log-odds of success
- odds = chance of success/chance of failure
- as p approaches 1, odds approach infinity
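
The relationship between probability, odds, and log-odds can be checked numerically. The probabilities below are arbitrary example values.

```r
p <- c(0.5, 0.9, 0.99, 0.999)

odds <- p / (1 - p)        # chance of success / chance of failure
log_odds <- log(odds)      # what the linear model B0 + B1x1 + ... predicts

# Inverting the log-odds with the logistic function recovers p
p_back <- 1 / (1 + exp(-log_odds))

data.frame(p, odds, log_odds, p_back)
```

The odds column grows without bound as p gets closer to 1, while p = 0.5 corresponds to odds of 1 and log-odds of 0.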

False positive - Answer-predicted Yhat=1 and actual observed value is Y=0

False negative - Answer-predicted Yhat=0 and actual observed value is Y=1
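
Both definitions can be read off a confusion matrix. The actual and predicted vectors below are made up for illustration.

```r
actual    <- c(0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0, 1, 1, 0, 1, 0, 0, 1)

# Rows are actual values, columns are predicted values
conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

FP <- conf_mat["0", "1"]   # predicted 1, actual 0
FN <- conf_mat["1", "0"]   # predicted 0, actual 1
```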

Given a Regression Tree and a new observation that falls in a certain region of the
predictor space (let's call it "Region X"), we predict the value of the dependent
variable as being equal to - Answer-the average value of the dependent variable in
Region X.
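
A sketch that checks this on synthetic data, assuming the rpart package is installed. It groups the training rows by the leaf they fall in and compares each leaf's average of the dependent variable with the tree's prediction; the variable names are illustrative.

```r
library(rpart)

set.seed(11)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- ifelse(dat$x < 5, 10, 20) + rnorm(200)

tree <- rpart(y ~ x, data = dat, method = "anova")

# tree$where records which leaf (region) each training row ended up in
leaf_id <- tree$where

# Average of y within each region, repeated for every row in that region
leaf_mean <- ave(dat$y, leaf_id)

# The tree's prediction for each row
pred <- predict(tree, newdata = dat)

head(cbind(leaf_id, leaf_mean, pred))
```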

We typically avoid building an excessively large regression tree because, while it
may perform well on the training set, it is likely not to perform very well on new data
(the test set). This idea is best captured by which of the following concepts? -
Answer-overfitting

As the value of the complexity parameter (cp) increases, which of the following can
NOT occur? - Answer-The size of the tree increases
- the larger the cp, the more is demanded of each split, so as cp increases the same or fewer
splits survive pruning and the tree size stays the same or decreases
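
A sketch of that behavior, assuming the rpart package is installed. It grows one large tree on synthetic data, prunes it at increasing cp values, and counts the leaves each time; the data, the cp values, and the helper count_leaves are all illustrative.

```r
library(rpart)

set.seed(5)
dat <- data.frame(x1 = runif(300), x2 = runif(300))
dat$y <- 5 * dat$x1 + sin(6 * dat$x2) + rnorm(300, sd = 0.3)

# Grow a large tree with a small cp, then prune it back at larger cp values
full_tree <- rpart(y ~ x1 + x2, data = dat, method = "anova", cp = 0.001)

# Hypothetical helper: leaves are the rows of tree$frame marked "<leaf>"
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

cps <- c(0.001, 0.01, 0.05, 0.2, 1)
leaves <- sapply(cps, function(cp) count_leaves(prune(full_tree, cp = cp)))

data.frame(cp = cps, leaves = leaves)
```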

When building a tree using the function "rpart", we specify method="anova" to: -
Answer-Tell R that we want to build a regression tree, as opposed to, for example, a
classification tree (method= "class")

Which of the following statements is true about the differences between the functions
printcp and plotcp? - Answer-- printcp presents results by number of splits
- plotcp presents results by number of leaves
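
A sketch of how the two views relate, assuming the rpart package is installed and using synthetic data. The table shown by printcp is stored in the fitted object's cptable, whose nsplit column counts splits; in a binary tree the number of leaves is one more than the number of splits, which is the "size of tree" that plotcp shows on its top axis.

```r
library(rpart)

set.seed(5)
dat <- data.frame(x1 = runif(300), x2 = runif(300))
dat$y <- 5 * dat$x1 + sin(6 * dat$x2) + rnorm(300, sd = 0.3)

tree <- rpart(y ~ x1 + x2, data = dat, method = "anova")

printcp(tree)                      # table indexed by nsplit

cp_tab <- tree$cptable
n_splits <- cp_tab[, "nsplit"]     # what printcp reports
n_leaves <- n_splits + 1           # what plotcp labels as size of tree

cbind(CP = cp_tab[, "CP"], n_splits, n_leaves)
# plotcp(tree) draws the same cross-validation results against tree size
```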

biggdreamer Havard School