DATA MINING: EXAM SET QUESTIONS AND ANSWERS
In a cross-validation plot, if the first value (from left to right) of cross-validation
relative error to fall under the dotted line is the single-node tree, then we may still
want to select a different value of tree size (and therefore a different value of cp)
because: - Answer-We typically do not want all observations to be predicted as
having the same value of the dependent variable.
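A minimal sketch, in R, of picking cp from the cross-validation table instead (assumes a fitted rpart object named tree; the threshold below matches the dotted line that plotcp draws):
cptab <- tree$cptable
# dotted line in plotcp: min(xerror) plus the standard error at that minimum
threshold <- min(cptab[, "xerror"]) + cptab[which.min(cptab[, "xerror"]), "xstd"]
# first row (smallest tree) whose cross-validated error falls under the dotted line
best_row <- min(which(cptab[, "xerror"] < threshold))
best_cp <- cptab[best_row, "CP"]
# if best_row is the single-node tree, consider the next candidate row instead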
Imagine that we used the following loss matrix:
matrix(c(0,1,4,0), nrow=2, ncol=2, byrow=FALSE)
Then: - Answer-We are saying that False Positives are 4 times more costly to us
than False Negatives
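To see why: with byrow=FALSE the values fill column by column, so row 1 is (0, 4) and row 2 is (1, 0); in rpart's convention, entry [i, j] is the cost of predicting class j when the actual class is i, so [1, 2] = 4 penalizes false positives and [2, 1] = 1 penalizes false negatives. A sketch of passing it to rpart (dataset and formula are made up for illustration):
library(rpart)
loss_mat <- matrix(c(0, 1, 4, 0), nrow = 2, ncol = 2, byrow = FALSE)
# row = actual class, column = predicted class:
#      [,1] [,2]
# [1,]    0    4   <- false positive costs 4
# [2,]    1    0   <- false negative costs 1
tree <- rpart(default ~ ., data = train.dat, method = "class",
              parms = list(loss = loss_mat))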
How do random forests ensure diversity in observations (that is, in rows of data)? -
Answer-By resampling using bootstrapping
How do random forests ensure diversity in explanatory variables? - Answer-By
randomly selecting which variables to use in each split
The Out-of-Bag (OOB) error gives us a sense of how the model performs in out-of-
sample data because: - Answer-For each row, it is computed using only trees that
did NOT use the data from that row
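A minimal sketch tying these three ideas together with the randomForest package (formula, data, and parameter values are illustrative):
library(randomForest)
set.seed(1)
rf <- randomForest(sales ~ ., data = train.dat,
                   ntree = 500, # each tree is grown on a bootstrap resample of the rows
                   mtry = 3)    # variables randomly considered at each split
rf  # printed summary includes the OOB error estimate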
When R squared is much larger than OSR squared, this is a sign of: - Answer-
overfitting
what does sample.split do? - Answer-ensures that the dependent variable is similarly
distributed in both the training and test sets
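Typical usage with the caTools package (variable names are illustrative):
library(caTools)
set.seed(123)
split <- sample.split(dat$y, SplitRatio = 0.7) # keeps y's distribution similar in both sets
train.dat <- subset(dat, split == TRUE)
test.dat  <- subset(dat, split == FALSE)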
Two main types of outcome data: - Answer-- continuous (age, price, income):
regression
- categorical (gender, debt default, brand): classification
R squared - Answer-gives a sense of how good our model is (higher is better)
- explained variance / total variance
OSR squared - Answer-how the model that we built on the training set performs on
out-of-sample data (test set)
- 1 - SSE/SST
- SSE: sum of squared differences between what our model predicts (pred_vals) and
what actually happened
~ SSE = sum((test.dat$pred_vals - test.dat$sales)^2)
- SST: sum of squared differences between our benchmark model (the mean of the
dependent variable in the training set) and what actually happened
~ SST = sum((train.mean - test.dat$sales)^2)
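Putting the pieces together, a sketch of the full OSR squared computation (names follow the snippets above):
train.mean <- mean(train.dat$sales)                  # benchmark: training-set mean
SSE <- sum((test.dat$pred_vals - test.dat$sales)^2)  # model errors on the test set
SST <- sum((train.mean - test.dat$sales)^2)          # benchmark errors on the test set
OSR2 <- 1 - SSE/SST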
If the histogram looks very skewed, you can: - Answer-apply a transformation to the
dependent variable
- e.g., take the logarithm using the log() function (base e, not base 10)
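For example (variable names are illustrative):
hist(train.dat$sales)                             # strongly right-skewed?
lin_train <- lm(log(sales) ~ ., data = train.dat) # model log(sales) instead of sales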
checking for multicollinearity - Answer-- only worry about correlations among
explanatory variables
- cor() only handles continuous (numeric) variables
- cor(dat) for the correlation matrix
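For example, restricting the matrix to the numeric columns first:
num_vars <- dat[, sapply(dat, is.numeric)] # cor() needs continuous variables
cor(num_vars)                              # pairwise correlation matrix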
predict function: - Answer-estimate the predicted values on the test set
- test.dat$pred_vals = predict(lin_train, newdata=test.dat)
- type="response" is how to get predicted probabilities from a logistic regression (glm) model
logistic regression goal - Answer-find best estimates of B0, B1, B2...Bk
- Phat(Y=1) = 1 / (1 + e^-(B0 + B1x1 + ... + Bkxk))
Log-odds ratio - Answer-log(Odds(Y=1)) = B0 + B1x1 + ... + Bkxk
- linear model that predicts the log-odds of success
- odds = chance of success / chance of failure = p/(1-p)
- as p approaches 1, odds approach infinity
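A minimal sketch of fitting such a model with glm (variables are made up; the dependent variable is assumed coded 0/1):
log_train <- glm(default ~ income + age, data = train.dat, family = "binomial")
summary(log_train) # coefficients are on the log-odds scale
test.dat$pred_probs <- predict(log_train, newdata = test.dat, type = "response")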
False positive - Answer-predicted Yhat=1 and actual observed value is Y=0
False negative - Answer-predicted Yhat=0 and actual observed value is Y=1
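Both error types can be read off a confusion matrix, e.g. (assuming pred_probs from the sketch above and a 0.5 cutoff):
pred_class <- as.integer(test.dat$pred_probs > 0.5) # Yhat = 1 if probability > 0.5
table(actual = test.dat$default, predicted = pred_class)
# actual 0 / predicted 1 cell = false positives
# actual 1 / predicted 0 cell = false negatives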
Given a Regression Tree and a new observation that falls in a certain region of the
predictor space (let's call it "Region X"), we predict the value of the dependent
variable as being equal to - Answer-the average value of the dependent variable in
Region X.
We typically avoid building an excessively large regression tree because, while it
may perform well on the training set, it is likely not to perform very well on new data
(the test set). This idea is best captured by which of the following concepts? -
Answer-overfitting
As the value of the complexity parameter (cp) increases, which of the following can
NOT occur? - Answer-The size of the tree increases
- the larger the cp, the more improvement each split must provide --> as cp increases,
the same or fewer splits survive pruning, so the tree size stays the same or decreases
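A quick illustration with prune (cp values are arbitrary):
small_tree <- prune(tree, cp = 0.05)  # higher cp: same or fewer splits survive
big_tree   <- prune(tree, cp = 0.001) # lower cp: keeps more of the original splits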
When building a tree using the function "rpart", we specify method="anova" to: -
Answer-Tell R that we want to build a regression tree, as opposed to, for example, a
classification tree (method= "class")
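For example (formula and data are illustrative):
library(rpart)
reg_tree   <- rpart(sales ~ ., data = train.dat, method = "anova")   # regression tree
class_tree <- rpart(default ~ ., data = train.dat, method = "class") # classification tree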
Which of the following statements is true about the differences between the functions
printcp and plotcp? - Answer--printcp presents results by number of splits
-plotcp presents results by size of tree (number of leaves)
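Both take the fitted tree as their argument, e.g.:
printcp(reg_tree) # table with CP, nsplit (number of splits), rel error, xerror, xstd
plotcp(reg_tree)  # x-axis: size of tree = number of leaves (nsplit + 1)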