Tutorial 9: Regression Continued
Regression learning objectives:
Recognize situations where a simple regression analysis would be appropriate for making predictions.
Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
Interpret the output of a k-nn regression.
In a dataset with two variables, perform k-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
Using R, execute cross-validation to choose the number of neighbours.
Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean squared prediction error).
In a dataset with > 2 variables, perform k-nn regression in R using tidymodels to predict the values for a test dataset.
In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
Describe advantages and disadvantages of the k-nearest neighbour regression approach.
Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
In R, overlay the ordinary least squares regression lines from geom_smooth on a single plot.
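As a warm-up for the k-nn objectives above, here is a minimal base R sketch of the idea behind k-nn regression: a prediction is the mean response of the k closest training observations (k-nn classification would take a majority vote instead). The toy values and the helper name knn_predict are made up for illustration and are not part of the worksheet or of tidymodels.

```r
# Toy training data (made-up values, not taken from the Credit data set)
train_x <- c(1, 2, 3, 4, 5)
train_y <- c(10, 20, 30, 40, 50)

# Hypothetical helper: predict the response at new_x as the mean response
# of the k training points whose x values are closest to new_x
knn_predict <- function(new_x, k) {
  nearest <- order(abs(train_x - new_x))[1:k]
  mean(train_y[nearest])
}

# The three nearest neighbours of 2.9 are x = 3, 2 and 4,
# so the prediction is mean(30, 20, 40)
knn_predict(2.9, k = 3)
# → 30
```

In the rest of the worksheet, tidymodels handles the neighbour search; this sketch only shows the averaging step.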
In [ ]:
### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)
source("tests.R")
source("cleanup.R")
Predicting credit card balance
Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif
In this worksheet we will work with a simulated data set containing information that we can use to build a model that predicts customer credit card balance. A bank might use such a model to identify which customers are likely to be the most profitable to lend to (for example, customers who carry a balance but do not default).
Specifically, we wish to build a model to predict credit card balance (the Balance column) based on income (the Income column) and credit rating (the Rating column).
We access this data set through ISLR, an R data package that we loaded at the beginning of the worksheet. Loading that package gives access to a variety of data sets, including the Credit data set that we will be working with.
In [ ]:
Credit
Question 1.1
{points: 1}
Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable) and use the as_tibble
function to convert it to a tibble (it is currently a base R data frame). Name the modified data frame credit (using a lowercase c).
Note: We could alternatively leave all of the variables in and use the recipe formula later on to specify our predictors and response. But for this worksheet, let's select the relevant columns first.
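The alternative mentioned in the note can be sketched as follows. This is only an illustration, assuming the full Credit data frame from ISLR; the object name credit_recipe_sketch is hypothetical and is not used elsewhere in the worksheet.

```r
library(tidymodels)
library(ISLR)

# Keep every column of Credit and let the formula declare the response
# (left of ~) and the predictors (right of ~)
credit_recipe_sketch <- recipe(Balance ~ Income + Rating, data = Credit)

# summary() on a recipe lists each variable named in the formula and its role
summary(credit_recipe_sketch)
```

Only the variables named in the formula take part in the recipe, so the remaining columns of Credit are ignored.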
In [ ]:
### BEGIN SOLUTION
credit <- Credit %>%
    select(Balance, Income, Rating) %>%
    as_tibble()
### END SOLUTION
credit
In [ ]:
test_1.1()
Question 1.2
{points: 1}
Before we perform exploratory data analysis, we should create our training and testing data sets. First, split the credit data set. Use 60% of the data for training and set the variable we want to predict as the strata argument. Assign your answer to an object called credit_split.
Then assign your training data set to an object called credit_training and your testing data set to an object called credit_testing.
In [ ]:
set.seed(2000)
### BEGIN SOLUTION
credit_split <- initial_split(credit, prop = 0.6, strata = Balance)
credit_training <- training(credit_split)
credit_testing <- testing(credit_split)
### END SOLUTION
In [ ]:
test_1.2()
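One way to sanity-check a stratified split like the one in Question 1.2 is to look at the share of rows in the training set and compare the distribution of the response across the two sets. The sketch below rebuilds the split from scratch so that it runs on its own; the names credit_sketch and split_sketch are hypothetical.

```r
library(tidymodels)
library(ISLR)

set.seed(2000)

# Rebuild the three-column tibble and the 60/40 stratified split
credit_sketch <- as_tibble(Credit) %>%
    select(Balance, Income, Rating)
split_sketch <- initial_split(credit_sketch, prop = 0.6, strata = Balance)

# Fraction of rows that landed in the training set (should be close to 0.6)
nrow(training(split_sketch)) / nrow(credit_sketch)

# Stratifying on Balance should give both sets a similar spread of balances
summary(training(split_sketch)$Balance)
summary(testing(split_sketch)$Balance)
```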
Question 1.3
{points: 1}
Using only the observations in the training data set, use ggpairs to create a scatterplot matrix of all the columns we are interested in including in our model. Name the plot object credit_eda.
In [ ]:
### BEGIN SOLUTION
options(repr.plot.height = 10, repr.plot.width = 15)
credit_eda <- credit_training %>%
    ggpairs(mapping = aes(alpha = 0.4)) +
    theme(text = element_text(size = 20))
### END SOLUTION
credit_eda
In [ ]:
test_1.3()