Homework #1
2025-01-13
knitr::opts_chunk$set(echo = TRUE)
# Load required package
library(kernlab)
Assignment #1 Due January 16
## Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
I work as a Senior Financial Analyst in the University of Pennsylvania’s Division of Finance. One example
would be the school wanting to predict which students are at risk of defaulting on their tuition payment
plans. A classification model could help the university proactively identify at-risk students and intervene
with financial counseling or alternative payment options.
Predictors:
1. Family Income Level: the reported family income from the student’s financial aid application.
2. Payment History: past behavior in making tuition payments on time (e.g., consistent, delayed, or missed payments).
3. Financial Aid Received: the percentage of tuition covered by scholarships, grants, and loans versus out-of-pocket payment.
4. Enrollment Status: whether the student is enrolled full-time, part-time, or has changed enrollment status mid-semester.
5. Extracurricular Commitments: the number of hours a student spends on work-study jobs or other paid activities, which might indicate financial strain.
This model could enhance financial aid services by allowing targeted support for students most likely to face
financial difficulties, improving retention rates and student satisfaction.
## Question 2.2
1. Using the support vector machine function ksvm contained in
the R package kernlab, find a good classifier for this data. Show
the equation of your classifier, and how well it classifies the data
points in the full data set. (Don’t worry about test/validation data
yet; we’ll cover that topic soon.)
The optimal C value found by the for-loop is C = 10. With C = 10, the model predicts “Yes” for a reasonable proportion of the data (53.7%), and its accuracy on the full data set is relatively good at 86.4%.
The equation of this classifier (in terms of the scaled predictors) is:
-0.0009033671*V1 - 0.0007891036*V2 - 0.0016972133*V3 + 0.0026113628*V4 + 1.0050221406*V5 - 0.0028363016*V6 - 0.0001569285*V7 - 0.0003925964*V8 - 0.0012784443*V9 + 0.1064387167*V10 + 0.08157559 = 0
V5 and V10 have the largest coefficient magnitudes and therefore contribute most to the classification.
# Load the data
data <- read.table("credit_card_data.txt", header = FALSE)

# Candidate C values to test, and trackers for the best result so far
C_values <- c(10^-20, 10^-10, 10, 100, 1000, 10000, 100000, 1000000, 10000000)
accuracy <- 0
best_accuracy <- 0.0
best_C <- NA

# Loop over different values of C
for (i in seq_along(C_values)) {
  C <- C_values[i]
  # Train a model using ksvm with the current C value
  model <- ksvm(as.matrix(data[,1:10]),
                as.factor(data[,11]),
                type = "C-svc",
                kernel = "vanilladot",  # vanilladot is a simple linear kernel
                C = C, scaled = TRUE)
  # See what the model predicts
  pred <- predict(model, data[,1:10])
  # Calculate accuracy on the full data set
  accuracy <- sum(pred == data[,11]) / nrow(data)
  if (accuracy > best_accuracy) {
    best_accuracy <- accuracy
    best_C <- C
    # Calculate a1...am (the coefficients) for the best C so far
    a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
    # Calculate a0 (the intercept)
    a0 <- -model@b
    # Proportion of data predicted as "Yes"
    prop <- sum(pred == 1) / nrow(data)  # 53.67%
  }
}
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
## Setting default kernel parameters
print(best_C) #10
## [1] 10
print(best_accuracy) #86.4%
## [1] 0.8639144
print(a)
## V1 V2 V3 V4 V5
## -0.0009033671 -0.0007891036 -0.0016972133 0.0026113628 1.0050221406
## V6 V7 V8 V9 V10
## -0.0028363016 -0.0001569285 -0.0003925964 -0.0012784443 0.1064387167
print(a0)
## [1] 0.08157559
print(prop) #53.7%
## [1] 0.5366972
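As a quick sanity check, the printed coefficients can be applied to the scaled predictors to confirm that the equation above reproduces the model's accuracy. This is only a sketch: it assumes kernlab's internal scaling matches base R's scale() and that class "1" falls on the positive side of the hyperplane (scaled_X and manual_pred are illustrative names).
# Sanity check (sketch): apply the recovered equation to the scaled
# data and compare the resulting accuracy against best_accuracy.
# Assumes scale() matches kernlab's internal scaling and that class "1"
# is on the positive side; if the labels are flipped, negate the expression.
scaled_X <- scale(data[,1:10])
manual_pred <- ifelse(scaled_X %*% a + a0 > 0, 1, 0)
sum(manual_pred == data[,11]) / nrow(data)  # should match best_accuracy (~0.864)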
2. You are welcome, but not required, to try other (nonlinear)
kernels as well; we’re not covering them in this course, but they
can sometimes be useful and might provide better predictions than
vanilladot.
# Create a model using ksvm - hyperbolic tangent kernel
model1 <- ksvm(as.matrix(data[,1:10]),
               as.factor(data[,11]),
               type = "C-svc",
               kernel = "tanhdot",
               C = 100, scaled = TRUE)
## Setting default kernel parameters
# see what the model predicts
pred1 <- predict(model1, data[,1:10])
# Proportion of data predicted as "Yes" - Hyperbolic tangent kernel
sum(pred1 == 1)/nrow(data) #45.26%
## [1] 0.4525994
# see what fraction of the model’s predictions match the actual classification
sum(pred1 == data[,11]) / nrow(data) #72.17%
## [1] 0.7217125
The hyperbolic tangent kernel returns an accuracy of 72.2%.
# Create a model using ksvm - polynomial kernel
model2 <- ksvm(as.matrix(data[,1:10]),
               as.factor(data[,11]),
               type = "C-svc",
               kernel = "polydot",
               C = 100, scaled = TRUE)
## Setting default kernel parameters
# see what the model predicts
pred2 <- predict(model2, data[,1:10])
# Proportion of data predicted as "Yes" - Polynomial kernel
sum(pred2 == 1)/nrow(data) #53.67%
## [1] 0.5366972
# see what fraction of the model’s predictions match the actual classification
sum(pred2 == data[,11]) / nrow(data) #86.39%
## [1] 0.8639144
The polynomial kernel returns an accuracy of 86.4%. The linear kernel, also at 86.4%, outperforms the hyperbolic tangent kernel and matches the polynomial kernel on this problem. (This is expected: kernlab's polydot defaults to degree 1, which makes it essentially a linear kernel.)
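Another common nonlinear choice is the RBF (Gaussian) kernel. A minimal sketch follows, mirroring the calls above, with sigma left to kernlab's automatic estimation (model3 and pred3 are illustrative names); the resulting full-data accuracy can be compared against the 86.4% above.
# Sketch: an RBF (Gaussian) kernel, another common nonlinear choice;
# sigma is left to kernlab's automatic estimation
model3 <- ksvm(as.matrix(data[,1:10]),
               as.factor(data[,11]),
               type = "C-svc",
               kernel = "rbfdot",
               C = 100, scaled = TRUE)
pred3 <- predict(model3, data[,1:10])
sum(pred3 == data[,11]) / nrow(data)  # full-data accuracy, for comparison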
3. Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).
Using a for-loop, values of k from 1 to 20 are tested and the k producing the best accuracy is selected, as in the sketch below. The selected k is 12, and the corresponding accuracy of the model is 85.3%.
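The code below is a minimal sketch of that loop. It assumes the common leave-one-out setup (each point is predicted from all other points, so a point cannot vote for itself) and a numeric 0/1 response rounded at 0.5; accuracies, pred_knn, and best_k are illustrative names.
# Load required package
library(kknn)

# Test k from 1 to 20; for each k, predict every point using all
# *other* points (leave-one-out), so a point cannot vote for itself
accuracies <- rep(0, 20)
for (k in 1:20) {
  pred_knn <- rep(0, nrow(data))
  for (i in 1:nrow(data)) {
    model_knn <- kknn(V11 ~ ., data[-i, ], data[i, ],
                      k = k, scale = TRUE)
    # fitted() returns a weighted average of the neighbors' 0/1 labels;
    # rounding at 0.5 turns it into a class prediction
    pred_knn[i] <- as.integer(fitted(model_knn) + 0.5)
  }
  accuracies[k] <- sum(pred_knn == data[,11]) / nrow(data)
}
best_k <- which.max(accuracies)
best_k               # reported above: 12
accuracies[best_k]   # reported above: ~0.853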