CPSC HW 1234 WITH COMPLETE SOLUTIONS
You should start with the following CountVectorizer and LogisticRegression objects, as
well as X_train and y_train (which you should further split with train_test_split and
shuffle=False):

countvec = CountVectorizer(stop_words="english")
lr = LogisticRegression(max_iter=1000, random_state=123)

ANSWER:
# BEGIN SOLUTION
# Split off a validation fold from the training data, keeping the original order
X_train_fold, X_valid_fold, y_train_fold, y_valid_fold = train_test_split(
    X_train, y_train, test_size=0.2, shuffle=False
)
# Fit the vectorizer on the training fold only, then transform both folds;
# fitting on the validation fold would leak information
X_train_fold_vec = countvec.fit_transform(X_train_fold)
X_valid_fold_vec = countvec.transform(X_valid_fold)
lr.fit(X_train_fold_vec, y_train_fold)
fold_score = lr.score(X_valid_fold_vec, y_valid_fold)
# END SOLUTION
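The same pattern can be exercised end to end on a tiny made-up corpus (the texts and labels below are invented purely for illustration; the real X_train and y_train come from the assignment data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy sentiment corpus standing in for the real training data
X_train = [
    "good movie great acting",
    "terrible plot bad acting",
    "wonderful film loved it",
    "awful boring waste of time",
    "great direction good script",
    "bad movie terrible script",
    "loved the great soundtrack",
    "boring awful direction",
    "good fun great cast",
    "terrible awful mess",
]
y_train = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

countvec = CountVectorizer(stop_words="english")
lr = LogisticRegression(max_iter=1000, random_state=123)

# Carve a validation fold off the training data without shuffling
X_train_fold, X_valid_fold, y_train_fold, y_valid_fold = train_test_split(
    X_train, y_train, test_size=0.2, shuffle=False
)

# Vocabulary is learned from the training fold only (no leakage)
X_train_fold_vec = countvec.fit_transform(X_train_fold)
X_valid_fold_vec = countvec.transform(X_valid_fold)

lr.fit(X_train_fold_vec, y_train_fold)
fold_score = lr.score(X_valid_fold_vec, y_valid_fold)
print(fold_score)
```

Note that with shuffle=False and test_size=0.2 the last two examples become the validation fold, so the split is fully reproducible.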
 #  Column          Non-Null Count  Dtype
 0  age             13024 non-null  int64
 1  workclass       12284 non-null  object
 2  fnlwgt          13024 non-null  int64
 3  education       13024 non-null  object
 4  education.num   13024 non-null  int64
 5  marital.status  13024 non-null  object
 6  occupation      12281 non-null  object
...
Given the information above, after performing cross-validation with a dummy classifier, would training sklearn's SVC model on X_train and y_train work at this point? Why or why not?

ANSWER: It won't work at this point because the data has not been preprocessed yet: there are categorical (object) columns, and some columns contain missing values. We need to preprocess the data first (impute missing values and encode the categorical features) before feeding it into ML algorithms.
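A minimal sketch of such preprocessing could look like the following. The tiny DataFrame and the choice of imputation/encoding strategies are assumptions for illustration; only a small subset of the columns shown above is used:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Tiny made-up frame mimicking the census-style columns above
df = pd.DataFrame({
    "age": [39, 50, np.nan, 28],
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "education.num": [13, 9, 7, 13],
    "income": [">50K", "<=50K", "<=50K", ">50K"],
})
X_train = df.drop(columns=["income"])
y_train = df["income"]

numeric = ["age", "education.num"]
categorical = ["workclass"]

preprocessor = ColumnTransformer([
    # Numeric columns: fill NaNs with the median, then scale
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), numeric),
    # Categorical columns: fill NaNs with the most frequent value, then one-hot encode
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
])

pipe = make_pipeline(preprocessor, SVC())
pipe.fit(X_train, y_train)  # works: no NaNs or object columns reach SVC
```

Wrapping the preprocessing and the model in one pipeline also keeps cross-validation leak-free, since the imputer and encoder are refit on each training fold.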
After performing the CV, you get:

max_features = 100      train = 0.843253   cv = 0.839331
max_features = 1000     train = 0.911779   cv = 0.911779
max_features = 10000    train = 0.964317   cv = 0.894983
max_features = 100000   train = 0.976644   cv = 0.895098
Which one should you choose?

ANSWER: In terms of cross-validation score, it looks like the best is max_features=100_000. In this case that means using all the words, since the total number of distinct words is 27,345, which is less than 100,000.
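One way to sanity-check that the cap has no effect is to inspect the size of the fitted vocabulary. A small sketch on a toy corpus (the sentences are invented; in the assignment you would fit on the real training text):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real training text
corpus = ["the quick brown fox", "jumps over the lazy dog", "the fox"]

countvec = CountVectorizer(stop_words="english")
countvec.fit(corpus)

# Number of distinct words kept; when max_features exceeds this,
# the cap is inactive and every word is used
n_words = len(countvec.vocabulary_)
print(n_words)
```

If n_words is below the max_features value you pass, the two settings give identical feature matrices.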
Discuss how changing the max_depth hyperparameter affects the training and cross-validation accuracy. What does it mean when the accuracy is 1.0 for max_depth >= 15?

ANSWER: For the training data, a higher value of max_depth results in higher accuracy. For max_depth >= 15 the training accuracy is 1.0, which means the model classifies every training example perfectly. This happens because at higher max_depth values the decision tree effectively learns a specific rule for almost every example in the training data. The cross-validation accuracy, in contrast, initially increases a bit and then goes back down as the tree starts to overfit.
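The behaviour described above can be reproduced on a small synthetic dataset (a sketch; the dataset, depth grid, and seeds are invented for illustration, not the assignment data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, somewhat noisy binary classification problem
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=123)

for max_depth in [1, 3, 5, 10, 15, 20]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=123)
    scores = cross_validate(tree, X, y, cv=5, return_train_score=True)
    print(f"max_depth={max_depth:2d}  "
          f"train={scores['train_score'].mean():.3f}  "
          f"cv={scores['test_score'].mean():.3f}")
```

As max_depth grows, the mean train score climbs toward 1.0 while the cv score peaks at a moderate depth and then flattens or drops, the classic overfitting pattern.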
Generally speaking, should the best CV scores from optimizing *individual hyperparameters* agree with the best CV scores from the joint optimization of multiple hyperparameters? Why or why not?

ANSWER: In general there is no reason they need to agree: hyperparameters can interact, so by jointly optimizing them you might find a combination that is better than anything the one-at-a-time searches tried.
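A minimal sketch of joint optimization (the model, grid, and data are invented for illustration): a joint grid search evaluates every combination, including pairs that two separate one-dimensional searches would never try together.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Joint search: all (max_depth, min_samples_split) pairs are scored,
# so interactions between the two hyperparameters are accounted for
param_grid = {
    "max_depth": [2, 5, 10, None],
    "min_samples_split": [2, 10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Tuning each hyperparameter separately (holding the other fixed) can land on a different, possibly worse, combination than this joint search.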
Given

validation score = 0.8955017301038062
test score = 0.8913193910502845

How does your test accuracy compare to your validation accuracy? If they are different, do you think this is because you "overfitted on the validation set", or simply random luck?

ANSWER: The test score is very close to the validation score (a gap of only about 0.004), so it doesn't seem like we are overfitting on the validation set.
Given: CV score = 0.679, test score = 0.683