Chapter 6
Decision-Making Using Machine Learning Basics
Chapter Review
[6.2, LO 6.2.1]
1. You are working with a dataset containing information about customer purchases at an
online retail store. Each data point represents a customer and includes features such as age,
gender, location, browsing history, and purchase history. Your task is to segment the customers
into distinct groups based on their purchasing behavior in order to personalize marketing
strategies. Which of the following machine learning techniques is best suited for this scenario?
a. linear or multiple linear regression
b. logistic or multiple logistic regression
c. k-means clustering
d. naïve Bayes classification
Solution: c. k-means clustering
This is a clustering problem. K-means clustering is the best choice. Regression techniques
cannot be used with non-numerical features in the data (such as gender, browsing history, and
purchase history). The data is unlabeled since we do not already know the groups that
customers will be classified into, so naïve Bayes classification is not appropriate.
Critical Thinking
[6.1, LO 6.1.2, 6.1.4]
1. Discuss how different ratios of training versus testing data can affect the model in terms of
underfitting and overfitting. How does the testing set provide a means to identify issues with
underfitting and overfitting?
Solution: When a model is trained on a large proportion of the dataset (for example, 90%
training and 10% testing), the model may pick up on more details in the dataset, giving it high
accuracy on the training set. If those details are due to random noise or outliers, the model may
be prone to overfitting in this case.
When a model is trained on a small proportion of the dataset (for example, 50% training and
50% testing), the model may not see enough training data to learn complex relationships that
do exist in the dataset. So, in this case, the model is prone to underfitting.
If the model’s accuracy is significantly lower on the testing set, this indicates an issue with
either underfitting or overfitting. However, in the case of a large train/test ratio, the testing set
may be too small to evaluate the model adequately. This is why it is important to use a
substantial testing set in most machine learning algorithms.
[6.3, LO 6.3.2]
11/11/24 For more free, peer-reviewed, openly licensed resources visit OpenStax.org. 2