Assignment 1: Statistical Learning
(Note: Estimating statistical models on data sets as small as those considered
in this assignment typically makes no sense. However, here it allows us to
perform the calculations by hand and makes the concepts tangible.)
1. You work for a company that has several ice cream stores close to
beaches and sells only one variant of ice cream. You get a data set
containing the following pieces of historical information for each day
and store: ice-cream demand, average water temperature, average air
temperature, max wind speed, whether the day was during the week
or on the weekend, whether it was a holiday in the respective location,
precipitation, humidity, and a brief description of the sky as clear,
partly cloudy or mostly cloudy. Your boss asks you to set up a model
with demand as the dependent variable and some of the other variables
as independent variables. She tells you to randomly divide the data
set into one training data set and one test data set. Each of these two
data sets should contain half of the observations, and no observation
should be part of both data sets. The test set is for validating the
model.
(a) For each variable, briefly explain if it is a quantitative, qualitative
or indicator variable.
(b) Your boss wants to use your model to determine how much ice
cream to produce for each store. Discuss if the primary goal
should be inference or prediction.
(c) Suppose your boss also wants to use your model to decide where
to open new stores. Discuss if the primary goal should be
inference or prediction.
(d) Can you tell how the training MSE will change if you use a more
flexible model? Can you tell how the test MSE will change if you
use a more flexible model? Explain.
(e) Your boss suggests estimating the following model:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_{\text{water temp}} + \hat{\beta}_2 \cdot x_{\text{air temp}} + \hat{\beta}_3 \cdot x_{\text{weekday}} \]
Explain if this is a parametric or non-parametric method.
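
To make the setup concrete, here is a minimal Python sketch of the workflow described above: a random split into two disjoint halves, a least-squares fit of the model from (e) on the training half, and the MSE on both halves. The data are synthetic and purely hypothetical (the coefficients, value ranges and noise level are invented for illustration and are not part of the assignment), and x_weekday is assumed to be coded as 1 for weekdays and 0 for weekends. The helper names split_half, fit_linear, predict and mse are likewise illustrative.

```python
import numpy as np


def split_half(n_obs, rng):
    """Return two disjoint index arrays (train, test), each with half of the observations."""
    perm = rng.permutation(n_obs)
    half = n_obs // 2
    return perm[:half], perm[half:]


def fit_linear(X, y):
    """Least-squares estimates (intercept first) for y = b0 + b1*x1 + ... + bp*xp."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Fitted values for the rows of X given the estimated coefficients."""
    return beta[0] + X @ beta[1:]


def mse(y, y_hat):
    """Mean squared error between observed and fitted values."""
    return float(np.mean((y - y_hat) ** 2))


# Hypothetical synthetic data; every number below is an assumption for illustration.
rng = np.random.default_rng(0)
n = 200
water = rng.uniform(15.0, 28.0, n)             # average water temperature
air = rng.uniform(18.0, 35.0, n)               # average air temperature
weekday = rng.integers(0, 2, n).astype(float)  # assumed coding: 1 = weekday, 0 = weekend
X = np.column_stack([water, air, weekday])
demand = 50 + 3 * water + 4 * air - 20 * weekday + rng.normal(0.0, 10.0, n)

# Random 50/50 split; no observation appears in both sets.
train_idx, test_idx = split_half(n, rng)

# Estimate the coefficients on the training half only.
beta_hat = fit_linear(X[train_idx], demand[train_idx])

train_mse = mse(demand[train_idx], predict(beta_hat, X[train_idx]))
test_mse = mse(demand[test_idx], predict(beta_hat, X[test_idx]))
print("estimated coefficients:", beta_hat)
print("training MSE:", train_mse)
print("test MSE:", test_mse)
```

The coefficients are estimated on the training half only, so the test MSE measures the fit on observations the model has not seen.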