Solutions Manual
to Accompany
Data Science and Machine Learning:
Mathematical and Statistical Methods
Dirk P. Kroese Zdravko I. Botev Thomas Taimre
Slava Vaisman Robert Salomone
8th January 2020
CONTENTS
Preface
1 Importing, Summarizing, and Visualizing Data
2 Statistical Learning
3 Monte Carlo Methods
4 Unsupervised Learning
5 Regression
6 Kernel Methods
7 Classification
8 Tree Methods
9 Deep Learning
PREFACE
We believe that the only effective way to master the theory and practice of Data Science
and Machine Learning is through exercises and experiments. For this reason, we included
many exercises and algorithms in Data Science and Machine Learning: Mathematical and
Statistical Methods (DSML), Chapman and Hall/CRC, 2019.
This companion volume to DSML is written in the same style and contains a wealth of
additional material: worked solutions for the over 150 exercises in DSML, many Python
programs, and additional illustrations.
Like DSML, this solutions manual is aimed at anyone interested in gaining a better
understanding of the mathematics and statistics that underpin the rich variety of ideas and
machine learning algorithms in data science. One of the main goals of the manual is to
provide a comprehensive solutions guide for instructors, which will aid student assessment
and stimulate further student development. In addition, this manual offers a unique
complement to DSML for self-study. All too often, a stumbling block for learning is the
unavailability of worked solutions and actual algorithms.
The solutions manual covers a wide range of exercises in data analysis, statistical
learning, Monte Carlo methods, unsupervised learning, regression, regularization and
kernel methods, classification, decision trees and ensemble methods, and deep learning.
Our choice of Python was motivated by its ease of use and clarity of syntax.
Reference numbers to DSML are indicated in boldface blue font. For example, Definition
1.1.1 refers to the corresponding definition in DSML, and (1.7) refers to equation (1.7)
in DSML, whereas Figure 1.1 refers to the first numbered figure in the present document.
This solutions manual was financially supported by the Australian Research Council
Centre of Excellence for Mathematical & Statistical Frontiers, under grant number
CE140100049.
Dirk Kroese, Zdravko Botev,
Thomas Taimre, Radislav Vaisman, and Robert Salomone
Brisbane and Sydney
CHAPTER 1
IMPORTING, SUMMARIZING, AND VISUALIZING DATA
1. Visit the UCI Repository https://archive.ics.uci.edu/. Read the description of
the data and download the Mushroom data set agaricus-lepiota.data. Using pandas,
read the data into a DataFrame called mushroom, via read_csv.
We can import the file directly via its URL:
import pandas as pd
URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushroom = pd.read_csv(URL, header=None)
(a) How many features are in this data set?
Solution: There are 23 features.
mushroom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
0     8124 non-null object
1     8124 non-null object
2     8124 non-null object
3     8124 non-null object
4     8124 non-null object
5     8124 non-null object
6     8124 non-null object
7     8124 non-null object
8     8124 non-null object
9     8124 non-null object
10    8124 non-null object
11    8124 non-null object
12    8124 non-null object
13    8124 non-null object
14    8124 non-null object
15    8124 non-null object
16    8124 non-null object
17    8124 non-null object
18    8124 non-null object
19    8124 non-null object
20    8124 non-null object
21    8124 non-null object
22    8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
(b) What are the initial names and types of the features?
Solution: From the output of mushroom.info(), we see that the initial names of the
features are 0, 1, 2, ..., 22, and that they all have the type object.
(c) Rename the first feature (index 0) to 'edibility' and the sixth feature (index 5) to
'odor'. [Hint: the column names in pandas are immutable, so individual columns
cannot be modified directly. However, it is possible to assign the entire list of column
names via mushroom.columns = newcols.]
Solution:
# create a list object that contains the column names
newcols = mushroom.columns.tolist()
# assign new values in the list
newcols[0] = 'edibility'
newcols[5] = 'odor'
# replace the column names with our list
mushroom.columns = newcols
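An alternative (a sketch, not the approach above) is the DataFrame.rename method, which renames selected columns by label in one call; the small frame below is a hypothetical stand-in for the mushroom data:

```python
import pandas as pd

# tiny stand-in for the mushroom DataFrame (hypothetical values)
mushroom = pd.DataFrame([['p', 'x', 's', 'n', 't', 'p'],
                         ['e', 'x', 's', 'y', 't', 'a']])

# rename only the columns of interest, keyed by their current integer labels
mushroom = mushroom.rename(columns={0: 'edibility', 5: 'odor'})
print(mushroom.columns.tolist())
```

Unlike assigning mushroom.columns, rename leaves the remaining column labels untouched.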
(d) The sixth column lists the various odors of the mushrooms, encoded as 'a', 'c', ....
Replace these with the names 'almond', 'creosote', etc. (the categories corresponding
to each letter can be found on the website). Also replace the 'edibility'
categories 'e' and 'p' with 'edible' and 'poisonous'.
Solution:
DICT = {'a': 'almond', 'c': 'creosote', 'f': 'foul',
        'l': 'anise', 'm': 'musty', 'n': 'none', 'p': 'pungent',
        's': 'spicy', 'y': 'fishy'}
mushroom.odor = mushroom.odor.replace(DICT)
DICT = {'e': 'edible', 'p': 'poisonous'}
mushroom.edibility = mushroom.edibility.replace(DICT)
(e) Make a contingency table cross-tabulating 'edibility' and 'odor'.
Solution:
pd.crosstab(mushroom.odor, mushroom.edibility)
edibility  edible  poisonous
odor
almond        400          0
anise         400          0
creosote        0        192
fishy           0        576
foul            0       2160
musty           0         36
none         3408        120
pungent         0        256
spicy           0        576
(f) Which mushroom odors should be avoided when gathering mushrooms for consumption?
Solution: The table in the previous question shows that every odor other than almond
and anise has at least some poisonous observations. Thus, all odors other than these
two should be avoided.
(g) What proportion of odorless mushroom samples were safe to eat?
Solution: We can read the relevant counts directly off the contingency table and
compute:
3408/(3408+120)
0.9659863945578231
Alternatively, we can find the answer without the table:
mushroom[(mushroom.edibility == 'edible') & (mushroom.odor == 'none')].shape[0] \
    / mushroom[mushroom.odor == 'none'].shape[0]
0.9659863945578231
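The same proportion can also be obtained in one step by normalizing the contingency table: pd.crosstab accepts a normalize argument, and normalize='index' divides each row by its total. A small hypothetical frame illustrates the idea:

```python
import pandas as pd

# hypothetical miniature of the mushroom data: 5 odorless (4 edible), 2 foul
mushroom = pd.DataFrame({
    'odor': ['none'] * 5 + ['foul'] * 2,
    'edibility': ['edible'] * 4 + ['poisonous'] * 3,
})

# each row of the normalized table now holds proportions that sum to 1
tab = pd.crosstab(mushroom.odor, mushroom.edibility, normalize='index')
print(tab.loc['none', 'edible'])  # proportion of odorless samples that are edible
```

On the full data set, tab.loc['none', 'edible'] would give the same 0.9659... as above.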
2. Change the type and value of variables in the nutri data set according to Table 1.2 and
save the data as a CSV file. The modified data should have eight categorical features, three
floats, and two integer features.
Solution:
import pandas as pd
nutri = pd.read_excel('nutrition_elderly.xls')
nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')
nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')
# create a dictionary that will be used for multiple columns
freq_dict = {0: 'Never', 1: '< once a week', 2: 'once a week',
             3: '2/3 times a week', 4: '4-6 times a week', 5: 'every day'}
cols = ['meat', 'fish', 'raw_fruit', 'cooked_fruit_veg', 'chocol']
nutri[cols] = nutri[cols].replace(freq_dict).astype('category')
nutri.fat = nutri.fat.replace({1: 'Butter', 2: 'Margarine', 3: 'Peanut oil',
                               4: 'Sunflower oil', 5: 'Olive oil',
                               6: 'Mix of vegetable oils', 7: 'Colza oil',
                               8: 'Duck or goose fat'}).astype('category')
# assign the float data type to the required columns
cols = ['height', 'weight', 'age']
nutri[cols] = nutri[cols].astype('float')
We then verify that the modified data has the correct types for each feature:
nutri.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 226 entries, 0 to 225
# Data columns (total 13 columns):
# gender              226 non-null category
# situation           226 non-null category
# tea                 226 non-null int64
# coffee              226 non-null int64
# height              226 non-null float64
# weight              226 non-null float64
# age                 226 non-null float64
# meat                226 non-null category
# fish                226 non-null category
# raw_fruit           226 non-null category
# cooked_fruit_veg    226 non-null category
# chocol              226 non-null category
# fat                 226 non-null category
# dtypes: category(8), float64(3), int64(2)
# memory usage: 12.3 KB
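The exercise also asks to save the modified data as a CSV file. A minimal sketch with to_csv (the file name nutri.csv is our choice, and the tiny frame below stands in for the modified data; note that index=False keeps the RangeIndex out of the file):

```python
import pandas as pd

# small stand-in for the modified nutri DataFrame (hypothetical values)
nutri = pd.DataFrame({'gender': pd.Categorical(['Male', 'Female']),
                      'height': [170.0, 165.0]})

# write without the index column, so re-reading yields the same columns
nutri.to_csv('nutri.csv', index=False)

# quick round-trip check (note: the category dtype is not preserved by CSV)
nutri2 = pd.read_csv('nutri.csv')
print(nutri2.columns.tolist())
```

Because CSV stores no type information, the categorical conversions must be re-applied after reading the file back.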
3. It frequently happens that a table with data needs to be restructured before the data can
be analyzed using standard statistical software. As an example, consider the test scores in
Table 1.3 of 5 students before and after specialized tuition.
Table 1.3: Student scores.
Student Before After
1 75 85
2 30 50
3 100 100
4 50 52
5 60 65
This is not in the standard format described in Section 1.1. In particular, the student scores
are divided over two columns, whereas the standard format requires that they are collected
in one column, e.g., labelled 'Score'. Reformat the table in standard format, using three
features:
,Chapter 1. Importing, Summarizing, and Visualizing Data
• 'Score', taking continuous values,
• 'Time', taking values 'Before' and 'After',
• 'Student', taking values from 1 to 5.
Solution: Up to a possible reordering of the rows, your table should look like the one given
below, which was made with the melt method of pandas.
# manually create dataframe with data from table
values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
import pandas as pd
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])
# reshape to long format; value_name gives the score column its required name
df = pd.melt(df, id_vars=['Student'], var_name='Time',
             value_vars=['Before', 'After'], value_name='Score')
print(df)
   Student    Time  Score
0        1  Before     75
1        2  Before     30
2        3  Before    100
3        4  Before     50
4        5  Before     60
5        1   After     85
6        2   After     50
7        3   After    100
8        4   After     52
9        5   After     65
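For this simple two-column case, the same long format can also be reached with stack (a sketch, not the method above; note that the row ordering differs from melt, interleaving Before/After per student):

```python
import pandas as pd

values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])

# stack the two score columns into one long column, then flatten the index
long = df.set_index('Student').stack().reset_index()
long.columns = ['Student', 'Time', 'Score']
print(long.head(4))
```

Both routes produce the same three features; melt is generally preferred when only some columns should be unpivoted.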
4. Create a barplot similar to the one in Figure 1.5, but now plot the corresponding proportions of
males and females in each of the three situation categories. That is, the heights of the bars
should sum up to 1 for each barplot with the same 'gender' value. [Hint: seaborn does
not have this functionality built in; instead, you need to first create a contingency table and
use matplotlib.pyplot to produce the figure.]
Solution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)
nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')
nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')
contingencyTable = pd.crosstab(nutri.gender, nutri.situation)
male_counts = contingencyTable.stack().Male
female_counts = contingencyTable.stack().Female
xind = np.arange(len(nutri.situation.unique()))
width = 0.3
plt.figure(figsize=[5, 3])
plt.bar(xind - width/2, male_counts/male_counts.sum(), width,
        color='SkyBlue', label='Men', edgecolor='black')
plt.bar(xind + width/2, female_counts/female_counts.sum(), width,
        color='Pink', label='Women', edgecolor='black')
plt.ylabel('Proportions')
plt.xticks(xind, contingencyTable.columns)  # ticks centered between the paired bars
plt.legend(loc=(0.4, 0.7))
[Figure: grouped barplot of the proportions of Men and Women in the Couple, Family, and Single categories; y-axis 'Proportions', ranging from 0.0 to 0.7.]
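A shorter route (a sketch, not the method used above) is to normalize the crosstab directly and let pandas draw the grouped bars; the toy frame below is a hypothetical stand-in for the nutri data:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# hypothetical miniature of the nutri data
nutri = pd.DataFrame({
    'gender': ['Male'] * 3 + ['Female'] * 3,
    'situation': ['Single', 'Couple', 'Couple', 'Single', 'Single', 'Family'],
})

# normalize='index' makes each gender's bar heights sum to 1
tab = pd.crosstab(nutri.gender, nutri.situation, normalize='index')
ax = tab.T.plot.bar(rot=0, edgecolor='black', figsize=(5, 3))
ax.set_ylabel('Proportions')
plt.tight_layout()
```

The transpose puts the situation categories on the x-axis, matching the figure above, and plot.bar handles the bar offsets automatically.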
5. The iris data set, mentioned in Section 1.1, contains various features, including
'Petal.Length' and 'Sepal.Length', of three species of iris: setosa, versicolor, and
virginica.
(a) Load the data set into a pandas DataFrame object.
Solution:
import pandas as pd
urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'datasets/iris.csv'
iris = pd.read_csv(urlprefix + dataname)
(b) Using matplotlib.pyplot, produce boxplots of 'Petal.Length' for each of the
three species, in one figure.
Solution:
import matplotlib.pyplot as plt
labels = ["setosa", "versicolor", "virginica"]
# select the petal lengths of each species separately
data = [iris[iris.Species == s]['Petal.Length'] for s in labels]
plt.boxplot(data, labels=labels)
plt.ylabel('Petal.Length')
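Pandas can also produce the grouped boxplots in a single call via DataFrame.boxplot with the by argument (a sketch; the tiny frame below is a hypothetical stand-in for the iris data loaded above):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

# hypothetical miniature of the iris DataFrame
iris = pd.DataFrame({'Species': ['setosa'] * 3 + ['virginica'] * 3,
                     'Petal.Length': [1.4, 1.3, 1.5, 5.1, 5.9, 5.5]})

# one box per species, grouped automatically
ax = iris.boxplot(column='Petal.Length', by='Species')
plt.tight_layout()
```

This avoids building the per-species lists by hand, at the cost of pandas' default figure titles, which may need to be cleared for publication-quality output.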