
Solutions Manual for Data Science and Machine Learning: Mathematical and Statistical Methods 1st Edition by Dirk P. Kroese, Zdravko Botev, Thomas Taimre, Radislav Vaisman

Rating
-
Sold
-
Pages
175
Grade
A+
Uploaded on
28-04-2025
Written in
2024/2025

Institution
Data Science and Machine Learning: Mathematical
Course
Data Science and Machine Learning: Mathematical

Document information

Type
Exam (elaborations)
Contains
Questions & answers

Content preview

Solutions Manual

to Accompany
Data Science and Machine Learning:
Mathematical and Statistical Methods




Dirk P. Kroese Zdravko I. Botev Thomas Taimre
Slava Vaisman Robert Salomone

8th January 2020

CONTENTS

Preface
1 Importing, Summarizing, and Visualizing Data
2 Statistical Learning
3 Monte Carlo Methods
4 Unsupervised Learning
5 Regression
6 Kernel Methods
7 Classification
8 Tree Methods
9 Deep Learning

PREFACE




We believe that the only effective way to master the theory and practice of Data Science
and Machine Learning is through exercises and experiments. For this reason, we included
many exercises and algorithms in Data Science and Machine Learning: Mathematical and
Statistical Methods (DSML), Chapman and Hall/CRC, 2019.

This companion volume to DSML is written in the same style and contains a wealth of
additional material: worked solutions for the over 150 exercises in DSML, many Python
programs, and additional illustrations.

Like DSML, this solutions manual is aimed at anyone interested in gaining a better
understanding of the mathematics and statistics that underpin the rich variety of ideas and
machine learning algorithms in data science. One of the main goals of the manual is to
provide a comprehensive solutions guide to instructors, which will aid student assessment
and stimulate further student development. In addition, this manual offers a unique
complement to DSML for self-study. All too often a stumbling block for learning is the
unavailability of worked solutions and actual algorithms.

The solutions manual covers a wide range of exercises in data analysis, statistical
learning, Monte Carlo methods, unsupervised learning, regression, regularization and
kernel methods, classification, decision trees and ensemble methods, and deep learning. Our
choice of using Python was motivated by its ease of use and clarity of syntax.

Reference numbers to DSML are indicated in boldface blue font. For example,
Definition 1.1.1 refers to the corresponding definition in DSML, and (1.7) refers to equation (1.7)
in DSML, whereas Figure 1.1 refers to the first numbered figure in the present document.

This solutions manual was financially supported by the Australian Research Council
Centre of Excellence for Mathematical & Statistical Frontiers, under grant number
CE140100049.


Dirk Kroese, Zdravko Botev,
Thomas Taimre, Radislav Vaisman, and Robert Salomone
Brisbane and Sydney

CHAPTER 1

IMPORTING, SUMMARIZING, AND VISUALIZING DATA


1. Visit the UCI Repository https://archive.ics.uci.edu/. Read the description of
the data and download the Mushroom data set agaricus-lepiota.data. Using pandas,
read the data into a DataFrame called mushroom, via read_csv.
We can import the file directly via its URL:

import pandas as pd

# the file has no header row, so header=None keeps the first line as data
URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushroom = pd.read_csv(URL, header=None)


(a) How many features are in this data set?

Solution: There are 23 features.

mushroom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
0     8124 non-null object
1     8124 non-null object
2     8124 non-null object
3     8124 non-null object
4     8124 non-null object
5     8124 non-null object
6     8124 non-null object
7     8124 non-null object
8     8124 non-null object
9     8124 non-null object
10    8124 non-null object
11    8124 non-null object
12    8124 non-null object
13    8124 non-null object
14    8124 non-null object
15    8124 non-null object
16    8124 non-null object
17    8124 non-null object
18    8124 non-null object
19    8124 non-null object
20    8124 non-null object
21    8124 non-null object
22    8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB

(b) What are the initial names and types of the features?
Solution: From the output of mushroom.info(), we see that the initial names of the
features are 0, 1, 2, . . . , 22, and that they all have the type object.
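
The default integer names come from reading the file with header=None. The following minimal sketch illustrates this on two made-up rows held in memory (raw and toy are hypothetical names, not part of the mushroom data), so that it runs without a download:

```python
import io
import pandas as pd

# two made-up rows in the same comma-separated, header-less layout
raw = io.StringIO("p,x,s\ne,b,y\n")
toy = pd.read_csv(raw, header=None)

print(toy.shape[1])          # number of columns (features)
print(toy.columns.tolist())  # default integer column names
print(toy.dtypes)            # type of each column
```

The same three calls applied to mushroom give the number of features, their names, and their types without scanning the full info() listing.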
(c) Rename the first feature (index 0) to 'edibility' and the sixth feature (index 5) to
'odor'. [Hint: the column names in pandas are immutable, so individual column names
cannot be modified directly. However, it is possible to assign the entire list of column
names via mushroom.columns = newcols.]
Solution:
# create a list object that contains the column names
newcols = mushroom.columns.tolist()

# assign new values in the list
newcols[0] = 'edibility'
newcols[5] = 'odor'

# replace the column names with our list
mushroom.columns = newcols
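
As a possible alternative to rebuilding the whole list, pandas also offers DataFrame.rename, which takes a mapping from old to new column labels. The sketch below is not the manual's solution; it uses a small made-up DataFrame (toy is a hypothetical stand-in for mushroom) so that it runs on its own:

```python
import pandas as pd

# toy stand-in for the mushroom data: six unnamed columns, labelled 0..5
toy = pd.DataFrame([list("pxsntp"), list("exsyta")])

# rename returns a new DataFrame; only the listed labels change
toy = toy.rename(columns={0: 'edibility', 5: 'odor'})
print(toy.columns.tolist())
```

Either approach leaves the remaining integer labels untouched.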

(d) The 6th column lists the various odors of the mushrooms, encoded as 'a', 'c', . . . .
Replace these with the names 'almond', 'creosote', etc. (the categories corresponding
to each letter can be found on the website). Also replace the 'edibility'
categories 'e' and 'p' with 'edible' and 'poisonous'.
Solution:
# map each odor code to its full name
DICT = {'a': "almond", 'c': "creosote", 'f': "foul",
        'l': "anise", 'm': "musty", 'n': "none", 'p': "pungent",
        's': "spicy", 'y': "fishy"}
mushroom.odor = mushroom.odor.replace(DICT)

# map each edibility code to its full name
DICT = {'e': "edible", 'p': "poisonous"}
mushroom.edibility = mushroom.edibility.replace(DICT)
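
One detail worth checking when recoding with a dictionary is what happens to codes that are missing from it. The sketch below compares Series.replace with Series.map on a made-up column (odor and names are hypothetical; the code 'z' is deliberately left out of the dictionary):

```python
import pandas as pd

# hypothetical toy column; 'z' is deliberately missing from the dictionary
odor = pd.Series(['a', 'n', 'z'])
names = {'a': 'almond', 'n': 'none'}

replaced = odor.replace(names)  # the unmatched 'z' is kept as is
mapped = odor.map(names)        # the unmatched 'z' becomes a missing value

print(replaced.tolist())
print(mapped.isna().tolist())
```

With replace, an incomplete dictionary leaves stray codes in the data, whereas map turns them into missing values that are easier to detect.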

(e) Make a contingency table cross-tabulating 'edibility' and 'odor'.
Solution:
pd.crosstab(mushroom.odor, mushroom.edibility)
edibility  edible  poisonous
odor
almond        400          0
anise         400          0
creosote        0        192
fishy           0        576
foul            0       2160
musty           0         36
none         3408        120
pungent         0        256
spicy           0        576

(f) Which mushroom odors should be avoided when gathering mushrooms for consumption?
Solution: From the table in the previous question, we see that every odor except
almond and anise has at least one poisonous observation in the data.
Thus, all odors other than these two should be avoided.
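
The same conclusion can be read off programmatically. The following sketch re-enters the counts from the contingency table in part (e) by hand (table and safe_odors are names introduced here for illustration) and selects the odors with no poisonous observations:

```python
import pandas as pd

# counts copied from the contingency table in part (e)
table = pd.DataFrame(
    {'edible': [400, 400, 0, 0, 0, 0, 3408, 0, 0],
     'poisonous': [0, 0, 192, 576, 2160, 36, 120, 256, 576]},
    index=['almond', 'anise', 'creosote', 'fishy', 'foul',
           'musty', 'none', 'pungent', 'spicy'])

# odors for which no poisonous sample was recorded
safe_odors = table.index[table['poisonous'] == 0].tolist()
print(safe_odors)
```

On the real data, the same filter can be applied directly to the output of pd.crosstab.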
(g) What proportion of odorless mushroom samples were safe to eat?
Solution: We can calculate the proportion from the counts in the 'none' row of the
contingency table:
3408/(3408+120)
0.9659863945578231

Alternatively, we can find the answer without the table:
mushroom[(mushroom.edibility == 'edible') & (mushroom.odor == 'none')].shape[0] / mushroom[mushroom.odor == 'none'].shape[0]
0.9659863945578231
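
A shorter route, sketched below, uses the fact that the mean of a boolean Series is the fraction of True values. To keep the sketch self-contained, the odorless and foul samples are rebuilt from the counts in the contingency table (toy and prop_edible are hypothetical names):

```python
import pandas as pd

# rebuild the 'none' and 'foul' rows from the counts in the contingency table
toy = pd.DataFrame({
    'odor': ['none'] * 3528 + ['foul'] * 2160,
    'edibility': ['edible'] * 3408 + ['poisonous'] * 120 + ['poisonous'] * 2160})

# keep the odorless samples, then average the boolean "is edible" indicator
odorless = toy.loc[toy['odor'] == 'none', 'edibility']
prop_edible = (odorless == 'edible').mean()
print(prop_edible)
```

Applied to mushroom instead of toy, the last two statements should reproduce the proportion computed above.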


2. Change the type and value of variables in the nutri data set according to Table 1.2 and
save the data as a CSV file. The modified data should have eight categorical features, three
floats, and two integer features.
Solution:
import pandas as pd
import numpy as np

nutri = pd.read_excel('nutrition_elderly.xls')

nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')

nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')

# create a dictionary that will be used for multiple columns
freq_dict = {0: 'Never', 1: '< once a week', 2: 'once a week',
             3: '2/3 times a week', 4: '4-6 times a week', 5: 'every day'}
cols = ['meat', 'fish', 'raw_fruit', 'cooked_fruit_veg', 'chocol']
nutri[cols] = nutri[cols].replace(freq_dict).astype('category')

nutri.fat = nutri.fat.replace({1: 'Butter', 2: 'Margarine', 3: 'Peanut oil',
                               4: 'Sunflower oil', 5: 'Olive oil',
                               6: 'Mix of vegetable oils', 7: 'Colza oil',
                               8: 'Duck or goose fat'}).astype('category')

# assign the float data type to the required columns
cols = ['height', 'weight', 'age']
nutri[cols] = nutri[cols].astype('float')


We then verify that the modified data has the correct types for each feature:

nutri.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 226 entries, 0 to 225
# Data columns (total 13 columns):
# gender              226 non-null category
# situation           226 non-null category
# tea                 226 non-null int64
# coffee              226 non-null int64
# height              226 non-null float64
# weight              226 non-null float64
# age                 226 non-null float64
# meat                226 non-null category
# fish                226 non-null category
# raw_fruit           226 non-null category
# cooked_fruit_veg    226 non-null category
# chocol              226 non-null category
# fat                 226 non-null category
# dtypes: category(8), float64(3), int64(2)
# memory usage: 12.3 KB
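
The exercise also asks to save the data as a CSV file, a step not shown in the listing above. One way this could be done is sketched below on a made-up miniature of the data (toy, buffer, and back are hypothetical names, and the values are illustrative only). The sketch writes to an in-memory buffer; passing a file name to to_csv works the same way.

```python
import io
import pandas as pd

# hypothetical miniature of the nutri data
toy = pd.DataFrame({'gender': [1, 2, 2], 'height': [151, 162, 170]})
toy['gender'] = toy['gender'].replace({1: 'Male', 2: 'Female'}).astype('category')
toy['height'] = toy['height'].astype('float')

# write without the row index, so that only the features are stored
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# CSV stores plain text, so the category type is requested again on reading
buffer.seek(0)
back = pd.read_csv(buffer, dtype={'gender': 'category'})
print(back.dtypes)
```

Because a CSV file does not record column types, the categorical features need to be declared again (for example via the dtype argument) whenever the saved file is read back.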


3. It frequently happens that a table with data needs to be restructured before the data can
be analyzed using standard statistical software. As an example, consider the test scores in
Table 1.3 of 5 students before and after specialized tuition.

Table 1.3: Student scores.

Student Before After
1 75 85
2 30 50
3 100 100
4 50 52
5 60 65


This is not in the standard format described in Section 1.1. In particular, the student scores
are divided over two columns, whereas the standard format requires that they are collected
in one column, e.g., labelled 'Score'. Reformat the table in standard format, using three
features:



• 'Score', taking continuous values,
• 'Time', taking values 'Before' and 'After',
• 'Student', taking values from 1 to 5.

Solution: Up to a possible reordering of the rows, your table should look like the one given
below, which was made with the melt method of pandas.

import pandas as pd

# manually create dataframe with data from table
values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])

# format dataframe as required
df = pd.melt(df, id_vars=['Student'], var_name="Time",
             value_vars=['Before', 'After'])
print(df)
   Student    Time  value
0        1  Before     75
1        2  Before     30
2        3  Before    100
3        4  Before     50
4        5  Before     60
5        1   After     85
6        2   After     50
7        3   After    100
8        4   After     52
9        5   After     65
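
In the output above the melted column carries the default name value, whereas the exercise asks for a feature called 'Score'. A small variation, sketched here, passes the value_name argument of pd.melt to set that name directly (long_df is a name introduced for this sketch):

```python
import pandas as pd

values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])

# value_name sets the label of the column that holds the melted values
long_df = pd.melt(df, id_vars=['Student'], value_vars=['Before', 'After'],
                  var_name='Time', value_name='Score')
print(long_df.columns.tolist())
```

The rows are the same as before; only the name of the last column changes.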


4. Create a barplot similar to the one in Figure 1.5, but now plot the corresponding proportions of
males and females in each of the three situation categories. That is, the heights of the bars
with the same 'gender' value should sum up to 1. [Hint: seaborn does
not have this functionality built in; instead, you need to first create a contingency table and
use matplotlib.pyplot to produce the figure.]
Solution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)

nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')
nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')

# counts of each situation category, split by gender
contingencyTable = pd.crosstab(nutri.gender, nutri.situation)
male_counts = contingencyTable.stack().Male
female_counts = contingencyTable.stack().Female

xind = np.arange(len(nutri.situation.unique()))
width = 0.3
plt.figure(figsize=[5, 3])
# divide by the totals so that the bars of each gender sum to 1
plt.bar(xind - width/2, male_counts/male_counts.sum(), width,
        color='SkyBlue', label='Men', edgecolor='black')
plt.bar(xind + width/2, female_counts/female_counts.sum(), width,
        color='Pink', label='Women', edgecolor='black')
plt.ylabel('Proportions')
plt.xticks(xind + width/2, contingencyTable.columns)
plt.legend(loc=(0.4, 0.7))




[Figure: grouped bar plot of the proportions of men and women in each situation category (Couple, Family, Single); vertical axis: Proportions, from 0.0 to 0.7.]
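
The proportions within each gender can also be obtained in one step via the normalize argument of pd.crosstab. The sketch below uses a made-up miniature of the data (toy and props are hypothetical names, and the values are illustrative only, not taken from the nutri data):

```python
import pandas as pd

# made-up miniature of the nutri data (values are illustrative only)
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'situation': ['Single', 'Couple', 'Couple',
                  'Single', 'Single', 'Family', 'Couple']})

# normalize='index' divides each row by its total, so every gender sums to 1
props = pd.crosstab(toy.gender, toy.situation, normalize='index')
print(props)
```

Each row of props then holds the bar heights for one gender, which could be passed to plt.bar in place of the manual division by the row sums.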



5. The iris data set, mentioned in Section 1.1, contains various features, including
'Petal.Length' and 'Sepal.Length', of three species of iris: setosa, versicolor, and
virginica.

(a) Load the data set into a pandas DataFrame object.
Solution:

import pandas as pd
urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'datasets/iris.csv'
iris = pd.read_csv(urlprefix + dataname)


(b) Using matplotlib.pyplot, produce boxplots of 'Petal.Length' for each of the
three species, in one figure.
Solution:

import matplotlib.pyplot as plt

labels = ["setosa", "versicolor", "virginica"]
# collect the petal lengths of each species (labels are in the 'Species' column)
petal_lengths = [iris.loc[iris["Species"] == s, "Petal.Length"] for s in labels]
plt.boxplot(petal_lengths)
plt.xticks([1, 2, 3], labels)
