Solutions Manual
to Accompany
Data Science and Machine Learning:
Mathematical and Statistical Methods
Dirk P. Kroese Zdravko I. Botev Thomas Taimre
Slava Vaisman Robert Salomone
8th January 2020
CONTENTS
Preface
1 Importing, Summarizing, and Visualizing Data
2 Statistical Learning
3 Monte Carlo Methods
4 Unsupervised Learning
5 Regression
6 Kernel Methods
7 Classification
8 Tree Methods
9 Deep Learning
PREFACE
We believe that the only effective way to master the theory and practice of Data Science
and Machine Learning is through exercises and experiments. For this reason, we included
many exercises and algorithms in Data Science and Machine Learning: Mathematical and
Statistical Methods (DSML), Chapman and Hall/CRC, 2019.
This companion volume to DSML is written in the same style and contains a wealth of
additional material: worked solutions for the over 150 exercises in DSML, many Python
programs, and additional illustrations.
Like DSML, this solutions manual is aimed at anyone interested in gaining a better
understanding of the mathematics and statistics that underpin the rich variety of ideas and
machine learning algorithms in data science. One of the main goals of the manual is to
provide a comprehensive solutions guide for instructors, which will aid student assessment
and stimulate further student development. In addition, this manual offers a unique
complement to DSML for self-study. All too often, a stumbling block for learning is the
unavailability of worked solutions and actual algorithms.
The solutions manual covers a wide range of exercises in data analysis, statistical
learning, Monte Carlo methods, unsupervised learning, regression, regularization and
kernel methods, classification, decision trees and ensemble methods, and deep learning.
Our choice of Python was motivated by its ease of use and clarity of syntax.
Reference numbers to DSML are indicated in boldface blue font. For example, Definition
1.1.1 refers to the corresponding definition in DSML, and (1.7) refers to equation (1.7)
in DSML, whereas Figure 1.1 refers to the first numbered figure in the present document.
This solutions manual was financially supported by the Australian Research Council
Centre of Excellence for Mathematical & Statistical Frontiers, under grant number
CE140100049.
Dirk Kroese, Zdravko Botev,
Thomas Taimre, Radislav Vaisman, and Robert Salomone
Brisbane and Sydney
CHAPTER 1
IMPORTING, SUMMARIZING, AND VISUALIZING DATA
1. Visit the UCI Repository https://archive.ics.uci.edu/. Read the description of
the data and download the Mushroom data set agaricus-lepiota.data. Using pandas,
read the data into a DataFrame called mushroom, via read_csv.
We can import the file directly via its URL:
import pandas as pd
URL = 'http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushroom = pd.read_csv(URL, header=None)
(a) How many features are in this data set?
Solution: There are 23 features.
mushroom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
0     8124 non-null object
1     8124 non-null object
2     8124 non-null object
3     8124 non-null object
4     8124 non-null object
5     8124 non-null object
6     8124 non-null object
7     8124 non-null object
8     8124 non-null object
9     8124 non-null object
10    8124 non-null object
11    8124 non-null object
12    8124 non-null object
13    8124 non-null object
14    8124 non-null object
15    8124 non-null object
16    8124 non-null object
17    8124 non-null object
18    8124 non-null object
19    8124 non-null object
20    8124 non-null object
21    8124 non-null object
22    8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
(b) What are the initial names and types of the features?
Solution: From the output of mushroom.info(), we see that the initial names of the
features are 0, 1, 2, ..., 22, and that they all have the type object.
(c) Rename the first feature (index 0) to 'edibility' and the sixth feature (index 5) to
'odor'. [Hint: the column names in pandas are immutable, so individual columns
cannot be modified directly. However, it is possible to assign the entire list of column
names via mushroom.columns = newcols.]
Solution:
# create a list object that contains the column names
newcols = mushroom.columns.tolist()
# assign new values in the list
newcols[0] = 'edibility'
newcols[5] = 'odor'
# replace the column names with our list
mushroom.columns = newcols
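An alternative (a sketch, not the approach above) is the DataFrame.rename method, which renames selected columns by label in one call; the small frame below is a hypothetical stand-in for the mushroom data:

```python
import pandas as pd

# tiny stand-in for the mushroom DataFrame (hypothetical values)
mushroom = pd.DataFrame([['p', 'x', 's', 'n', 't', 'p'],
                         ['e', 'x', 's', 'y', 't', 'a']])

# rename only the columns of interest, keyed by their current integer labels
mushroom = mushroom.rename(columns={0: 'edibility', 5: 'odor'})
print(mushroom.columns.tolist())
```

Unlike assigning mushroom.columns, rename leaves the remaining column labels untouched.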
(d) The sixth column lists the various odors of the mushrooms, encoded as 'a', 'c', ....
Replace these with the names 'almond', 'creosote', etc. (the categories corresponding
to each letter can be found on the website). Also replace the 'edibility'
categories 'e' and 'p' with 'edible' and 'poisonous'.
Solution:
DICT = {'a': 'almond', 'c': 'creosote', 'f': 'foul',
        'l': 'anise', 'm': 'musty', 'n': 'none', 'p': 'pungent',
        's': 'spicy', 'y': 'fishy'}
mushroom.odor = mushroom.odor.replace(DICT)
DICT = {'e': 'edible', 'p': 'poisonous'}
mushroom.edibility = mushroom.edibility.replace(DICT)
(e) Make a contingency table cross-tabulating 'edibility' and 'odor'.
Solution:
pd.crosstab(mushroom.odor, mushroom.edibility)
edibility  edible  poisonous
odor
almond        400          0
anise         400          0
creosote        0        192
fishy           0        576
foul            0       2160
musty           0         36
none         3408        120
pungent         0        256
spicy           0        576
(f) Which mushroom odors should be avoided when gathering mushrooms for consumption?
Solution: The table in the previous question shows that every odor other than almond
and anise has at least some poisonous observations. Thus, all odors other than these
two should be avoided.
(g) What proportion of odorless mushroom samples were safe to eat?
Solution: We can read the relevant counts directly off the contingency table and
compute:
3408/(3408+120)
0.9659863945578231
Alternatively, we can find the answer without the table:
mushroom[(mushroom.edibility == 'edible') & (mushroom.odor == 'none')].shape[0] \
    / mushroom[mushroom.odor == 'none'].shape[0]
0.9659863945578231
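The same proportion can also be obtained in one step by normalizing the contingency table: pd.crosstab accepts a normalize argument, and normalize='index' divides each row by its total. A small hypothetical frame illustrates the idea:

```python
import pandas as pd

# hypothetical miniature of the mushroom data: 5 odorless (4 edible), 2 foul
mushroom = pd.DataFrame({
    'odor': ['none'] * 5 + ['foul'] * 2,
    'edibility': ['edible'] * 4 + ['poisonous'] * 3,
})

# each row of the normalized table now holds proportions that sum to 1
tab = pd.crosstab(mushroom.odor, mushroom.edibility, normalize='index')
print(tab.loc['none', 'edible'])  # proportion of odorless samples that are edible
```

On the full data set, tab.loc['none', 'edible'] would give the same 0.9659... as above.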
2. Change the type and value of variables in the nutri data set according to Table 1.2 and
save the data as a CSV file. The modified data should have eight categorical features, three
floats, and two integer features.
Solution:
import pandas as pd
nutri = pd.read_excel('nutrition_elderly.xls')
nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')
nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')
# create a dictionary that will be used for multiple columns
freq_dict = {0: 'Never', 1: '< once a week', 2: 'once a week',
             3: '2/3 times a week', 4: '4-6 times a week', 5: 'every day'}
cols = ['meat', 'fish', 'raw_fruit', 'cooked_fruit_veg', 'chocol']
nutri[cols] = nutri[cols].replace(freq_dict).astype('category')
nutri.fat = nutri.fat.replace({1: 'Butter', 2: 'Margarine', 3: 'Peanut oil',
                               4: 'Sunflower oil', 5: 'Olive oil',
                               6: 'Mix of vegetable oils', 7: 'Colza oil',
                               8: 'Duck or goose fat'}).astype('category')
# assign the float data type to the required columns
cols = ['height', 'weight', 'age']
nutri[cols] = nutri[cols].astype('float')
We then verify that the modified data has the correct types for each feature:
nutri.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 226 entries, 0 to 225
# Data columns (total 13 columns):
# gender              226 non-null category
# situation           226 non-null category
# tea                 226 non-null int64
# coffee              226 non-null int64
# height              226 non-null float64
# weight              226 non-null float64
# age                 226 non-null float64
# meat                226 non-null category
# fish                226 non-null category
# raw_fruit           226 non-null category
# cooked_fruit_veg    226 non-null category
# chocol              226 non-null category
# fat                 226 non-null category
# dtypes: category(8), float64(3), int64(2)
# memory usage: 12.3 KB
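The exercise also asks to save the modified data as a CSV file. A minimal sketch with to_csv (the file name nutri.csv is our choice, and the tiny frame below stands in for the modified data; note that index=False keeps the RangeIndex out of the file):

```python
import pandas as pd

# small stand-in for the modified nutri DataFrame (hypothetical values)
nutri = pd.DataFrame({'gender': pd.Categorical(['Male', 'Female']),
                      'height': [170.0, 165.0]})

# write without the index column, so re-reading yields the same columns
nutri.to_csv('nutri.csv', index=False)

# quick round-trip check (note: the category dtype is not preserved by CSV)
nutri2 = pd.read_csv('nutri.csv')
print(nutri2.columns.tolist())
```

Because CSV stores no type information, the categorical conversions must be re-applied after reading the file back.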
3. It frequently happens that a table with data needs to be restructured before the data can
be analyzed using standard statistical software. As an example, consider the test scores in
Table 1.3 of 5 students before and after specialized tuition.
Table 1.3: Student scores.
Student Before After
1 75 85
2 30 50
3 100 100
4 50 52
5 60 65
This is not in the standard format described in Section 1.1. In particular, the student scores
are divided over two columns, whereas the standard format requires that they are collected
in one column, e.g., labelled 'Score'. Reformat the table in standard format, using three
features:
,Chapter 1. Importing, Summarizing, and Visualizing Data
• 'Score', taking continuous values,
• 'Time', taking values 'Before' and 'After',
• 'Student', taking values from 1 to 5.
Solution: Up to a possible reordering of the rows, your table should look like the one given
below, which was made with the melt method of pandas.
# manually create dataframe with data from table
values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
import pandas as pd
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])
# reshape to long format; value_name gives the score column its required name
df = pd.melt(df, id_vars=['Student'], var_name='Time',
             value_vars=['Before', 'After'], value_name='Score')
print(df)
   Student    Time  Score
0        1  Before     75
1        2  Before     30
2        3  Before    100
3        4  Before     50
4        5  Before     60
5        1   After     85
6        2   After     50
7        3   After    100
8        4   After     52
9        5   After     65
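For this simple two-column case, the same long format can also be reached with stack (a sketch, not the method above; note that the row ordering differs from melt, interleaving Before/After per student):

```python
import pandas as pd

values = [[1, 75, 85], [2, 30, 50], [3, 100, 100], [4, 50, 52], [5, 60, 65]]
df = pd.DataFrame(values, columns=['Student', 'Before', 'After'])

# stack the two score columns into one long column, then flatten the index
long = df.set_index('Student').stack().reset_index()
long.columns = ['Student', 'Time', 'Score']
print(long.head(4))
```

Both routes produce the same three features; melt is generally preferred when only some columns should be unpivoted.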
4. Create a barplot similar to the one in Figure 1.5, but now plot the corresponding proportions of
males and females in each of the three situation categories. That is, the heights of the bars
should sum up to 1 for each barplot with the same 'gender' value. [Hint: seaborn does
not have this functionality built in; instead, you need to first create a contingency table and
use matplotlib.pyplot to produce the figure.]
Solution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
xls = 'http://www.biostatisticien.eu/springeR/nutrition_elderly.xls'
nutri = pd.read_excel(xls)
nutri.gender = nutri.gender.replace({1: 'Male', 2: 'Female'}).astype('category')
nutri.situation = nutri.situation.replace({1: 'Single', 2: 'Couple',
                                           3: 'Family'}).astype('category')
contingencyTable = pd.crosstab(nutri.gender, nutri.situation)
male_counts = contingencyTable.stack().Male
female_counts = contingencyTable.stack().Female
xind = np.arange(len(nutri.situation.unique()))
width = 0.3
plt.figure(figsize=[5, 3])
plt.bar(xind - width/2, male_counts/male_counts.sum(), width,
        color='SkyBlue', label='Men', edgecolor='black')
plt.bar(xind + width/2, female_counts/female_counts.sum(), width,
        color='Pink', label='Women', edgecolor='black')
plt.ylabel('Proportions')
plt.xticks(xind, contingencyTable.columns)  # ticks centered between the paired bars
plt.legend(loc=(0.4, 0.7))
[Figure: grouped barplot of the proportions of Men and Women in the Couple, Family, and Single categories; y-axis 'Proportions', ranging from 0.0 to 0.7.]
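A shorter route (a sketch, not the method used above) is to normalize the crosstab directly and let pandas draw the grouped bars; the toy frame below is a hypothetical stand-in for the nutri data:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# hypothetical miniature of the nutri data
nutri = pd.DataFrame({
    'gender': ['Male'] * 3 + ['Female'] * 3,
    'situation': ['Single', 'Couple', 'Couple', 'Single', 'Single', 'Family'],
})

# normalize='index' makes each gender's bar heights sum to 1
tab = pd.crosstab(nutri.gender, nutri.situation, normalize='index')
ax = tab.T.plot.bar(rot=0, edgecolor='black', figsize=(5, 3))
ax.set_ylabel('Proportions')
plt.tight_layout()
```

The transpose puts the situation categories on the x-axis, matching the figure above, and plot.bar handles the bar offsets automatically.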
5. The iris data set, mentioned in Section 1.1, contains various features, including
'Petal.Length' and 'Sepal.Length', of three species of iris: setosa, versicolor, and
virginica.
(a) Load the data set into a pandas DataFrame object.
Solution:
import pandas as pd
urlprefix = 'http://vincentarelbundock.github.io/Rdatasets/csv/'
dataname = 'datasets/iris.csv'
iris = pd.read_csv(urlprefix + dataname)
(b) Using matplotlib.pyplot, produce boxplots of 'Petal.Length' for each of the
three species, in one figure.
Solution:
import matplotlib.pyplot as plt
labels = ["setosa", "versicolor", "virginica"]
# select the petal lengths of each species separately
data = [iris[iris.Species == s]['Petal.Length'] for s in labels]
plt.boxplot(data, labels=labels)
plt.ylabel('Petal.Length')
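Pandas can also produce the grouped boxplots in a single call via DataFrame.boxplot with the by argument (a sketch; the tiny frame below is a hypothetical stand-in for the iris data loaded above):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

# hypothetical miniature of the iris DataFrame
iris = pd.DataFrame({'Species': ['setosa'] * 3 + ['virginica'] * 3,
                     'Petal.Length': [1.4, 1.3, 1.5, 5.1, 5.9, 5.5]})

# one box per species, grouped automatically
ax = iris.boxplot(column='Petal.Length', by='Species')
plt.tight_layout()
```

This avoids building the per-species lists by hand, at the cost of pandas' default figure titles, which may need to be cleared for publication-quality output.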