Solutions Manual: Foundations of Statistical Science for Data Scientists With R and Python, 1e by Alan Agresti, Maria Kateri
(All Chapters)
Chapter 1
1.1 (a) (i) an individual voter, (ii) the 1882 voters in the exit poll, (iii) the 11.1 million
people who voted
(b) Statistic: Sample percentage of 52.5% who voted for Feinstein
Parameter: Population percentage of 54.2% who voted for Feinstein
1.2 (a) In R, use a command such as
> Students <- read.table("
+ header=TRUE)
(b) (i) What proportion of the students in this sample responded yes for whether
abortion should be legal in the first three months; (ii) Same question but for some
population, such as all social science graduate students at the University of Florida
1.3 (a) Quantitative; (b) categorical; (c) categorical; (d) quantitative
1.4 (a) Religious affiliation (possible categories: Christianity, Islam, Judaism,
Hinduism, Buddhism, other, none)
(b) Body mass index (BMI = (weight in kg)/(height in meters)^2)
(c) Number of children in family
(d) Height of a person
1.5 Ordinal, because the categories have a natural ordering
1.6 (a) College board score (e.g., SAT between 200 and 800)
(b) Time spent in college (measured by integer number of years)
1.7 In R, for students numbered 00001 to 52000,
> sample(1:52000, 10)
[1] 1687 18236 26783 35366 14244 11429 20973 31436 48476
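A Python counterpart to the R call, offered as a sketch: `random.sample` draws without replacement, as R's `sample` does by default. The seed value is an arbitrary choice made here for reproducibility and is not part of the original solution.

```python
import random

# Draw 10 distinct student numbers from 1..52000 (sampling without replacement).
rng = random.Random(2024)  # arbitrary seed, an assumption for reproducibility
ids = rng.sample(range(1, 52001), 10)
print(sorted(ids))
```

Any seed, or none, gives a valid simple random sample; only the particular numbers drawn change.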
1.8 (a) observational, (b) experiment, (c) observational, (d) experiment
1.9 Median = 4, mode = 2, expect mean larger than median because distribution is skewed
right
1.10 (a)
, 2 Solutions Manual: Foundations of Statistical Science for Data Scientists
> Carbon <- read.table("http://stat4ds.rwth-aachen.de/data/Carbon_West.dat",
+ header=TRUE)
> breaks <- seq(2.0, 18.0, by=2.0)
> freq <- table(cut(Carbon$CO2, breaks, right=FALSE))
> cbind(freq, freq/nrow(Carbon))
        freq
[2,4)      4 0.11428571
[4,6)     15 0.42857143
[6,8)      7 0.20000000
[8,10)     6 0.17142857
[10,12)    0 0.00000000
[12,14)    0 0.00000000
[14,16)    2 0.05714286
[16,18)    1 0.02857143
> hist(Carbon$CO2)
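The left-closed binning done by `cut(..., right=FALSE)` can be sketched in Python as follows. The helper name `frequency_table` and the data values are hypothetical stand-ins chosen for illustration; they are not the Carbon_West data.

```python
from bisect import bisect_right

def frequency_table(values, breaks):
    """Count values in left-closed bins [b0, b1), [b1, b2), ..."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if breaks[0] <= v < breaks[-1]:
            # bisect_right places a value equal to a break into the bin it opens
            counts[bisect_right(breaks, v) - 1] += 1
    return counts

# Hypothetical CO2-like values (NOT the Carbon_West data)
co2 = [2.5, 4.0, 4.9, 5.5, 6.1, 7.8, 9.3, 15.2]
breaks = [2, 4, 6, 8, 10, 12, 14, 16, 18]
freq = frequency_table(co2, breaks)

# Print each bin with its count and relative frequency
for lo, hi, f in zip(breaks, breaks[1:], freq):
    print(f"[{lo},{hi})  {f}  {f / len(co2):.4f}")
```

Note that the value 4.0 is counted in [4,6) rather than [2,4), which matches the interval labels in the R output.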
(b) Mean = 6.72, median = 5.90, standard deviation = 3.36
> mean(Carbon$CO2); median(Carbon$CO2); sd(Carbon$CO2)
1.11 Skewed to the right, because the mean is much larger than the median.
1.12 Number of times you went to a gym in the last week; median = 0 if more than half of
persons in the sample never went.
1.13 (a) 63,000 to 75,000; (b) 57,000 to 81,000; (c) 51,000 to 87,000. 100,000 would be unusual
because it is more than 5 standard deviations above the mean.
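The arithmetic behind these intervals can be checked with a short Python sketch. The mean of 69,000 and standard deviation of 6,000 are not stated in the answer; they are inferred here from the first interval (midpoint 69,000, half-width 6,000).

```python
# Values inferred from the interval 63,000 to 75,000 (mean plus or minus 1 sd)
mean, sd = 69_000, 6_000

# Intervals within 1, 2, and 3 standard deviations of the mean
intervals = [(mean - k * sd, mean + k * sd) for k in (1, 2, 3)]
for k, (lo, hi) in zip((1, 2, 3), intervals):
    print(f"within {k} sd: {lo:,} to {hi:,}")

# Number of standard deviations that 100,000 lies above the mean
z = (100_000 - mean) / sd
print(f"z-score of 100,000: {z:.2f}")
```

The z-score of about 5.17 is what makes 100,000 unusual for a bell-shaped distribution.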
1.14 A quarter of the states had less than 6% without insurance, and a quarter had more than
9.5% without insurance. Half the states had between 6% and 9.5% without insurance,
an interquartile range of 3.5 percentage points.
1.15 Skewed to the right, because the median is closer to the LQ and the minimum than
to the UQ and the maximum.
1.16 (a) The percentages in 2018 (with the default composite weight) for (0, 1, 2, 3, 4, 5,
6, ≥ 7) are (9.4, 24.8, 24.9, 14.8, 10.7, 5.3, 3.5, 6.7), somewhat skewed to the right.
(b) Mode = 2, median = 2
(c) Mean = 2.8, standard deviation = 2.6. The lowest possible observation is only
slightly more than a standard deviation below the mean, whereas in bell-shaped
distributions, observations can occur two or three standard deviations from the
mean in each direction.
1.17 > Murder <- read.table("http://stat4ds.rwth-aachen.de/data/Murder.dat", header=TRUE)
> Murder1 <- Murder[Murder$state!="DC",] # data frame without D.C.
(a) Mean = 4.87, standard deviation = 2.59
> mean(Murder1$murder); sd(Murder1$murder)
(b) Minimum = 1.0, LQ = 2.6, median = 4.85, UQ = 6.2, maximum = 12.4, somewhat
skewed right
> summary(Murder1$murder); boxplot(Murder1$murder)
(c) Repeat the analysis above for Murder$murder, which includes D.C. D.C. is a large
outlier, causing the mean to increase (from 4.87 to 5.25) and the range to increase
dramatically (from 11.4 to 23.2).
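As a consistency check on the reported figures, the D.C. value implied by the two means can be compared with the maximum implied by the range. This Python sketch assumes the file holds the 50 states plus D.C.; that count is an assumption, not something stated in the solution.

```python
n_states = 50                         # assumption: 50 states plus D.C. in the file
mean_without, mean_with = 4.87, 5.25  # rounded means quoted in the solution

# Total with D.C. minus total without D.C. gives the D.C. observation
implied_dc = (n_states + 1) * mean_with - n_states * mean_without

# Maximum implied by the minimum (1.0) and the range with D.C. included (23.2)
max_from_range = 1.0 + 23.2

print(round(implied_dc, 2), round(max_from_range, 1))
```

The two values agree up to rounding of the means, so a single observation near 24 accounts for both the shift in the mean and the doubling of the range.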
1.18 (a) Histogram is skewed right.
> Income <- read.table("http://stat4ds.rwth-aachen.de/data/Income.dat",
+ header=TRUE); attach(Income)
> hist(income)
(b) Five-number summary is min. = 16, lower quartile = 22, median = 30, upper
quartile = 46.5, max. = 120; also mean = 37.52 and standard deviation = 20.67.
> summary(income); sd(income)
(c) Density approximation with default bandwidth = 6.85 is skewed right. Increasing
the bandwidth (such as to 12) makes the curve smoother and bell-shaped, but still
skewed. Decreasing it (such as to 3) makes it much bumpier and probably a poorer
portrayal of a corresponding population distribution.
> plot(density(income)) # default bandwidth = 6.85
> plot(density(income, bw=12))
(d) > boxplot(income ~ race, xlab="Income", horizontal=TRUE)
> tapply(income, race, summary)
$B
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.00 19.50 24.00 27.75 31.00 66.00
$H
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.0 20.5 30.0 31.0 32.0 58.0
$W
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 24.00 37.00 42.48 50.00 120.00
> install.packages("tidyverse")
> library(tidyverse)
> Income %>% group_by(race) %>% summarize(n=n(),mean=mean(income),sd=sd(income))
race n mean sd
1 B 16 27.8 13.3
2 H 14 31 12.8
3 W 50 42.5 22.9
1.19 (a) Highly skewed right
> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat",
+ header=TRUE); attach(Houses)
> PriceH <- hist(price); hist(price) # save histogram to use its breaks
> breaks <- PriceH$breaks # breaks used in histogram
> freq <- table(cut(Houses$price,breaks, right=FALSE))
> cbind(freq,freq/nrow(Houses)) # frequency table (not shown)
(b) ȳ = 233.0, s = 151.9; 85% of the observations fall within one standard deviation
of the mean, not close to 68%, because the distribution is highly skewed rather than
bell-shaped
> length(case[mean(price)-sd(price) < price & price < mean(price)+sd(price)]) /
+ nrow(Houses)
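The same proportion can be sketched in Python with the standard library. The function name `share_within_one_sd` and the price values are hypothetical; they only imitate a right-skewed sample and are not the Houses data.

```python
from statistics import mean, stdev

def share_within_one_sd(values):
    """Proportion of values strictly within one standard deviation of the mean."""
    m, s = mean(values), stdev(values)  # stdev divides by n - 1, like R's sd()
    inside = [v for v in values if m - s < v < m + s]
    return len(inside) / len(values)

# Hypothetical right-skewed prices in thousands of dollars (NOT the Houses data)
prices = [90, 110, 120, 135, 150, 160, 175, 190, 210, 880]
share = share_within_one_sd(prices)
print(share)
```

With one very large value inflating the standard deviation, 9 of these 10 values fall inside the interval, well above the 68% expected for a bell-shaped distribution.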
(c) The boxplot shows many large observations that are outliers.
> boxplot(price)
(d) > tapply(Houses$price, Houses$new, summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.5 135.0 190.8 207.9 240.0 880.5
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
158.8 256.9 427.5 436.4 519.7 866.2
New homes tend to have higher selling prices.
1.20 (a) Clear trend that price tends to increase as size increases.
> plot(size, price)
(b) 0.834, strong positive association
> cor(size, price)
(c) Predicted price = −76.39 + 0.19(size), which is 113.5 thousand dollars at 1000
square feet and 683.2 thousand dollars at 4000 square feet.
> summary(lm(price ~ size)) # linear model: read off the coefficient estimates
> pred <- function(x){-76.3894+0.1899*x}; pred(1000); pred(4000)
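The prediction step translates directly to Python. This sketch reuses the coefficient estimates quoted in the R `pred` function; the names `INTERCEPT`, `SLOPE`, and `predicted_price` are introduced here for illustration.

```python
# Coefficient estimates taken from the R pred() function in the solution
INTERCEPT, SLOPE = -76.3894, 0.1899

def predicted_price(size_sqft):
    """Predicted selling price (thousands of dollars) for a given size in square feet."""
    return INTERCEPT + SLOPE * size_sqft

for size in (1000, 4000):
    print(size, round(predicted_price(size), 1))
```

Each additional 1000 square feet adds about 189.9 thousand dollars to the predicted price, which is why the prediction rises from 113.5 to 683.2 over this range.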
1.21 Correlation = 0.278 (positive but weak), predicted college GPA is 2.75 + 0.22(high
school GPA), which is 3.6 for high school GPA of 4.0.
1.22 > Happy <- read.table("http://stat4ds.rwth-aachen.de/data/Happy.dat", header=TRUE)
> Happiness <- factor(Happy$happiness); Marital <- factor(Happy$marital)
> levels(Happiness) <- c("Very happy", "Pretty happy", "Not too happy")
> levels(Marital) <- c("Married", "Divorced/Separated", "Never married")
> table(Marital, Happiness) # forms contingency table
                    Happiness
Marital              Very happy Pretty happy Not too happy
  Married                   432          504            61
  Divorced/Separated         92          282           103
  Never married             124          409           135
> prop.table(table(Marital,Happiness), 1)
                    Happiness
Marital              Very happy Pretty happy Not too happy
  Married            0.43329990   0.50551655    0.06118355
  Divorced/Separated 0.19287212   0.59119497    0.21593291
  Never married      0.18562874   0.61227545    0.20209581
Married subjects are more likely to be very happy and less likely to be not too happy
than the other subjects.
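The row proportions produced by `prop.table(..., 1)` are each count divided by its row total. A minimal Python sketch using the counts from the contingency table (the function name `row_proportions` is hypothetical):

```python
# Counts from the contingency table: Very happy, Pretty happy, Not too happy
counts = {
    "Married": [432, 504, 61],
    "Divorced/Separated": [92, 282, 103],
    "Never married": [124, 409, 135],
}

def row_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    return {row: [c / sum(cells) for c in cells] for row, cells in table.items()}

props = row_proportions(counts)
for row, p in props.items():
    print(f"{row:<20}", " ".join(f"{x:.3f}" for x in p))
```

Conditioning on marital status (rows) rather than on happiness (columns) is what lets the three groups be compared directly despite their different sample sizes.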
1.23 > attach(Students)
> table(relig, abor)
abor
relig 0 1
0 1 14
1 4 25
2 1 6
3 7 2
The very religious (attending every week) are less likely to support legal abortion (only
2 of the 9 observations in support).
1.24 (a) Values are skewed right, with mean 153.9 and median 119.8 and a very high outlier
of 716 for the U.S.
(b) Correlation = 0.90 between GDP and HDI.
(c) Correlation = 0.674, predicted CO2 = 1.926 + 0.178(GDP), which increases
dramatically from 2.71 at the minimum GDP = 4.4 to 13.11 at the maximum GDP
= 62.9.
1.25 > Races <- read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat", header=TRUE)
> attach(Races)
> par(mfrow=c(2,2)) # a matrix of 2x2 plots in one graph
> boxplot(timeM); boxplot(timeW)
> hist(timeM); hist(timeW)
> summary(timeM)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.10 47.63 67.17 84.88 113.91 439.15