2022-11-30
Exercise 1.1
Inspecting Data:
library(haven)
Data1.1 <- read_dta("C:/Users/bassp/Downloads/hedonicprices2020-1.dta")
summary(Data1.1)
Using the ‘summary()’ function, we found that the data contain negative prices and
missing values.
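The same findings can be confirmed with direct counts instead of reading them off the summary table. The sketch below uses a small invented data frame (`toy`, a hypothetical stand-in for `Data1.1`); only the column names `price`, `size` and `pricesqm` are taken from the exercise.

```r
# Toy stand-in for Data1.1 (all values are invented for illustration)
toy <- data.frame(
  price    = c(250000, NA, -1000, 310000),
  size     = c(100, 80, NA, 124),
  pricesqm = c(2500, NA, NA, 2500)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of negative prices; na.rm = TRUE skips the missing prices
n_negative <- sum(toy$price < 0, na.rm = TRUE)

print(na_counts)
print(n_negative)
```

Counting per column this way also shows which variables are affected, which helps decide which filters are needed in the next step.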
Exclude missing and wrong observations:
library(stats)
Data1.2 <- Data1.1[complete.cases(Data1.1$price),]
Data1.2 <- Data1.2[complete.cases(Data1.2$size),]
Data1.2 <- Data1.2[complete.cases(Data1.2$pricesqm),]
Data1.2 <- Data1.2[Data1.2$price >= 0, ]
The function ‘complete.cases()’ returns a logical vector indicating which cases are complete,
i.e. have no missing values. We used this function to exclude observations with missing
values for price, house size or price per m2.
We used logical indexing to exclude negative (wrong) prices, so that all price values in
‘Data1.2’ are greater than or equal to zero.
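As a minimal sketch of how this filtering behaves, the example below applies ‘complete.cases()’ to the three columns at once, which keeps the same rows as filtering on each column in turn. The data frame `toy` and its values are invented for illustration and are not part of the exercise data.

```r
# Toy stand-in for Data1.1 (all values are invented for illustration)
toy <- data.frame(
  price    = c(250000, NA, -1000, 310000),
  size     = c(100, 80, 95, 124),
  pricesqm = c(2500, NA, NA, 2500)
)

# TRUE for rows that have no NA in any of the three columns
complete_rows <- complete.cases(toy[, c("price", "size", "pricesqm")])

# Keep the complete rows, then drop rows with a negative price
toy_clean <- toy[complete_rows, ]
toy_clean <- toy_clean[toy_clean$price >= 0, ]

print(complete_rows)
print(toy_clean)
```

Removing the missing values first matters here: a comparison such as `price >= 0` yields NA for a missing price, and indexing rows with NA produces rows filled with NA.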
Drop outliers:
lower_bound_pricesqm <- quantile(Data1.2$pricesqm, 0.025)
upper_bound_pricesqm <- quantile(Data1.2$pricesqm, 0.975)
To detect outliers in terms of prices, house size and price per m2, we used the
percentile method. With this method, all observations that lie outside the interval formed
by the 2.5th and 97.5th percentiles are considered potential outliers.
The values of the lower and upper percentiles (and thus the lower and upper limits of the
interval) can be computed with the ‘quantile()’ function (Soetewey, 2020).
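To illustrate the percentile method on something easy to check by hand, the sketch below applies it to the invented vector `1:100` rather than to the exercise data. It assumes the default settings of ‘quantile()’; the names `x`, `bounds`, `lower`, `upper` and `is_outlier` are hypothetical.

```r
# Invented values; 1:100 makes the percentiles easy to verify
x <- 1:100

# Both percentiles in a single call
bounds <- quantile(x, probs = c(0.025, 0.975))
lower <- bounds[[1]]
upper <- bounds[[2]]

# TRUE where a value lies outside the interval [lower, upper]
is_outlier <- x < lower | x > upper

# Keep only the values inside the interval
x_trimmed <- x[!is_outlier]

print(bounds)
print(sum(is_outlier))
```

With these cut-offs, roughly 5% of the observations are flagged regardless of how the data are distributed, so the flagged values are candidates for inspection rather than confirmed errors.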
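# Row indices of observations outside the 2.5-97.5 percentile interval
outlier_ind_pricesqm <- which(Data1.2$pricesqm < lower_bound_pricesqm |
                              Data1.2$pricesqm > upper_bound_pricesqm)
# setdiff() keeps all rows if no outliers are found; Data1.2[-integer(0), ] would drop every row
Data1.3 <- Data1.2[setdiff(seq_len(nrow(Data1.2)), outlier_ind_pricesqm), ]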