CORRECT ANSWERS
What to examine when assessing the bivariate relationship between a Continuous predictor variable
and a Continuous target variable? ANSW✅✅Scatter plots. Correlation between the two variables
[cor() in R].
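A minimal sketch of the scatter plot and correlation check described above. It uses simulated data and base R graphics rather than ggplot2; the data frame `df` and the column names `x` and `y` are hypothetical.

```r
# Simulated data: y depends linearly on x plus noise (illustrative only).
set.seed(1)
df <- data.frame(x = runif(200, 0, 10))
df$y <- 2 * df$x + rnorm(200)

# Scatter plot of the target against the predictor.
plot(df$x, df$y, xlab = "x (predictor)", ylab = "y (target)")

# Pearson correlation is the default method of cor().
r <- cor(df$x, df$y)

# Spearman rank correlation is less sensitive to outliers and to
# monotone but nonlinear relationships.
r_s <- cor(df$x, df$y, method = "spearman")
```

Comparing the two coefficients is a quick way to spot a relationship that is strong but not linear.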
What to examine when assessing (univariate analysis) a Continuous predictor variable?
ANSW✅✅Assess the histogram of the distribution. Check the skewness (does the variable need a log
transformation?).
- Check for extreme (unreasonable) outliers
- Check for obvious errors in data
- Check for obvious duplicates
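The univariate checks above could be sketched in base R as follows. The data are simulated and the column name `amount` is hypothetical; comparing mean and median is only a rough skewness indicator, not a formal test.

```r
# Simulated right-skewed variable with three duplicated rows appended.
set.seed(2)
df <- data.frame(amount = rlnorm(500, meanlog = 3, sdlog = 1))
df <- rbind(df, df[1:3, , drop = FALSE])

# Min and max help reveal unreasonable outliers or obvious data errors.
summary(df$amount)

# Histogram shows the shape of the distribution.
hist(df$amount, breaks = 30, main = "amount", xlab = "amount")

# A mean well above the median suggests right skew.
right_skewed <- mean(df$amount) > median(df$amount)

# Number of rows that exactly repeat an earlier row.
n_dup <- sum(duplicated(df))
```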
What to examine when assessing (univariate analysis) a Factor predictor variable?
ANSW✅✅Assess a bar chart (count of observations per factor level).
What data questions should be considered while reading the project statement? ANSW✅✅Is the
project statement more interested in interpretable models or more accurate complicated models?
What type of variable is the target variable?
What type of variable are the predictor variables?
Are there any outliers that need to be removed?
Are there any Factor variables that could be combined?
R-Code; Histogram Continuous Variable ANSW✅✅ggplot(df, aes(x = variable)) +
geom_histogram(bins = 30) +
labs(x = "variable")
R-Code; Bar chart for a factor variable ANSW✅✅ggplot(df, aes(x = variable)) +
geom_bar() +
labs(x = "variable")
What to examine when assessing the bivariate relationship between a Factor predictor variable and
a binary target variable? ANSW✅✅A table (with rows as factor levels) showing, for each level, the
mean of the target (the proportion of 1s), the count of observations, and the counts of
target = 0 and target = 1.
What to examine when assessing the bivariate relationship between a Continuous predictor variable
and a binary target variable? ANSW✅✅- A graph with separate histograms for a continuous
variable, one for those with target binary = 0 and one for those with binary = 1;
- Box plots summarized based on binary target;
- Tables summarizing the mean, median, and count of the predictor based on each binary target
What to examine when assessing the bivariate relationship between a Factor predictor variable and
a Continuous target variable? ANSW✅✅Box Plots and tables summarizing the mean, median, and
count of the target based on each factor
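A minimal base R sketch of the box plot and summary table described above, on simulated data. The names `group` and `target` are hypothetical, and a dplyr `group_by()` / `summarise()` pipeline would serve equally well.

```r
# Simulated data: three factor levels with different target means.
set.seed(3)
df <- data.frame(
  group  = factor(rep(c("A", "B", "C"), each = 50)),
  target = c(rnorm(50, 10), rnorm(50, 12), rnorm(50, 15))
)

# One box per factor level.
boxplot(target ~ group, data = df, xlab = "group", ylab = "target")

# Mean, median, and count of the target for each factor level.
by_group <- split(df$target, df$group)
tab <- data.frame(
  mean   = sapply(by_group, mean),
  median = sapply(by_group, median),
  n      = sapply(by_group, length)
)
tab
```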
R-Code; Table for binary target and factor variable ANSW✅✅data %>%
group_by(variable) %>%
summarise(
zeros = sum(Target == 0),
ones = sum(Target == 1),
n = n(),
proportion = mean(Target)
)
R-Code; Separate histograms for a continuous variable and a binary target ANSW✅✅ggplot(
data,
aes(
x = variable,
group = Target,
fill = as.factor(Target),
y = after_stat(density) # replaces the deprecated ..density.. (ggplot2 >= 3.3.0)
)
) +
geom_histogram(position = "dodge", bins = 30)
R-Code; Relevel Factor variables ANSW✅✅# Make the most frequent level the reference level
counts <- table(df$variable)
level.name <- names(which.max(counts))
df$variable <- relevel(df$variable, ref = level.name)
R-Code; Remove all observations in the data set where a variable is greater than or equal to 50
ANSW✅✅data <- data[data$variable < 50, ]
R-Code; Remove all observations of a factor variable = "value" ANSW✅✅toBeRemoved <-
which(data$factor == "value")
# Guard: data[-integer(0), ] returns zero rows when nothing matches
if (length(toBeRemoved) > 0) data <- data[-toBeRemoved, ]
R-Code; Combine factor levels into new factors. ANSW✅✅library(plyr) # provides mapvalues()
var.levels <- levels(df$variable)
df$occupation_comb <- mapvalues(df$variable, var.levels, c("Group12", ... , "GroupNA"))
R-Code; remove a variable from the dataframe. ANSW✅✅df$variable <- NULL
R-Code; Create training and testing sets. ANSW✅✅library(caret) # provides createDataPartition()
set.seed(n) # n is any fixed integer, for reproducibility
train_ind <- createDataPartition(df$Target, p = 0.7, list = FALSE)
data.train <- df[train_ind, ]
data.test <- df[-train_ind, ]
For what type of data should a log transformation be used? ANSW✅✅Right-skewed data (common with
variables of Time, Distance, or Money, which have a lower bound of 0)
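To illustrate the effect of a log transformation on right-skewed data, here is a small sketch on a simulated, strictly positive variable. The helper `skew()` is a hypothetical moment-based skewness function written for this example, since base R does not include one.

```r
# Simulated strictly positive, right-skewed variable (e.g. a cost).
set.seed(4)
cost <- rlnorm(1000, meanlog = 5, sdlog = 1)

# Simple moment-based sample skewness (hypothetical helper).
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(cost)       # clearly positive: long right tail
skew_log <- skew(log(cost))  # close to zero after the transformation

# log() requires values > 0; log1p(x) = log(1 + x) is a common
# alternative when the variable contains zeros.
```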
For what type of data is a Logit transformation used? ANSW✅✅A binary (boolean) Target variable; the logit, log(p / (1 - p)), is applied to the probability p that the target equals 1
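As a sketch of how the logit relates to a binary target, the example below fits a logistic regression on simulated data. The coefficients -0.5 and 1.2 are arbitrary values chosen for illustration. `qlogis()` and `plogis()` are the base R logit and inverse-logit functions.

```r
# Simulated binary target whose log-odds are linear in x.
set.seed(5)
x <- rnorm(500)
p_true <- plogis(-0.5 + 1.2 * x)  # inverse logit maps log-odds to (0, 1)
y <- rbinom(500, size = 1, prob = p_true)

# Logistic regression: binomial family with the logit link.
fit <- glm(y ~ x, family = binomial(link = "logit"))

# The linear predictor is on the log-odds scale.
eta <- predict(fit, type = "link")

# Fitted probabilities; qlogis(p_hat) recovers eta.
p_hat <- predict(fit, type = "response")
```

The model is linear in the log-odds, so coefficients are interpreted as changes in log-odds per unit change in the predictor.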
Define Principal Component Analysis ANSW✅✅An unsupervised learning technique that linearly
combines the original variables in a data set to create new, mutually orthogonal (uncorrelated)
principal components, ordered by the share of variance each explains. The components can be used to
reduce dimensionality and to explore the correlation structure among the original variables.
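A minimal sketch of PCA using base R's `prcomp()` on simulated data. The columns `a`, `b`, and `c` are hypothetical; `a` and `b` are built to be strongly correlated so that the first component captures most of their shared variance.

```r
# Simulated data: a and b share a common factor z, c is independent.
set.seed(6)
n <- 300
z <- rnorm(n)
X <- data.frame(
  a = z + rnorm(n, sd = 0.3),
  b = z + rnorm(n, sd = 0.3),
  c = rnorm(n)
)

# Center and scale so each variable contributes on the same footing.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

pca$rotation  # loadings: weights of the linear combinations
summary(pca)  # proportion of variance explained per component

# The component scores are mutually uncorrelated.
scores <- pca$x
score_cor <- cor(scores)
```

Scaling matters when the variables are measured in different units; without it, the variable with the largest variance tends to dominate the first component.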