Module 1: The Data Science Shift
Overview:
This module introduces the foundation of the data-driven decision-making process and
highlights how data science is transforming business analysis. It outlines the complete data
science workflow, from understanding the business problem to communicating results
effectively.
1. Understanding the Business Problem
Before diving into data, it's crucial to define a clear and actionable business question. For
example, "What makes for a bad car purchase?" This ensures data efforts are aligned with a
strategic decision.
2. Data Wrangling
Data wrangling involves preparing raw data for analysis—handling missing values, cleaning
inconsistencies, and structuring it logically. For instance:
```r
carvana.data = read.csv("training.csv", na.strings=c("NULL"))
```
This code snippet treats "NULL" as a missing value (NA) during import, which is vital for accurate
analysis.
Common functions used:
• summary() – Provides descriptive statistics.
• dim() – Reveals dataset dimensions (rows × columns).
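A minimal sketch of these in action, assuming the carvana.data frame imported above:
```r
dim(carvana.data)      # number of rows and columns
summary(carvana.data)  # descriptive statistics, including NA counts per column
```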
3. Visualization
Visualizations help explore patterns, trends, and anomalies quickly. They're not just about
aesthetics—they're tools for interactive discovery and challenging assumptions.
Data visualizations are essential from data wrangling to communicating results. Misleading
graphs or poor visual design can obscure key insights or lead to incorrect conclusions.
Examples:
• Time series plots to track pricing trends.
• Boxplots to highlight outliers in mileage or price.
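The boxplot idea can be sketched in base R as follows. VehOdo and IsBadBuy are assumed column names (the mileage field and bad-buy indicator in the Carvana training data), so adjust to the actual schema:
```r
# Boxplot of odometer readings split by purchase outcome,
# making mileage outliers in each group easy to spot
boxplot(VehOdo ~ IsBadBuy, data = carvana.data,
        xlab = "Bad buy (0 = no, 1 = yes)",
        ylab = "Odometer reading (miles)")
```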
4. Generating Hypotheses
Turn broad business questions into testable hypotheses. For example:
“What makes a bad buy?” → “Vehicles with more than 120,000 miles and fewer than 3 prior
owners are more likely to be returned within 30 days.”
This transition is crucial—it narrows the focus and sets the stage for measurable insights and
predictive modeling.
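A first measurable cut at such a hypothesis is a simple group comparison. A sketch, assuming carvana.data has a 0/1 IsBadBuy indicator and a VehOdo mileage column (illustrative names):
```r
# Compare the bad-buy rate for vehicles above vs. below 120,000 miles
high.mileage = carvana.data$VehOdo > 120000
tapply(carvana.data$IsBadBuy, high.mileage, mean)
```
A clear gap between the two rates would support the mileage part of the hypothesis; the prior-owner and return-window conditions would need their own fields.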
5. Analysis
This stage involves statistical tests, model building, and deeper diagnostics to validate (or refute)
hypotheses. While not deeply covered in Module 1, it's where later modules will pick up.
6. Communicating Results
You must translate technical results into business impact—executives need actionable insights,
not code. Good communication bridges the gap between analysts and decision-makers.
Key Takeaway:
Skipping early steps like data wrangling or visual exploration can lead to flawed analysis and
poor decisions.
A structured, hypothesis-driven approach improves both the reliability and impact of business
decisions.
Module 2: Data Wrangling — Cleaning, Merging, and Preparing Data
Overview:
In this module, we enter the essential, gritty phase of data science—data wrangling. Raw data is
often messy, incomplete, or inconsistent, and cleaning it well is a prerequisite for any reliable
analysis. We also explore how to combine multiple datasets using joins and assess missingness,
all while maintaining reproducibility and clarity in our workflow.
1. Importance of a Querying Language
Writing your analysis as reproducible code serves three purposes:
• Reproduce results over time – Ensure analysis is consistent when revisited.
• Offer clarity on the process – Make it easier for others to follow your logic.
• Share and communicate insights – Documented queries enable transparency and
collaboration.
Languages like R or Python are not just tools—they’re essential for encoding your data logic
clearly and precisely.
2. Missing Data and Imputation
Missing values (NA) are common in real-world data. If we ignore them, we risk biased results.
Here's how mean imputation can be handled with a loop in R:
```r
# Columns with missing values to impute
impute.cols = c("NUMBER_OF_BORROWERS", "DEBT_TO_INCOME_RATIO",
                "BORROWER_CREDIT_SCORE", "MORTGAGE_INSURANCE_PERCENTAGE",
                "CO_BORROWER_CREDIT_SCORE", "MSA_POPULATION")

# Replace each NA with the column mean, computed over non-missing values
for (i in impute.cols){
  fannie.data[,i] = ifelse(is.na(fannie.data[,i]),
                           mean(fannie.data[,i], na.rm=TRUE),
                           fannie.data[,i])
}
```
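After imputation, it's worth verifying that no NAs remain in the affected columns. A quick check, assuming fannie.data and impute.cols as defined above:
```r
# Count remaining NAs per imputed column; all entries should be 0
colSums(is.na(fannie.data[, impute.cols]))
```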