Summary: Data Science
Intro
1. OSEMN process
Obtain, Scrub, Explore, Model, iNterpret: a common step-by-step workflow for data science projects.
2. Machine Learning
An approach to achieving artificial intelligence through systems that can learn from experience to find patterns in a set of data.
It relies on teaching a computer to recognize patterns by example, rather than programming it with specific rules.
A way to make predictions
o Takes in data
o Learns patterns from said data
o Classifies new data it has not seen before
2.1 Types of ML
Supervised
o Training data is labeled
o System knows expected output label
Unsupervised
o Training data is unlabeled
o We don’t know the output
2.2 Methods of ML
2.2.1 Regression
The variable we wish to predict (the dependent variable) is of a continuous nature.
The value of a given entry is determined based on known cases
Supervised
2.2.2 Classification
The variable we wish to predict is of a categorical nature.
The label of a given entry is determined based on known labelled cases.
Supervised
2.2.3 Clustering
Detect clusters of observations in our dataset.
A given entry is assigned to a group based on the structure of the entire dataset
Unsupervised
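A minimal sketch of this supervised/unsupervised split, assuming scikit-learn and its bundled iris toy dataset (the model choices here are illustrative, not prescribed by these notes): the same feature matrix can feed a supervised classifier, which trains on labels, or an unsupervised clusterer, which never sees them.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the known labels y guide the training.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: only X is used; the algorithm finds groups on its own.
km = KMeans(n_clusters=3, n_init=10).fit(X)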
3. Notebook
n_obs
o Number of observations or data points that you want to generate.
x = np.linspace(-3, 3, n_obs)
o Generates n_obs points evenly spaced between -3 and 3.
X = x[:, np.newaxis]
o Reshapes x into a 2D array of shape (n_obs, 1), since scikit-learn estimators expect a 2D feature matrix.
y = x + x * np.random.normal(2, 0.5, n_obs)
o Generates the y values: each point is x plus x times a random draw.
o The noise comes from a normal distribution with a mean of 2 and a standard deviation of 0.5.
o Adds variability to the relationship between x and y.
Create a LinearRegression object to implement this algorithm
o regressor = LinearRegression()
Run the ordinary least squares (OLS) algorithm to fit the model to our data
o regressor.fit(X, y)
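Assembled into one runnable sketch (assuming numpy and scikit-learn; n_obs = 100 is an illustrative choice, not fixed by the notes):

import numpy as np
from sklearn.linear_model import LinearRegression

n_obs = 100                                  # number of data points to generate
x = np.linspace(-3, 3, n_obs)                # n_obs points evenly spaced in [-3, 3]
X = x[:, np.newaxis]                         # reshape to 2D: (n_samples, n_features)
y = x + x * np.random.normal(2, 0.5, n_obs)  # y values with random variability

regressor = LinearRegression()               # create the model object
regressor.fit(X, y)                          # fit via ordinary least squares (OLS)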
Linear Regression
1. Regression
In a regression problem we try to understand the behavior (read: analyse / predict) of a certain continuous variable (the dependent variable) by studying the influence another variable (the independent variable) has on it.
We want to predict Y based on X
o Does “hours studied” affect the variable “exam grade”?
o Does “age” affect “income”?
o Does “muscle mass” affect “time to run a marathon”?
o Does “advertising budget” affect “products sold”?
2. Linear Regression
The simplest form of regression
A linear model → a straight line through the data
The higher X, the higher (or lower) Y
“line of best fit”
3. Linear relation = linear function
Mathematical function
o 𝑓(𝑥) = 𝑎𝑥 + 𝑏
o 𝑦ᵢ = 𝛽₀ + 𝛽₁𝑥ᵢ
Beta 0 (𝛽₀) is the intercept
o Where the function crosses the Y-axis
o Value of Y when X = 0
Beta 1 (𝛽₁) is the slope
o Positive Beta 1 → the function rises
o Negative Beta 1 → the function falls
o The amount Y increases by with each unit increase of X
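A minimal sketch of reading 𝛽₀ and 𝛽₁ off a fitted model (scikit-learn assumed; the toy data is chosen so the true line is y = 2x + 1):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # illustrative x values
y = np.array([1.0, 3.0, 5.0, 7.0])          # exactly y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.intercept_)  # β0 ≈ 1.0: value of Y when X = 0
print(model.coef_[0])    # β1 ≈ 2.0: increase in Y per unit increase in X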
4. Multiple Linear Regression
Same as linear regression, but with multiple factors
o Ex: “Income” is affected by “seniority” and “years of education”
“Plane of best fit”
What happens with our function?
Our intercept remains
A new “slope” is created for each parameter
o 𝑓(𝑥) = 𝑎𝑥₁ + 𝑏𝑥₂ + 𝑐𝑥₃ + 𝑑𝑥₄ + … + 𝑒
o 𝑦ᵢ = 𝛽₀ + 𝛽₁𝑥₁ + 𝛽₂𝑥₂ + 𝛽₃𝑥₃ + 𝛽₄𝑥₄ + …
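A sketch with two made-up factors (the data and coefficients below are invented for illustration; the two columns could stand for "seniority" and "years of education"):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                        # two independent variables
y = 5 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.5, 50)  # known plane plus noise

model = LinearRegression().fit(X, y)
print(model.intercept_)  # β0, close to 5
print(model.coef_)       # [β1, β2], close to [2, 3]

One intercept remains; coef_ holds one slope per factor, matching the formula above.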
5. Model training
We split our data into a training set and a test set, as shown in the sketch below.
5.1 Train – Test split
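A minimal sketch of such a split with scikit-learn (the 80/20 split and the random_state are common illustrative choices, not fixed by the notes):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy feature matrix
y = 2 * X.ravel() + 1              # toy target values

# Hold out 20% of the data to evaluate the model on unseen points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)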