Lecture 2: Data science pipeline
Frame problems → in the real world, we first need to define and frame the problem.
Collect data → in the real world, data may need to be collected using sensors, crowdsourcing, or mobile apps. There are also other sources for public datasets, such as Hugging Face, Zenodo, and Google Dataset Search.
Preprocess Data
Filtering → reduces a set of data based on specific criteria, e.g. the left table can be reduced to the right table using a population threshold. df[df["population"] > 500000].
Aggregation → reduces a set of data to a descriptive statistic, e.g. the left table is reduced to a single number by computing the mean value. df["population"].mean().
Grouping → divides a table into groups by column values, which can be chained with aggregation to produce descriptive statistics for each group. e.g. df.groupby("province").sum().
Sorting → rearranges data based on the values in a column, which can be useful for inspection. e.g. the right table is sorted by population. df.sort_values(by=["population"]). A combined sketch of these four operations follows below.
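A minimal sketch of the filtering, aggregation, grouping, and sorting calls above, on a small made-up dataframe (the column names and values are only for illustration):

    import pandas as pd

    # Toy table of cities; values are made up for illustration only.
    df = pd.DataFrame({
        "city": ["Amsterdam", "Rotterdam", "Utrecht", "Eindhoven"],
        "province": ["Noord-Holland", "Zuid-Holland", "Utrecht", "Noord-Brabant"],
        "population": [870000, 650000, 360000, 240000],
    })

    big = df[df["population"] > 500000]                           # filtering: keep rows above a threshold
    mean_pop = df["population"].mean()                            # aggregation: one descriptive statistic
    per_province = df.groupby("province")["population"].sum()     # grouping chained with aggregation
    ranked = df.sort_values(by=["population"], ascending=False)   # sorting for inspection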
Concatenation → combines multiple datasets that have the same variables, e.g. the two left tables can be concatenated into the right table. pandas.concat([df_A, df_B]).
Merging and joining → combine multiple data tables that share an overlapping set of instances, e.g. use "city" as the key to merge A and B. A.merge(B, how="inner/left/right/outer", on="city").
Quantization → transforms a continuous set of values (e.g. integers) into a discrete set (e.g. categories), e.g. age is quantized into age ranges. bin = [0, 20, 50, 200]. L = ["1-20", "21-50", "51+"]. pandas.cut(D["age"], bin, labels=L).
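A minimal sketch of concatenation, merging, and quantization; the dataframes and the "age" column are made-up placeholders:

    import pandas as pd

    df_A = pd.DataFrame({"city": ["Amsterdam", "Utrecht"], "population": [870000, 360000]})
    df_B = pd.DataFrame({"city": ["Rotterdam"], "population": [650000]})
    stacked = pd.concat([df_A, df_B])                         # same variables, stacked row-wise

    areas = pd.DataFrame({"city": ["Amsterdam", "Rotterdam"], "area_km2": [219, 324]})
    merged = stacked.merge(areas, how="inner", on="city")     # keep only cities present in both tables

    D = pd.DataFrame({"age": [5, 18, 34, 70]})
    bins = [0, 20, 50, 200]
    labels = ["1-20", "21-50", "51+"]
    D["age_range"] = pd.cut(D["age"], bins, labels=labels)    # continuous age -> discrete categories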
Scaling → transforms variables to have another distribution, which puts variables on the same scale and makes the data work better with many models. e.g. Z-score scaling → represents how many standard deviations a value lies from the mean. (df - df.mean()) / df.std(). e.g. min-max scaling → maps the value range to between 0 and 1. (df - df.min()) / (df.max() - df.min()).
Resampling → converts time series data to a different frequency using different aggregation methods, e.g. resample to hourly frequency using the mean. df.resample("60min", label="right").mean().
Rolling → transforms time series data using different aggregation methods over a moving window. e.g. df["new_column"] = df["column1"].rolling(window=3).sum().
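A minimal sketch of scaling and of resampling/rolling on a made-up time series (the index and values are placeholders):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {"column1": np.arange(12, dtype=float)},
        index=pd.date_range("2024-01-01", periods=12, freq="15min"),
    )

    z_scaled = (df - df.mean()) / df.std()                     # z-score scaling
    minmax = (df - df.min()) / (df.max() - df.min())           # min-max scaling to [0, 1]

    hourly = df.resample("60min", label="right").mean()        # hourly mean of 15-minute data
    df["rolling_sum"] = df["column1"].rolling(window=3).sum()  # moving-window aggregation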
Transformation → can be applied to rows or columns in a dataframe. e.g. df["wind_sine"] = np.sin(np.deg2rad(D["wind_deg"])).
Extract data → extract data from text or match text patterns with regular expressions → a language for specifying search patterns. e.g. df["year"] = df["venue"].str.extract(r'([0-9]{4})').
Drop → removes data we don't need, such as duplicate records or records that are irrelevant to our research question. e.g. df.drop(columns=["year"]); we can drop rows or columns.
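A minimal sketch of column transformation, regex extraction, and dropping; the "wind_deg" and "venue" columns are made-up examples:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "wind_deg": [0, 90, 180],
        "venue": ["CHI 2020", "CSCW 2021", "NeurIPS 2019"],
    })

    # Column-wise transformation: encode wind direction as a sine value.
    df["wind_sine"] = np.sin(np.deg2rad(df["wind_deg"]))

    # Regular expression: pull the first 4-digit run out of a text column (expand=False returns a Series).
    df["year"] = df["venue"].str.extract(r"([0-9]{4})", expand=False)

    # Drop a column (or rows) we no longer need; returns a new dataframe.
    df_clean = df.drop(columns=["year"])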
Replace missing values → with a constant, the mean, the median, or the most frequent value along the same column. e.g. constant imputation → -1; mean imputation.
Model missing values → fit y = F(X), where y is the variable/column that has the missing values, X denotes the other variables, and F is a regression function.
Different missing-data mechanisms may require different cleaning methods. MCAR → Missing Completely At Random: the missing data is a completely random subset of the entire dataset. MAR → Missing At Random: the missingness is related only to variables other than the one with missing data. MNAR → Missing Not At Random: the missingness is related to the variable that has the missing data itself.
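A minimal sketch of simple and model-based imputation, assuming scikit-learn is available; the column names and values are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "temperature": [20.1, np.nan, 19.5, 22.3],
        "humidity": [0.61, 0.58, 0.66, 0.70],
    })

    # Simple imputation: replace missing values with a constant or a column statistic.
    const_filled = df.fillna(-1)
    mean_filled = df.fillna(df.mean())

    # Model-based imputation: y = F(X), predicting the column with missing values
    # from the other columns using a regression function F.
    known = df[df["temperature"].notna()]
    unknown = df[df["temperature"].isna()]
    F = LinearRegression().fit(known[["humidity"]], known["temperature"])
    df.loc[df["temperature"].isna(), "temperature"] = F.predict(unknown[["humidity"]])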
Explore data
Information visualization is a good way for both experts and lay people to explore data and gain insights. The Python seaborn library can be used to quickly plot and explore structured data, the Python plotly library to build interactive visualizations, and Voyant Tools to explore text data.
Model data
Techniques for modeling structured, text, and image data are covered in different modules. Image classification, e.g. optical character recognition → recognizing digits from hand-written images; fine-grained categorization → categorizing types of birds. Text classification, e.g. sentiment analysis → identifying emotions from movie reviews; categorizing the research aspect.
Deploy models
Deployed models can enable further quantitative or qualitative research and insights.
Data Science Fundamentals (Modeling)
Classification
To classify spam messages we need examples → a dataset with observations (messages) and labels (spam or ham). We can extract features (information) using human knowledge that helps distinguish spam from ham messages. Using the features x (which contain x1 and x2), we can represent each message as one data point in a p-dimensional space (p = 2 in this case). We can think of the model as a function f that separates the observations into groups (labels y) according to their features x = {x1, x2}: f(x) > 0 → spam, f(x) < 0 → ham. To find a good function f, we start from some initial f and train it until we are satisfied. We need something to tell us in which direction and with what magnitude to update it. First, we need an error metric, e.g. the sum of distances between the misclassified points and the line f: error = -y * f(x) for each misclassified point x = {x1, x2}. We can then use gradient descent to minimize this error and train the model f iteratively. Depending on the needs, we can train different models (using different loss functions) with decision boundaries of various shapes.
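A minimal numpy sketch of this idea, assuming a linear f(x) = w·x + b, labels y in {-1, +1}, and the misclassification error -y * f(x) described above (a perceptron-style update); the data is made up:

    import numpy as np

    # Toy 2-feature data: y = +1 (spam), y = -1 (ham).
    X = np.array([[3.0, 2.5], [2.5, 3.0], [0.5, 1.0], [1.0, 0.5]])
    y = np.array([1, 1, -1, -1])

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(100):
        for xi, yi in zip(X, y):
            f = w @ xi + b
            if yi * f <= 0:          # misclassified point: error = -y * f(x) >= 0
                # Gradient of -y * f(x) w.r.t. (w, b) is (-y * x, -y); step against it.
                w += lr * yi * xi
                b += lr * yi

    predictions = np.sign(X @ w + b)  # f(x) > 0 -> spam, f(x) < 0 -> ham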
To evaluate our classification model, we need to compute evaluation metrics that measure and quantify model performance. Accuracy = number of correctly classified points / number of all points; it only works well on a balanced dataset. For an unbalanced dataset → compute accuracy for each class separately. If we care more about the positive class: precision = TP / (TP + FP) → how many selected items are relevant; recall = TP / (TP + FN) → how many relevant items are selected; F-score = 2 * precision * recall / (precision + recall).
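A minimal sketch of these metrics using scikit-learn on made-up labels and predictions:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = positive class (e.g. spam), 0 = negative
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    print(accuracy_score(y_true, y_pred))    # fraction of correctly classified points
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall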
To choose between models, we need a test set, which contains data that the models have not seen during the training phase. To tune hyper-parameters or select features for a model, we use cross-validation to divide the dataset into folds and use each fold in turn for validation. Do not use the test set to tune hyper-parameters or select features, as that leads to information leakage. Training set → for training models. Validation set → for tuning hyper-parameters and/or selecting features.
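A minimal sketch of this split discipline with scikit-learn, assuming a held-out test set plus cross-validation on the training set for hyper-parameter tuning (the model and parameter grid are just examples):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Hold out a test set that is never used for tuning.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Cross-validation on the training set only, to pick a hyper-parameter.
    search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    # The test set is touched exactly once, at the very end.
    print(search.best_params_, search.score(X_test, y_test))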
One way to select features → recursively eliminate the less important ones, using metrics such as permutation importance → permute a feature several times and measure the decrease in model performance. If two highly correlated features exist, the model can still access the information through the non-permuted feature, so it may appear that both features are unimportant. A better way is to cluster the correlated features first.
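A minimal sketch of permutation importance with scikit-learn; the dataset and model are placeholders:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Permute each feature several times and measure the drop in validation score.
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
    print(result.importances_mean)   # one mean importance per feature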
For time-series data, it is better to do the split for cross-validation based on the order of the time intervals, which means we only use data from the past to predict the future, and not the other way around.
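A minimal sketch of a time-ordered split using scikit-learn's TimeSeriesSplit on made-up data:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(10).reshape(-1, 1)   # samples assumed to be in chronological order
    y = np.arange(10)

    # Each fold trains on an earlier block and validates on the block right after it.
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
        print("train:", train_idx, "validate:", val_idx)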
Regression
Regression fits a function that maps features x to a continuous variable y. Linear regression → fits a linear function f that maps x1 (e.g. the first feature vector of the data) to y, which best describes their linear relationship. We can create a feature matrix X that includes the intercept term β0 (via a column of ones), which gives us a compact form of the equation. We can then generalize linear regression to multiple predictors while keeping the compact mathematical representation: we use the vector and matrix forms to simplify the equations, and we can map these forms directly to the data. We can look at the feature matrix X from two directions: the columns represent the features and the rows represent the data points. Finally, we need an error metric between the estimated response ŷ and the true response y to know whether the model fits the data well. Usually, we assume that the error e is IID (independent and identically distributed) and follows a normal distribution with zero mean and some variance σ². To find the optimal coefficients, we minimize the error using gradient descent or by taking the derivative of the matrix form.
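A minimal numpy sketch of the matrix form, assuming the standard least-squares setup y = Xβ + e: stack a column of ones for the intercept and use the closed-form estimate obtained by setting the derivative of the squared error to zero (the data is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 10, size=50)
    y = 2.0 + 3.0 * x1 + rng.normal(0, 1, size=50)   # true intercept 2, slope 3, IID noise

    # Feature matrix X: first column of ones for the intercept term beta_0.
    X = np.column_stack([np.ones_like(x1), x1])

    # Closed-form least-squares estimate: beta_hat = (X^T X)^-1 X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat
    print(beta_hat)   # should be close to [2, 3]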
We can model a non-linear relationship using a polynomial function with degree k. Using a model that is too complex can lead to overfitting, and one that is too simple can lead to underfitting; an overfitted model fits the training set well but generalizes poorly to unseen data.
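A minimal sketch of a degree-k polynomial fit, e.g. with numpy.polyfit on made-up data (k = 3 here is arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-3, 3, 40)
    y = np.sin(x) + rng.normal(0, 0.1, size=x.size)   # non-linear relationship plus noise

    k = 3
    coeffs = np.polyfit(x, y, deg=k)     # least-squares fit of a degree-k polynomial
    y_hat = np.polyval(coeffs, x)
    # A very large k would start to fit the noise (overfitting); k = 0 or 1 would underfit.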
To evaluate regression models, one common metric is the coefficient of determination (R-squared). For simple and multiple linear regression, R-squared equals the square of the Pearson correlation coefficient r between the true y and the estimate ŷ = f(X). R-squared increases as we add more predictors and is therefore not a good metric for model selection; the adjusted R-squared also takes the number of samples (n) and predictors (p) into account. A bad R-squared does not always mean there is no pattern in the data, a good R-squared does not always mean that the function fits the data well, and R-squared can be greatly affected by outliers.
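A minimal sketch of R-squared and the usual adjusted R-squared formula, assuming n samples and p predictors (the responses and estimates are made up):

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.1, 4.0, 5.2, 6.1, 6.9, 8.2])
    y_hat = np.array([3.0, 4.2, 5.0, 6.0, 7.1, 8.0])

    n, p = len(y_true), 1                          # number of samples and predictors
    r2 = r2_score(y_true, y_hat)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared
    print(r2, adj_r2)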
Lecture 3
Structured data generally means data that has standardized formats and well-defined structures. Mathematically speaking, we want to estimate a function f that maps the features X to a label y such that the prediction f(X) is as close to y as possible.
Decision trees
Decision trees have a non-linear decision boundary that iteratively partitions the feature space. For simplicity, assume all features are binary. If we could only ask one question, which question would we ask? We want to use the most useful feature, the one that gives us the most information to help us guess. How can we quantify which feature gives the most information?
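A minimal sketch of fitting a decision tree on binary features with scikit-learn; the feature names and labels are made up:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy binary features (e.g. has_link, has_money_word) and a binary label.
    X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
    y = [1, 1, 0, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    # The first split is the "one question" that is most useful for guessing the label.
    print(export_text(tree, feature_names=["has_link", "has_money_word"]))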