Lecture 2: Data science pipeline
Frame problems → in the real world, we first need to define and frame the problem.
Collect data → in the real world, data may need to be collected using sensors, crowdsourcing, or mobile apps. There are also other sources for public datasets, such as Hugging Face, Zenodo, and Google Dataset Search.
Preprocess Data
Filtering → reduces a set of data based on specific criteria, e.g. the left table can be reduced to the right table using a population threshold. df[df["population"] > 500000].
Aggregation → reduces a set of data to a descriptive statistic, e.g. the left table is reduced to a single number by computing the mean value. df["population"].mean().
Grouping → divides a table into groups by column values, which can be chained with aggregation to produce descriptive statistics for each group. e.g. df.groupby("province").sum().
Sorting → rearranges data based on the values in a column, which can be useful for inspection. e.g. the right table is sorted by population. df.sort_values(by=["population"]). A combined sketch of these four operations follows below.
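A minimal sketch of the filtering, aggregation, grouping, and sorting calls above, on a small made-up dataframe (the column names and values are only for illustration):

    import pandas as pd

    # Toy table of cities; values are made up for illustration only.
    df = pd.DataFrame({
        "city": ["Amsterdam", "Rotterdam", "Utrecht", "Eindhoven"],
        "province": ["Noord-Holland", "Zuid-Holland", "Utrecht", "Noord-Brabant"],
        "population": [870000, 650000, 360000, 240000],
    })

    big = df[df["population"] > 500000]                           # filtering: keep rows above a threshold
    mean_pop = df["population"].mean()                            # aggregation: one descriptive statistic
    per_province = df.groupby("province")["population"].sum()     # grouping chained with aggregation
    ranked = df.sort_values(by=["population"], ascending=False)   # sorting for inspection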
Concatenation → combines multiple datasets that have the same variables, e.g. the two left tables can be concatenated into the right table. pandas.concat([df_A, df_B]).
Merging and joining → combine multiple data tables that share an overlapping set of instances, e.g. use "city" as the key to merge A and B. A.merge(B, how="inner/left/right/outer", on="city").
Quantization → transforms a continuous set of values (e.g. integers) into a discrete set (e.g. categories), e.g. age is quantized into age ranges. bin = [0, 20, 50, 200]. L = ["1-20", "21-50", "51+"]. pandas.cut(D["age"], bin, labels=L).
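A minimal sketch of concatenation, merging, and quantization; the dataframes and the "age" column are made-up placeholders:

    import pandas as pd

    df_A = pd.DataFrame({"city": ["Amsterdam", "Utrecht"], "population": [870000, 360000]})
    df_B = pd.DataFrame({"city": ["Rotterdam"], "population": [650000]})
    stacked = pd.concat([df_A, df_B])                         # same variables, stacked row-wise

    areas = pd.DataFrame({"city": ["Amsterdam", "Rotterdam"], "area_km2": [219, 324]})
    merged = stacked.merge(areas, how="inner", on="city")     # keep only cities present in both tables

    D = pd.DataFrame({"age": [5, 18, 34, 70]})
    bins = [0, 20, 50, 200]
    labels = ["1-20", "21-50", "51+"]
    D["age_range"] = pd.cut(D["age"], bins, labels=labels)    # continuous age -> discrete categories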
Scaling → transforms variables to have another distribution, which puts variables on the same scale and makes the data work better with many models. e.g. Z-score scaling → represents how many standard deviations a value lies from the mean. (df - df.mean()) / df.std(). e.g. min-max scaling → maps the value range to between 0 and 1. (df - df.min()) / (df.max() - df.min()).
Resampling → converts time series data to a different frequency using different aggregation methods, e.g. resample to hourly frequency using the mean. df.resample("60min", label="right").mean().
Rolling → transforms time series data using different aggregation methods over a moving window. e.g. df["new_column"] = df["column1"].rolling(window=3).sum().
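A minimal sketch of scaling and of resampling/rolling on a made-up time series (the index and values are placeholders):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {"column1": np.arange(12, dtype=float)},
        index=pd.date_range("2024-01-01", periods=12, freq="15min"),
    )

    z_scaled = (df - df.mean()) / df.std()                     # z-score scaling
    minmax = (df - df.min()) / (df.max() - df.min())           # min-max scaling to [0, 1]

    hourly = df.resample("60min", label="right").mean()        # hourly mean of 15-minute data
    df["rolling_sum"] = df["column1"].rolling(window=3).sum()  # moving-window aggregation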
Transformation → can be applied to rows or columns in a dataframe. e.g. df["wind_sine"] = np.sin(np.deg2rad(D["wind_deg"])).
Extract data → extract data from text or match text patterns with regular expressions → a language for specifying search patterns. e.g. df["year"] = df["venue"].str.extract(r'([0-9]{4})').
Drop → removes data we don't need, such as duplicate records or records that are irrelevant to our research question. e.g. df.drop(columns=["year"]); we can drop rows or columns.
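A minimal sketch of column transformation, regex extraction, and dropping; the "wind_deg" and "venue" columns are made-up examples:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "wind_deg": [0, 90, 180],
        "venue": ["CHI 2020", "CSCW 2021", "NeurIPS 2019"],
    })

    # Column-wise transformation: encode wind direction as a sine value.
    df["wind_sine"] = np.sin(np.deg2rad(df["wind_deg"]))

    # Regular expression: pull the first 4-digit run out of a text column (expand=False returns a Series).
    df["year"] = df["venue"].str.extract(r"([0-9]{4})", expand=False)

    # Drop a column (or rows) we no longer need; returns a new dataframe.
    df_clean = df.drop(columns=["year"])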
Replace missing values → with a constant, the mean, the median, or the most frequent value along the same column. e.g. constant imputation → -1; mean imputation.
Model missing values → fit y = F(X), where y is the variable/column that has the missing values, X denotes the other variables, and F is a regression function.
Different missing-data mechanisms may require different cleaning methods. MCAR → Missing Completely At Random: the missing data is a completely random subset of the entire dataset. MAR → Missing At Random: the missingness is related only to variables other than the one with missing data. MNAR → Missing Not At Random: the missingness is related to the variable that has the missing data itself.
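A minimal sketch of simple and model-based imputation, assuming scikit-learn is available; the column names and values are placeholders:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "temperature": [20.1, np.nan, 19.5, 22.3],
        "humidity": [0.61, 0.58, 0.66, 0.70],
    })

    # Simple imputation: replace missing values with a constant or a column statistic.
    const_filled = df.fillna(-1)
    mean_filled = df.fillna(df.mean())

    # Model-based imputation: y = F(X), predicting the column with missing values
    # from the other columns using a regression function F.
    known = df[df["temperature"].notna()]
    unknown = df[df["temperature"].isna()]
    F = LinearRegression().fit(known[["humidity"]], known["temperature"])
    df.loc[df["temperature"].isna(), "temperature"] = F.predict(unknown[["humidity"]])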
Explore data
Information visualization is a good way for both experts and lay people to explore data and gain insights. The Python seaborn library can be used to quickly plot and explore structured data, the Python plotly library to build interactive visualizations, and Voyant Tools to explore text data.
Model data
Techniques for modeling structured, text, and image data are covered in different modules. Image classification, e.g. optical character recognition → recognizing digits from hand-written images; fine-grained categorization → categorizing types of birds. Text classification, e.g. sentiment analysis → identifying emotions from movie reviews; categorizing the research aspect.
Deploy models
Deployed models can enable further quantitative or qualitative research and insights.
Data Science Fundamentals (Modeling)
Classification
To classify spam messages we need examples → a dataset with observations (messages) and labels (spam or ham). We can extract features (information) using human knowledge that helps distinguish spam from ham messages. Using the features x (which contain x1 and x2), we can represent each message as one data point in a p-dimensional space (p = 2 in this case). We can think of the model as a function f that separates the observations into groups (labels y) according to their features x = {x1, x2}: f(x) > 0 → spam, f(x) < 0 → ham. To find a good function f, we start from some initial f and train it until we are satisfied. We need something to tell us in which direction and with what magnitude to update it. First, we need an error metric, e.g. the sum of distances between the misclassified points and the line f: error = -y * f(x) for each misclassified point x = {x1, x2}. We can then use gradient descent to minimize this error and train the model f iteratively. Depending on the needs, we can train different models (using different loss functions) with decision boundaries of various shapes.
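A minimal numpy sketch of this idea, assuming a linear f(x) = w·x + b, labels y in {-1, +1}, and the misclassification error -y * f(x) described above (a perceptron-style update); the data is made up:

    import numpy as np

    # Toy 2-feature data: y = +1 (spam), y = -1 (ham).
    X = np.array([[3.0, 2.5], [2.5, 3.0], [0.5, 1.0], [1.0, 0.5]])
    y = np.array([1, 1, -1, -1])

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(100):
        for xi, yi in zip(X, y):
            f = w @ xi + b
            if yi * f <= 0:          # misclassified point: error = -y * f(x) >= 0
                # Gradient of -y * f(x) w.r.t. (w, b) is (-y * x, -y); step against it.
                w += lr * yi * xi
                b += lr * yi

    predictions = np.sign(X @ w + b)  # f(x) > 0 -> spam, f(x) < 0 -> ham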
To evaluate our classification model, we need to compute evaluation metrics that measure and quantify model performance. Accuracy = number of correctly classified points / number of all points; it only works well on a balanced dataset. For an unbalanced dataset → compute accuracy for each class separately. If we care more about the positive class: precision = TP / (TP + FP) → how many selected items are relevant; recall = TP / (TP + FN) → how many relevant items are selected; F-score = 2 * precision * recall / (precision + recall).
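A minimal sketch of these metrics using scikit-learn on made-up labels and predictions:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = positive class (e.g. spam), 0 = negative
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    print(accuracy_score(y_true, y_pred))    # fraction of correctly classified points
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall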
To choose between models, we need a test set, which contains data that the models have not seen during the training phase. To tune hyper-parameters or select features for a model, we use cross-validation to divide the dataset into folds and use each fold in turn for validation. Do not use the test set to tune hyper-parameters or select features, as that leads to information leakage. Training set → for training models. Validation set → for tuning hyper-parameters and/or selecting features.
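A minimal sketch of this split discipline with scikit-learn, assuming a held-out test set plus cross-validation on the training set for hyper-parameter tuning (the model and parameter grid are just examples):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Hold out a test set that is never used for tuning.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Cross-validation on the training set only, to pick a hyper-parameter.
    search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    # The test set is touched exactly once, at the very end.
    print(search.best_params_, search.score(X_test, y_test))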
One way to select features → recursively eliminate the less important ones, using metrics such as permutation importance → permute a feature several times and measure the decrease in model performance. If two highly correlated features exist, the model can still access the information through the non-permuted feature, so it may appear that both features are unimportant. A better way is to cluster the correlated features first.
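A minimal sketch of permutation importance with scikit-learn; the dataset and model are placeholders:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Permute each feature several times and measure the drop in validation score.
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
    print(result.importances_mean)   # one mean importance per feature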
For time-series data, it is better to do the split for cross-validation based on the order of the time intervals, which means we only use data from the past to predict the future, and not the other way around.
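A minimal sketch of a time-ordered split using scikit-learn's TimeSeriesSplit on made-up data:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(10).reshape(-1, 1)   # samples assumed to be in chronological order
    y = np.arange(10)

    # Each fold trains on an earlier block and validates on the block right after it.
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
        print("train:", train_idx, "validate:", val_idx)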
Regression
Regression fits a function that maps features x to a continuous variable y. Linear regression → fits a linear function f that maps x1 (e.g. the first feature vector of the data) to y, which best describes their linear relationship. We can create a feature matrix X that includes the intercept term β0 (via a column of ones), which gives us a compact form of the equation. We can then generalize linear regression to multiple predictors while keeping the compact mathematical representation: we use the vector and matrix forms to simplify the equations, and we can map these forms directly to the data. We can look at the feature matrix X from two directions: the columns represent the features and the rows represent the data points. Finally, we need an error metric between the estimated response ŷ and the true response y to know whether the model fits the data well. Usually, we assume that the error e is IID (independent and identically distributed) and follows a normal distribution with zero mean and some variance σ². To find the optimal coefficients, we minimize the error using gradient descent or by taking the derivative of the matrix form.
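A minimal numpy sketch of the matrix form, assuming the standard least-squares setup y = Xβ + e: stack a column of ones for the intercept and use the closed-form estimate obtained by setting the derivative of the squared error to zero (the data is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 10, size=50)
    y = 2.0 + 3.0 * x1 + rng.normal(0, 1, size=50)   # true intercept 2, slope 3, IID noise

    # Feature matrix X: first column of ones for the intercept term beta_0.
    X = np.column_stack([np.ones_like(x1), x1])

    # Closed-form least-squares estimate: beta_hat = (X^T X)^-1 X^T y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat
    print(beta_hat)   # should be close to [2, 3]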
We can model a non-linear relationship using a polynomial function with degree k. Using a model that is too complex can lead to overfitting, and one that is too simple can lead to underfitting; an overfitted model fits the training set well but generalizes poorly to unseen data.
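A minimal sketch of a degree-k polynomial fit, e.g. with numpy.polyfit on made-up data (k = 3 here is arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-3, 3, 40)
    y = np.sin(x) + rng.normal(0, 0.1, size=x.size)   # non-linear relationship plus noise

    k = 3
    coeffs = np.polyfit(x, y, deg=k)     # least-squares fit of a degree-k polynomial
    y_hat = np.polyval(coeffs, x)
    # A very large k would start to fit the noise (overfitting); k = 0 or 1 would underfit.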
To evaluate regression models, one common metric is the coefficient of determination (R-squared). For simple and multiple linear regression, R-squared equals the square of the Pearson correlation coefficient r between the true y and the estimate ŷ = f(X). R-squared increases as we add more predictors and is therefore not a good metric for model selection; the adjusted R-squared also takes the number of samples (n) and predictors (p) into account. A bad R-squared does not always mean there is no pattern in the data, a good R-squared does not always mean that the function fits the data well, and R-squared can be greatly affected by outliers.
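A minimal sketch of R-squared and the usual adjusted R-squared formula, assuming n samples and p predictors (the responses and estimates are made up):

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.1, 4.0, 5.2, 6.1, 6.9, 8.2])
    y_hat = np.array([3.0, 4.2, 5.0, 6.0, 7.1, 8.0])

    n, p = len(y_true), 1                          # number of samples and predictors
    r2 = r2_score(y_true, y_hat)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared
    print(r2, adj_r2)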
Lecture 3
Structured data generally means data that has standardized formats and well-defined structures. Mathematically speaking, we want to estimate a function f that maps the features X to a label y such that the prediction f(X) is as close to y as possible.
Decision trees
Decision trees have a non-linear decision boundary that iteratively partitions the feature space. For simplicity, assume all features are binary. If we could only ask one question, which question would we ask? We want to use the most useful feature, the one that gives us the most information to help us guess. How can we quantify which feature gives the most information?
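A minimal sketch of fitting a decision tree on binary features with scikit-learn; the feature names and labels are made up:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy binary features (e.g. has_link, has_money_word) and a binary label.
    X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
    y = [1, 1, 0, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    # The first split is the "one question" that is most useful for guessing the label.
    print(export_text(tree, feature_names=["has_link", "has_money_word"]))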