Summary - Applied Machine Learning (5072TOML6Y)

Pages: 17
Uploaded on: 06-09-2025
Written in: 2023/2024

Summary and lecture notes in one.


ML readings notes – Samirah Bakker

Week 1

What is Machine Learning?

Think of ML as a means of building models of data.

Categories of ML:
1. Supervised learning → involves somehow modeling the relationship between measured features of data and some labels associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.
   a. Classification → labels are discrete categories
   b. Regression → labels are continuous quantities
2. Unsupervised learning → involves modeling the features of a dataset without reference to any label.
   a. Includes tasks such as clustering (identifying distinct groups of data) and dimensionality reduction (searching for more succinct representations of the data)
3. Semi-supervised learning → falls between supervised and unsupervised learning.
   a. Often useful when only incomplete labels are available


Classification: Predicting discrete labels

Example:
- feature 1, feature 2, etc.: normalized counts of important words or phrases ("Viagra", "extended warranty", etc.)
- label: "spam" or "not spam"

Regression: Predicting continuous labels

Example:
-​ feature 1, feature 2, etc. brightness of each galaxy at one of several wavelengths or colors
-​ label distance or redshift of the galaxy
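The same idea sketched as code, with a single synthetic "brightness" feature and a linear model standing in for the real photometric-redshift pipeline (all data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: one brightness feature per galaxy,
# with redshift roughly proportional to it plus noise.
rng = np.random.RandomState(0)
brightness = rng.rand(50, 1)                                # features matrix, shape (50, 1)
redshift = 2.0 * brightness.ravel() + 0.1 * rng.randn(50)   # continuous labels

model = LinearRegression()
model.fit(brightness, redshift)

print(model.coef_)  # close to [2.0]
```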

Clustering: Inferring labels on unlabeled data
- One common case of unsupervised learning is "clustering," in which data is automatically assigned to some number of discrete groups.
- Clustering algorithms partition data into distinct groups of similar items.
- Given input data whose points visibly form distinct groups, a clustering model will use the intrinsic structure of the data to determine which points are related. Using the very fast and intuitive k-means algorithm, we find the clusters.
  - k-means fits a model consisting of k cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.

Dimensionality reduction: Inferring structure of unlabeled data
- Models that detect and identify lower-dimensional structure in higher-dimensional data.
- Labels or other information are inferred from the structure of the dataset itself.
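A small sketch of this with PCA, the most common dimensionality-reduction model (the 2-D-points-near-a-line dataset is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: 2-D points that lie close to a 1-D line,
# i.e. the data has lower-dimensional structure.
rng = np.random.RandomState(0)
t = rng.randn(100)
X = np.column_stack([t, 2 * t + 0.05 * rng.randn(100)])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)      # shape (100, 1)
print(pca.explained_variance_ratio_)  # close to [1.0]
```

Nearly all of the variance is captured by one component, which is exactly the lower-dimensional structure the model is meant to detect.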

Introducing Scikit-Learn
→ Package that provides efficient versions of large numbers of common algorithms.

The best way to think about data within Scikit-Learn is in terms of tables. A basic table is a
two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.

Basics of the API (Application Programming Interface)
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the fit method of the model instance.
5. Apply the model to new data:
   a. For supervised learning, we often predict labels for unknown data using the predict method.
   b. For unsupervised learning, we often transform or infer properties of the data using the transform or predict method.
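The five steps above can be sketched end to end with a simple linear regression on synthetic data (the task and numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # 1. choose a model class

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters
X = x[:, np.newaxis]                                # 3. features matrix, shape (n_samples, n_features)
model.fit(X, y)                                     # 4. fit the model to the data

x_new = np.linspace(0, 10, 5)
y_new = model.predict(x_new[:, np.newaxis])         # 5. predict labels for new data
print(y_new.round(1))
```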

Week 2

K-Means Clustering (Unsupervised learning)


Clustering
→ Is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in
such a way that points within a single cluster are very similar and points in different clusters are
different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to
each data point, indicating which cluster a particular point belongs to.

K-means clustering
→ It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes.

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete
labeling of groups of points.

from sklearn.cluster import KMeans
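Building on this import, a minimal usage sketch (the well-separated blob dataset is an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example: 300 points in 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.predict(X)          # cluster index assigned to each point
centers = kmeans.cluster_centers_   # shape (4, 2)
print(centers.shape)
```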

The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k-means model.

The typical approach to k-means involves an intuitive iterative approach known as
expectation–maximization.

In short, the expectation–maximization approach here consists of the following procedure:
1. Guess some cluster centers.
2. Repeat until converged:
   a. E-step: assign points to the nearest cluster center.
   b. M-step: set the cluster centers to the mean of their assigned points.
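This E–M loop can be written directly in NumPy; the following is an illustrative sketch only (it ignores edge cases such as a cluster losing all of its points):

```python
import numpy as np

def kmeans_em(X, k, n_iters=100, seed=2):
    """Minimal k-means via expectation-maximization."""
    rng = np.random.RandomState(seed)
    # 1. Guess: pick k random data points as initial centers.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```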

Although the E–M procedure is guaranteed to improve the result in each step, there is no assurance
that it will lead to the global best solution. For example, if we use a different random seed in our
simple procedure, the particular starting guesses lead to poor results. For this reason, it is common for
the algorithm to be run for multiple starting guesses, as indeed Scikit-Learn does by default.

Another common challenge with k-means is that you must tell it how many clusters you expect: it
cannot learn the number of clusters from the data.

The fundamental model assumption of k-means (points will be closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries. In particular, the boundaries between k-means clusters will always be linear, which means that it will fail for more complicated boundaries.

Because each iteration of k-means must access every point in the dataset, the algorithm can be
relatively slow as the number of samples grows.

- The number of clusters must be selected beforehand.
- k-means is limited to linear cluster boundaries.
- k-means can be slow for large numbers of samples.
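The linear-boundary limitation can be demonstrated on interleaved half-moon data (the dataset choice and scoring metric are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly two clusters,
# but not separable by a linear boundary.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true half-moon grouping is poor, because
# k-means can only draw a straight line between its two clusters.
print(adjusted_rand_score(y_true, labels))  # well below 1.0
```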

One of the drawbacks of k-means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations, and returns the best result.

Agglomerative clustering

Agglomerative clustering → refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.

The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are
merged until only the specified number of clusters are left. There are several linkage criteria that
specify how exactly the “most similar cluster” is measured. This measure is always defined between
two existing clusters.
The following three choices are implemented in scikit-learn:
- ward (the default): merges the two clusters whose merge leads to the smallest increase in within-cluster variance
- average: merges the two clusters with the smallest average distance between all their points
- complete: merges the two clusters with the smallest maximum distance between their points
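A minimal usage sketch of agglomerative clustering (the blob dataset and the ward linkage choice are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic example: 150 points in 3 well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

# Start with every point as its own cluster and merge the two most
# similar clusters until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(len(set(labels)))  # 3
```

Note that AgglomerativeClustering has no predict method for new data; fit_predict assigns labels only to the points it was fit on.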
