Summary - Applied Machine Learning (5072TOML6Y)

Pages: 17
Uploaded on: 06-09-2025
Written in: 2023/2024

Summary and lecture notes in one.


ML readings notes – Samirah Bakker

Week 1

What is Machine Learning?

Think of ML as a means of building models of data.

Categories of ML:
1. Supervised learning → involves somehow modeling the relationship between measured features of data and some labels associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.
   a. Classification → labels are discrete categories
   b. Regression → labels are continuous quantities
2. Unsupervised learning → involves modeling the features of a dataset without reference to any label.
   a. Includes tasks such as clustering (identifying distinct groups of data) and dimensionality reduction (searching for more succinct representations of the data)
3. Semi-supervised learning → falls between supervised and unsupervised learning.
   a. Often useful when only incomplete labels are available


Classification: Predicting discrete labels

Example:
- feature 1, feature 2, etc.: normalized counts of important words or phrases ("Viagra", "extended warranty", etc.)
- label: "spam" or "not spam"

Regression: Predicting continuous labels

Example:
-​ feature 1, feature 2, etc. brightness of each galaxy at one of several wavelengths or colors
-​ label distance or redshift of the galaxy
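The same idea sketched as code, with a single synthetic "brightness" feature and a linear model standing in for the real photometric-redshift pipeline (all data here is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: one brightness feature per galaxy,
# with redshift roughly proportional to it plus noise.
rng = np.random.RandomState(0)
brightness = rng.rand(50, 1)                                # features matrix, shape (50, 1)
redshift = 2.0 * brightness.ravel() + 0.1 * rng.randn(50)   # continuous labels

model = LinearRegression()
model.fit(brightness, redshift)

print(model.coef_)  # close to [2.0]
```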

Clustering: Inferring labels on unlabeled data
- One common case of unsupervised learning is "clustering," in which data is automatically assigned to some number of discrete groups.
- Clustering algorithms partition data into distinct groups of similar items.
- Given input data whose points visibly form distinct groups, a clustering model will use the intrinsic structure of the data to determine which points are related. Using the very fast and intuitive k-means algorithm, we find the clusters.
  - k-means fits a model consisting of k cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.

Dimensionality reduction: Inferring structure of unlabeled data
- Models that detect and identify lower-dimensional structure in higher-dimensional data.
- Labels or other information are inferred from the structure of the dataset itself.
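A small sketch of this with PCA, the most common dimensionality-reduction model (the 2-D-points-near-a-line dataset is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: 2-D points that lie close to a 1-D line,
# i.e. the data has lower-dimensional structure.
rng = np.random.RandomState(0)
t = rng.randn(100)
X = np.column_stack([t, 2 * t + 0.05 * rng.randn(100)])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)      # shape (100, 1)
print(pca.explained_variance_ratio_)  # close to [1.0]
```

Nearly all of the variance is captured by one component, which is exactly the lower-dimensional structure the model is meant to detect.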

Introducing Scikit-Learn
→ Package that provides efficient versions of large numbers of common algorithms.

The best way to think about data within Scikit-Learn is in terms of tables. A basic table is a
two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.

Basics of the API (Application Programming Interface)
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the fit method of the model instance.
5. Apply the model to new data:
   a. For supervised learning, we often predict labels for unknown data using the predict method.
   b. For unsupervised learning, we often transform or infer properties of the data using the transform or predict method.
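The five steps above can be sketched end to end with a simple linear regression on synthetic data (the task and numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # 1. choose a model class

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True)        # 2. choose hyperparameters
X = x[:, np.newaxis]                                # 3. features matrix, shape (n_samples, n_features)
model.fit(X, y)                                     # 4. fit the model to the data

x_new = np.linspace(0, 10, 5)
y_new = model.predict(x_new[:, np.newaxis])         # 5. predict labels for new data
print(y_new.round(1))
```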

Week 2

K-Means Clustering (Unsupervised learning)


Clustering
→ Is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in
such a way that points within a single cluster are very similar and points in different clusters are
different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to
each data point, indicating which cluster a particular point belongs to.

K-means clustering
→ It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes.

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete
labeling of groups of points.

from sklearn.cluster import KMeans
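Building on this import, a minimal usage sketch (the well-separated blob dataset is an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example: 300 points in 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.predict(X)          # cluster index assigned to each point
centers = kmeans.cluster_centers_   # shape (4, 2)
print(centers.shape)
```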

The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the k-means model.

The typical approach to k-means involves an intuitive iterative approach known as
expectation–maximization.

In short, the expectation–maximization approach here consists of the following procedure:
1. Guess some cluster centers.
2. Repeat until converged:
   a. E-step: assign points to the nearest cluster center.
   b. M-step: set the cluster centers to the mean of their assigned points.
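This E–M loop can be written directly in NumPy; the following is an illustrative sketch only (it ignores edge cases such as a cluster losing all of its points):

```python
import numpy as np

def kmeans_em(X, k, n_iters=100, seed=2):
    """Minimal k-means via expectation-maximization."""
    rng = np.random.RandomState(seed)
    # 1. Guess: pick k random data points as initial centers.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```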

Although the E–M procedure is guaranteed to improve the result in each step, there is no assurance
that it will lead to the global best solution. For example, if we use a different random seed in our
simple procedure, the particular starting guesses lead to poor results. For this reason, it is common for
the algorithm to be run for multiple starting guesses, as indeed Scikit-Learn does by default.

Another common challenge with k-means is that you must tell it how many clusters you expect: it
cannot learn the number of clusters from the data.

The fundamental model assumption of k-means (points will be closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries. In particular, the boundaries between k-means clusters will always be linear, which means that it will fail for more complicated boundaries.

Because each iteration of k-means must access every point in the dataset, the algorithm can be
relatively slow as the number of samples grows.

- The number of clusters must be selected beforehand.
- k-means is limited to linear cluster boundaries.
- k-means can be slow for large numbers of samples.
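The linear-boundary limitation can be demonstrated on interleaved half-moon data (the dataset choice and scoring metric are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly two clusters,
# but not separable by a linear boundary.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true half-moon grouping is poor, because
# k-means can only draw a straight line between its two clusters.
print(adjusted_rand_score(y_true, labels))  # well below 1.0
```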

One of the drawbacks of k-means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations, and returns the best result.

Agglomerative clustering

Agglomerative clustering → refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.

The stopping criterion implemented in scikit-learn is the number of clusters, so similar clusters are
merged until only the specified number of clusters are left. There are several linkage criteria that
specify how exactly the “most similar cluster” is measured. This measure is always defined between
two existing clusters.
The following three choices are implemented in scikit-learn:
- ward (the default): merges the two clusters whose merge leads to the smallest increase in within-cluster variance
- average: merges the two clusters with the smallest average distance between all their points
- complete: merges the two clusters with the smallest maximum distance between their points
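A minimal usage sketch of agglomerative clustering (the blob dataset and the ward linkage choice are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic example: 150 points in 3 well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=1)

# Start with every point as its own cluster and merge the two most
# similar clusters until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(len(set(labels)))  # 3
```

Note that AgglomerativeClustering has no predict method for new data; fit_predict assigns labels only to the points it was fit on.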
