Summary – Toegepaste machine learning (5072TOML6Y)

Pages: 17
Uploaded: 06-09-2025
Written in: 2023/2024

Summary and lecture notes in one.

TML reading notes – Samirah Bakker

Week 1

What is Machine Learning?

Think of ML as a means of building models of data.

Categories of ML:
1. Supervised learning → involves modeling the relationship between measured features of the data and labels associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.
   a. Classification → labels are discrete categories
   b. Regression → labels are continuous quantities
2. Unsupervised learning → involves modeling the features of a dataset without reference to any label.
   a. Includes tasks such as clustering (identifying distinct groups of data) and dimensionality reduction (searching for more succinct representations of the data)
3. Semi-supervised learning → falls between supervised and unsupervised learning.
   a. Often useful when only incomplete labels are available


Classification: Predicting discrete labels

Example:
- feature 1, feature 2, etc.: normalized counts of important words or phrases ("Viagra", "extended warranty", etc.)
- label: "spam" or "not spam"
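The spam example above can be sketched in code. This is a hedged illustration: the toy word counts and the choice of a multinomial naive Bayes classifier are my assumptions, not something the notes specify.

```python
# Sketch (toy data): classification of spam from word-count features.
from sklearn.naive_bayes import MultinomialNB

# Rows: messages; columns: counts of "viagra", "warranty", "meeting".
X = [[3, 1, 0],   # spam-like
     [2, 2, 0],   # spam-like
     [0, 0, 2],   # normal
     [0, 1, 3]]   # normal
y = ["spam", "spam", "not spam", "not spam"]

model = MultinomialNB()
model.fit(X, y)

# Apply the learned labels to new, unknown messages.
print(model.predict([[4, 0, 0], [0, 0, 5]]))  # → ['spam' 'not spam']
```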

Regression: Predicting continuous labels

Example:
- feature 1, feature 2, etc.: brightness of each galaxy at one of several wavelengths or colors
- label: distance or redshift of the galaxy

Clustering: Inferring labels on unlabeled data
- One common case of unsupervised learning is "clustering," in which data is automatically assigned to some number of discrete groups.
- Clustering algorithms partition data into distinct groups of similar items.
- In a typical example it is clear by eye that the points fall into distinct groups; given such input, a clustering model uses the intrinsic structure of the data to determine which points are related. Using the fast and intuitive k-means algorithm, we can find these clusters.
- k-means fits a model consisting of k cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.

Dimensionality reduction: Inferring structure of unlabeled data
- Models that detect and identify lower-dimensional structure in higher-dimensional data.
- Labels or other information are inferred from the structure of the dataset itself.
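A minimal sketch of this idea, assuming PCA as the dimensionality-reduction model and synthetic 2-D data of my own construction lying near a line:

```python
# Sketch: PCA detects lower-dimensional structure (a 1-D line inside 2-D data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(100, 2))  # points near y = 2x

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)      # project onto the dominant direction
print(X.shape, X_reduced.shape)       # → (100, 2) (100, 1)
print(pca.explained_variance_ratio_)  # close to 1.0: almost all variance is 1-D
```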

Introducing Scikit-Learn
→ A package that provides efficient versions of a large number of common algorithms.

The best way to think about data within Scikit-Learn is in terms of tables. A basic table is a
two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.

Basics of the API (Application Programming Interface)
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with the desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the fit method of the model instance.
5. Apply the model to new data:
   a. For supervised learning, we often predict labels for unknown data using the predict method.
   b. For unsupervised learning, we often transform the data or infer its properties using the transform or predict method.
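The five steps above can be walked through with a small sketch. The model choice (LinearRegression) and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a model class

model = LinearRegression(fit_intercept=True)       # 2. choose hyperparameters

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.1, size=50)
X = x[:, np.newaxis]                               # 3. features matrix (n_samples, n_features)
                                                   #    y is the target vector

model.fit(X, y)                                    # 4. fit the model to the data

X_new = np.array([[0.0], [5.0]])
print(model.predict(X_new))                        # 5. predict labels for new data (≈ [1, 11])
```

The same fit/predict pattern applies to every Scikit-Learn estimator, which is what makes the API easy to switch between models.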

Week 2

K-Means Clustering (Unsupervised learning)


Clustering
→ Is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in
such a way that points within a single cluster are very similar and points in different clusters are
different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to
each data point, indicating which cluster a particular point belongs to.

K-means clustering
→ It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes.

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete
labeling of groups of points.

from sklearn.cluster import KMeans

The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
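Using the import above, a minimal sketch on synthetic data (the make_blobs dataset and its parameters are my illustrative choice):

```python
# Sketch: k-means on synthetic blob data with a predetermined number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # assign each point to one of the 4 clusters
print(kmeans.cluster_centers_.shape)  # → (4, 2): one center per cluster
```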

Those two assumptions are the basis of the k-means model.

The typical approach to k-means involves an intuitive iterative approach known as
expectation–maximization.

In short, the expectation–maximization approach here consists of the following procedure:
1. Guess some cluster centers.
2. Repeat until converged:
   a. E-step: assign points to the nearest cluster center.
   b. M-step: set the cluster centers to the mean of their assigned points.
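The E-M loop above can be written out directly in NumPy. This is an illustrative minimal implementation, not Scikit-Learn's actual code; the function name and the empty-cluster handling are my own assumptions.

```python
# Minimal expectation-maximization sketch of k-means in NumPy.
import numpy as np

def kmeans_em(X, k, rng_seed=0, max_iter=100):
    rng = np.random.default_rng(rng_seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. guess cluster centers
    for _ in range(max_iter):
        # E-step: assign each point to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: assignments stable
            break
        centers = new_centers
    return centers, labels

# Usage on two well-separated toy blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])
centers, labels = kmeans_em(X, k=2)
print(centers.round(1))
```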

Although the E–M procedure is guaranteed to improve the result at each step, there is no assurance that it will reach the globally best solution: with an unlucky random seed, the particular starting guesses can lead to poor results. For this reason, it is common to run the algorithm from multiple starting guesses, as Scikit-Learn does by default.

Another common challenge with k-means is that you must tell it how many clusters you expect: it
cannot learn the number of clusters from the data.

The fundamental model assumption of k-means (points are closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries. In particular, the boundaries between k-means clusters will always be linear, so the algorithm fails on more complicated boundaries.

Because each iteration of k-means must access every point in the dataset, the algorithm can be
relatively slow as the number of samples grows.

- The number of clusters must be selected beforehand.
- k-means is limited to linear cluster boundaries.
- k-means can be slow for large numbers of samples.

One of the drawbacks of k-means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations and returns the best result.
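A small sketch of this behavior, assuming scikit-learn's inertia_ attribute (the sum of squared distances of points to their assigned centers) as the quality measure for picking the best run:

```python
# Sketch: one random initialization vs. the best of 10.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

single = KMeans(n_clusters=5, n_init=1, random_state=3).fit(X)
multi = KMeans(n_clusters=5, n_init=10, random_state=3).fit(X)

# The best of 10 initializations is never worse than a single one.
print(single.inertia_ >= multi.inertia_)
```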

Agglomerative clustering

Agglomerative clustering → refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.

The stopping criterion implemented in scikit-learn is the number of clusters: similar clusters are merged until only the specified number of clusters remains. There are several linkage criteria that specify how exactly the “most similar cluster” is measured; this measure is always defined between two existing clusters.
The following three choices are implemented in scikit-learn:
