Summary – Toegepaste machine learning (5072TOML6Y)

Pages: 17
Uploaded: 06-09-2025
Written in: 2023/2024

Summary and lecture notes in one.

TML reading notes – Samirah Bakker

Week 1

What is Machine Learning?

Think of ML as a means of building models of data.

Categories of ML:
1. Supervised learning → involves modeling the relationship between measured features of the data and labels associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.
   a. Classification → labels are discrete categories
   b. Regression → labels are continuous quantities
2. Unsupervised learning → involves modeling the features of a dataset without reference to any label.
   a. Includes tasks such as clustering (identifying distinct groups of data) and dimensionality reduction (searching for more succinct representations of the data)
3. Semi-supervised learning → falls between supervised and unsupervised learning.
   a. Often useful when only incomplete labels are available


Classification: Predicting discrete labels

Example:
- feature 1, feature 2, etc.: normalized counts of important words or phrases ("Viagra", "extended warranty", etc.)
- label: "spam" or "not spam"
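The spam example above can be sketched in code. This is a hedged illustration: the toy word counts and the choice of a multinomial naive Bayes classifier are my assumptions, not something the notes specify.

```python
# Sketch (toy data): classification of spam from word-count features.
from sklearn.naive_bayes import MultinomialNB

# Rows: messages; columns: counts of "viagra", "warranty", "meeting".
X = [[3, 1, 0],   # spam-like
     [2, 2, 0],   # spam-like
     [0, 0, 2],   # normal
     [0, 1, 3]]   # normal
y = ["spam", "spam", "not spam", "not spam"]

model = MultinomialNB()
model.fit(X, y)

# Apply the learned labels to new, unknown messages.
print(model.predict([[4, 0, 0], [0, 0, 5]]))  # → ['spam' 'not spam']
```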

Regression: Predicting continuous labels

Example:
- feature 1, feature 2, etc.: brightness of each galaxy at one of several wavelengths or colors
- label: distance or redshift of the galaxy

Clustering: Inferring labels on unlabeled data
- One common case of unsupervised learning is "clustering," in which data is automatically assigned to some number of discrete groups.
- Clustering algorithms partition data into distinct groups of similar items.
- In a typical example it is clear by eye that the points fall into distinct groups; given such input, a clustering model uses the intrinsic structure of the data to determine which points are related. Using the fast and intuitive k-means algorithm, we can find these clusters.
- k-means fits a model consisting of k cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.

Dimensionality reduction: Inferring structure of unlabeled data
- Models that detect and identify lower-dimensional structure in higher-dimensional data.
- Labels or other information are inferred from the structure of the dataset itself.
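A minimal sketch of this idea, assuming PCA as the dimensionality-reduction model and synthetic 2-D data of my own construction lying near a line:

```python
# Sketch: PCA detects lower-dimensional structure (a 1-D line inside 2-D data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t]) + rng.normal(scale=0.05, size=(100, 2))  # points near y = 2x

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)      # project onto the dominant direction
print(X.shape, X_reduced.shape)       # → (100, 2) (100, 1)
print(pca.explained_variance_ratio_)  # close to 1.0: almost all variance is 1-D
```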

Introducing Scikit-Learn
→ A package that provides efficient versions of a large number of common algorithms.

The best way to think about data within Scikit-Learn is in terms of tables. A basic table is a
two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the
columns represent quantities related to each of these elements.

Basics of the API (Application Programming Interface)
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with the desired values.
3. Arrange data into a features matrix and target vector.
4. Fit the model to your data by calling the fit method of the model instance.
5. Apply the model to new data:
   a. For supervised learning, we often predict labels for unknown data using the predict method.
   b. For unsupervised learning, we often transform the data or infer its properties using the transform or predict method.
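The five steps above can be walked through with a small sketch. The model choice (LinearRegression) and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. choose a model class

model = LinearRegression(fit_intercept=True)       # 2. choose hyperparameters

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.1, size=50)
X = x[:, np.newaxis]                               # 3. features matrix (n_samples, n_features)
                                                   #    y is the target vector

model.fit(X, y)                                    # 4. fit the model to the data

X_new = np.array([[0.0], [5.0]])
print(model.predict(X_new))                        # 5. predict labels for new data (≈ [1, 11])
```

The same fit/predict pattern applies to every Scikit-Learn estimator, which is what makes the API easy to switch between models.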

Week 2

K-Means Clustering (Unsupervised learning)


Clustering
→ Is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in
such a way that points within a single cluster are very similar and points in different clusters are
different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to
each data point, indicating which cluster a particular point belongs to.

K-means clustering
→ It tries to find cluster centers that are representative of certain regions of the data. The algorithm
alternates between two steps: assigning each data point to the closest cluster center, and then setting
each cluster center as the mean of the data points that are assigned to it. The algorithm is finished
when the assignment of instances to clusters no longer changes.

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete
labeling of groups of points.

from sklearn.cluster import KMeans

The k-means algorithm searches for a predetermined number of clusters within an unlabeled
multidimensional dataset. It accomplishes this using a simple conception of what the optimal
clustering looks like:
- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
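Using the import above, a minimal sketch on synthetic data (the make_blobs dataset and its parameters are my illustrative choice):

```python
# Sketch: k-means on synthetic blob data with a predetermined number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # assign each point to one of the 4 clusters
print(kmeans.cluster_centers_.shape)  # → (4, 2): one center per cluster
```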

Those two assumptions are the basis of the k-means model.

The typical approach to k-means involves an intuitive iterative approach known as
expectation–maximization.

In short, the expectation–maximization approach here consists of the following procedure:
1. Guess some cluster centers.
2. Repeat until converged:
   a. E-step: assign points to the nearest cluster center.
   b. M-step: set the cluster centers to the mean of their assigned points.
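The E-M loop above can be written out directly in NumPy. This is an illustrative minimal implementation, not Scikit-Learn's actual code; the function name and the empty-cluster handling are my own assumptions.

```python
# Minimal expectation-maximization sketch of k-means in NumPy.
import numpy as np

def kmeans_em(X, k, rng_seed=0, max_iter=100):
    rng = np.random.default_rng(rng_seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. guess cluster centers
    for _ in range(max_iter):
        # E-step: assign each point to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: assignments stable
            break
        centers = new_centers
    return centers, labels

# Usage on two well-separated toy blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])
centers, labels = kmeans_em(X, k=2)
print(centers.round(1))
```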

Although the E–M procedure is guaranteed to improve the result at each step, there is no assurance that it will reach the globally best solution: with an unlucky random seed, the particular starting guesses can lead to poor results. For this reason, it is common to run the algorithm from multiple starting guesses, as Scikit-Learn does by default.

Another common challenge with k-means is that you must tell it how many clusters you expect: it
cannot learn the number of clusters from the data.

The fundamental model assumption of k-means (points are closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries. In particular, the boundaries between k-means clusters will always be linear, so the algorithm fails on more complicated boundaries.

Because each iteration of k-means must access every point in the dataset, the algorithm can be
relatively slow as the number of samples grows.

- The number of clusters must be selected beforehand.
- k-means is limited to linear cluster boundaries.
- k-means can be slow for large numbers of samples.

One of the drawbacks of k-means is that it relies on a random initialization, which means the outcome of the algorithm depends on a random seed. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations and returns the best result.
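A small sketch of this behavior, assuming scikit-learn's inertia_ attribute (the sum of squared distances of points to their assigned centers) as the quality measure for picking the best run:

```python
# Sketch: one random initialization vs. the best of 10.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

single = KMeans(n_clusters=5, n_init=1, random_state=3).fit(X)
multi = KMeans(n_clusters=5, n_init=10, random_state=3).fit(X)

# The best of 10 initializations is never worse than a single one.
print(single.inertia_ >= multi.inertia_)
```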

Agglomerative clustering

Agglomerative clustering → refers to a collection of clustering algorithms that all build upon the same
principles: the algorithm starts by declaring each point its own cluster, and then merges the two most
similar clusters until some stopping criterion is satisfied.

The stopping criterion implemented in scikit-learn is the number of clusters: similar clusters are merged until only the specified number of clusters remains. There are several linkage criteria that specify how exactly the “most similar cluster” is measured; this measure is always defined between two existing clusters.
The following three choices are implemented in scikit-learn:
