Exam (worked answers)

DATA MINING EXAM REVISION QUESTIONS WITH CORRECT ANSWERS

Pages
16
Grade
A+
Uploaded on
26-07-2025
Written in
2024/2025

Institution
DATA MINING
Course
DATA MINING

Contains
Questions and answers

Clustering: Why use CURE? - Answer-Because clustering large datasets is expensive: CURE clusters only a sample and represents each cluster by a few scattered representative points

Clustering: Does CURE require Euclidean distance? - Answer-Yes, the step that
moves representative points towards the centroid happens in Euclidean space

Clustering: Describe BFR - Answer-1) Load in the first batch
2) Run clustering algorithm to find clusters
3) For each cluster, keep track only of its size, sum of vectors, and sum of squared
vectors. These are used to compute mean and variance
4) Load the next batch
5) For each point, find the nearest cluster using Mahalanobis distance
6) If it's close enough, add the point to that cluster & update the cluster statistics.
Else put this point in the leftover set
7) Load the next batch & the leftover set
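The per-cluster statistics in steps 3)-6) can be sketched in Python; a minimal illustration (the class and method names are my own, not from any library), using a diagonal-covariance Mahalanobis distance consistent with BFR's feature-independence assumption:

```python
import numpy as np

class ClusterSummary:
    """Per-cluster BFR statistics: count, sum, and sum of squares."""

    def __init__(self, dim):
        self.n = 0                 # number of points in the cluster
        self.s = np.zeros(dim)     # sum of vectors
        self.sq = np.zeros(dim)    # per-feature sum of squared values

    def add(self, x):
        self.n += 1
        self.s += x
        self.sq += x ** 2

    def mean(self):
        return self.s / self.n

    def variance(self):
        # E[x^2] - E[x]^2 per feature (features assumed independent)
        return self.sq / self.n - self.mean() ** 2

    def mahalanobis(self, x, eps=1e-12):
        # distance normalized by per-feature variance (diagonal covariance)
        return np.sqrt(np.sum((x - self.mean()) ** 2 / (self.variance() + eps)))
```

In step 6), a point would be added to the nearest cluster only if its Mahalanobis distance falls below some threshold (e.g. a few standard deviations).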

Clustering: Does BFR require Euclidean distance? - Answer-Yes, it uses centroids

Clustering: What does BFR assume? - Answer-1) Data is a Gaussian mixture
2) Features are independent (covariance = variance)

Clustering: What are the advantages of hierarchical clustering? - Answer-1) Like
K-means, but we can quickly adjust K by merging clusters rather than having to recluster
2) Can merge circular clusters into non-circular clusters

Clustering: What is Ward distance? - Answer-The sum of squares of the merged
cluster minus the total sum of squares of each cluster before merging, i.e. the
increase in within-cluster SSE caused by the merge
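A small numeric check of this definition (a sketch; the function names are my own):

```python
import numpy as np

def sse(points):
    # within-cluster sum of squared distances to the centroid
    pts = np.asarray(points, dtype=float)
    return float(np.sum((pts - pts.mean(axis=0)) ** 2))

def ward_distance(a, b):
    # the increase in total within-cluster SSE caused by merging a and b
    return sse(list(a) + list(b)) - sse(a) - sse(b)
```

Merging the tight pair {(0,0), (2,0)} with the singleton {(10,0)} gives a large Ward distance, reflecting how much the merge inflates the SSE.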

Clustering: Does hierarchical clustering require Euclidean distance? - Answer-For
single, complete, and average linkage, no. For Ward distance, yes (due to the sum of squares)

Anomaly detection: What's the problem with sliding window? - Answer-The dataset
grows significantly (each point appears in many overlapping windows).

Anomaly detection: Why is it not a good idea to do cross validation on time series
data? - Answer-It's slow: to predict a test point you must train the model on the
series leading up to that point, then retrain it for every other test point elsewhere in
the series. (You can't just train on one period and predict something way into the
future.)

Anomaly detection: Why is temporal correlation a problem? - Answer-As long as a
datapoint is correlated with the previous point, a trend can emerge even though the
data is randomly generated (e.g. each point = previous point + noise)

Anomaly detection: How to remove temporal correlation? - Answer-Take the
difference between consecutive datapoints
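A quick demonstration of differencing in Python (a sketch; the autocorrelation helper is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=1000))   # each point = previous point + noise
diff = np.diff(walk)                      # differences between consecutive points

def lag1_autocorr(x):
    # correlation between the series and itself shifted by one step
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))
```

The raw walk shows strong lag-1 autocorrelation (near 1); after differencing it is near 0, so spurious trends disappear.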

Anomaly detection: How to detect contextual anomaly? - Answer-Take a sliding
window over the data, for each window compute a profile of statistics, then you have
an idea of what a 'normal' window/context looks like.
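One way to sketch the window profiles in Python (the function name and choice of statistics are my own assumptions; requires NumPy 1.20+ for `sliding_window_view`):

```python
import numpy as np

def window_profiles(series, width):
    # one row of summary statistics (the "profile") per sliding window
    s = np.asarray(series, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(s, width)
    return np.column_stack([windows.mean(axis=1),
                            windows.std(axis=1),
                            windows.min(axis=1),
                            windows.max(axis=1)])
```

Windows whose profile lies far from the bulk of profiles (e.g. by distance to the mean profile) are flagged as contextual anomalies.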

Anomaly detection: What does OSVM optimize? - Answer-Maximize outlier space,
minimize inlier space

Anomaly detection: How to use classifier for novelty detection? - Answer-Assume
training data are all inliers (+ class). Then populate feature space with outliers (-
class), then train the classifier to find the boundary of the inliers

Anomaly detection: Describe isolation forest - Answer-Pick a random dimension to
split by, then split at a random value; repeat until each datapoint is in its own leaf.
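The idea can be sketched with a single isolation tree in plain Python (the function name is my own; real isolation forests average depths over many trees and cap the depth):

```python
import random

def isolation_depth(points, target, rng):
    # depth at which `target` is isolated by recursive random splits
    pts = list(points)
    depth = 0
    while len(pts) > 1:
        dim = rng.randrange(len(target))
        vals = [p[dim] for p in pts]
        lo, hi = min(vals), max(vals)
        if lo == hi:                      # cannot split identical values
            break
        cut = rng.uniform(lo, hi)
        # keep only the side of the split that contains the target
        pts = [p for p in pts if (p[dim] < cut) == (target[dim] < cut)]
        depth += 1
    return depth
```

Averaged over several random trees, a far outlier is isolated in fewer splits than a point inside a dense cluster.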

Anomaly detection: Do inliers have lower or higher score according to isolation
forest? - Answer-Higher (in terms of path length/depth): they are in denser regions, so
harder to isolate.

Distance: What is the formula for KL(P||Q)? - Answer-Sum P(x) log(P(x)/Q(x))

Distance: What does KL(P||Q) measure? - Answer-The expected number of extra bits
required to encode samples from P using a code optimized for Q.
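The formula above, computed directly (a sketch; base-2 log to match the bits interpretation, with zero-probability terms of P skipped by convention):

```python
import math

def kl_divergence(p, q):
    # KL(P||Q) = sum_x P(x) * log2(P(x) / Q(x)), in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: kl_divergence(p, q) generally differs from kl_divergence(q, p), and the sum blows up wherever Q(x) = 0 but P(x) > 0.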

Distance: What is the computational advantage of JS divergence compared to KL
divergence? - Answer-JS divergence stays finite when P(x) or Q(x) is 0 (KL(P||Q) is
infinite wherever Q(x) = 0 but P(x) > 0)

Distance: How to rescale a time series using the Z transform? - Answer-Compute the mean and
std over the whole series, then rescale each point c to (c - mean)/std
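As code (a minimal sketch; the function name is my own):

```python
import numpy as np

def z_normalize(series):
    # rescale the series to mean 0 and standard deviation 1
    s = np.asarray(series, dtype=float)
    return (s - s.mean()) / s.std()
```

This puts series with different offsets and amplitudes on a common scale before comparing their shapes.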

Distance: How to build the DTW table and use it to compute DTW distance? -
Answer-Let x run down the columns and y along the rows; A_{i,j} is the
distance (absolute or squared) between x_i and y_j. The DTW distance is then the cost of the
shortest path from top left to bottom right, moving only right, down, or diagonally
(never up or left)
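The table construction above as a Python sketch (the function name is my own; squared pointwise cost assumed):

```python
import numpy as np

def dtw_distance(x, y):
    # A[i, j] = cost of the best warping path aligning x[:i] with y[:j];
    # allowed moves are right, down, or diagonal (never up or left)
    n, m = len(x), len(y)
    A = np.full((n + 1, m + 1), np.inf)
    A[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            A[i, j] = cost + min(A[i - 1, j], A[i, j - 1], A[i - 1, j - 1])
    return A[n, m]
```

Unlike Euclidean distance, DTW tolerates local stretching: [1, 2, 3] and [1, 2, 2, 3] have DTW distance 0.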

Dimensionality Reduction: What are the assumptions of PCA - Answer-1)
Relationship between features is linear
2) Directions of greatest variance are most informative

Dimensionality Reduction: Does PCA remove correlation? - Answer-No, PCA
removes linear correlation, not correlation in general (e.g. if 2 features relate by
some quadratic, PCA won't decouple them)
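A sketch of this point using PCA computed with NumPy's SVD (the data and variable names are my own): the projected components come out linearly uncorrelated even though the inputs were strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
# second feature is a (noisy) linear function of the first
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                 # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                           # project onto principal components

cov = np.cov(Z.T)                       # off-diagonal entries are ~0
```

If the second feature were x1**2 instead, the projected features would still be linearly uncorrelated yet remain dependent, which is exactly the limitation noted above.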

Dimensionality Reduction: What are some advantages of PCA - Answer-Fast,
interpretable, removes linear correlation

Dimensionality Reduction: What are some disadvantages of PCA - Answer-Cannot
capture nonlinear correlation, minimizes residual in L2 norm (implicitly uses
Euclidean distance)

Dimensionality Reduction: What are some disadvantages of manifold learning? -
Answer-1) It assumes that data lies on some lower-dimensional manifold
2) Manifold learning algorithms usually preserve local structure, but not global
structure

Dimensionality reduction: What is the crowding problem? - Answer-There isn't
enough area in embedding space to preserve the volume occupied by a
neighborhood of points in the original space. So points further apart in the original
space will become closer in the embedded space, creating crowded regions &
reducing separability.

Dimensionality reduction: How does tSNE improve on SNE? - Answer-1) Easier-to-optimize
loss function: KL divergence between the original and embedded distributions,
optimized using gradient descent with momentum
2) Reduces the crowding problem: the heavy tail of the t-distribution keeps distant
points farther apart

Dimensionality reduction: What are the advantages of tSNE - Answer-1) Captures
nonlinear correlation
2) Preserves local and global structure

Dimensionality reduction: What are the disadvantages of tSNE - Answer-1) Does not
preserve distance
2) Lots of parameters to tune
3) O(N^2) (slow)

Dimensionality reduction: What are the assumptions of UMAP? - Answer-1)
Assumes data is uniformly distributed on some manifold.
2) Assumes the manifold is connected
3) Reduction by finding a lower dimensional graph with similar topological structure

Dimensionality reduction: What are the advantages of UMAP - Answer-1) Faster
than tSNE
2) Can adjust how well local vs global structure is preserved

Dimensionality reduction: What are the disadvantages of UMAP - Answer-1) Need to
tune a lot of parameters
2) Does not preserve distance between points & size of clusters

Clustering: Why does k means require Euclidean distance? - Answer-1) It minimizes
residual measured with L2 norm (sum of square error).
2) Centroid is defined in Euclidean space

Clustering: How to approximate k means with cosine distance? - Answer-Normalize
all datapoints so they all lie on the unit sphere
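A sketch of the normalization (the function name is my own); for unit vectors, squared Euclidean distance is monotone in cosine distance, since ||u - v||^2 = 2 - 2*cos(u, v):

```python
import numpy as np

def normalize_rows(X):
    # project each datapoint onto the unit sphere
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```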

Clustering: Describe DBScan - Answer-1) Classify points as core, border, or noise
using a density threshold (number of neighbours within epsilon)
2) Connect core points within distance epsilon
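These steps, plus the cluster-growing that follows from connecting core points, can be sketched in Python; a minimal O(n^2) illustration with pairwise distances, not any library's implementation:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbours]      # density threshold

    labels = [-1] * n                                     # -1 = noise / unassigned
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        frontier = [i]
        while frontier:                                   # grow via core points
            j = frontier.pop()
            for k in neighbours[j]:
                if labels[k] == -1:
                    labels[k] = cluster                   # border points get a
                    if core[k]:                           # label but don't expand
                        frontier.append(k)
        cluster += 1
    return labels
```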