Exam (elaborations)

Elaboration of data mining exam questions

Pages: 16
Grade: 8-9
Uploaded on: 21-12-2023
Written in: 2022/2023

Elaboration of students' older exam questions with feedback.

Date Download: 19/06/2023




Data Mining Exam Questions
Link: https://docs.google.com/document/d/1P2Za3RewqiRAVlJkEUFZPb1T3H82_3w9AeuhwVeFvKQ/edit?fbclid=IwAR1lmzkov2kXUnQ-HWm6LWXP_qW7kWKnwpWPOgeXxbevJvL9Xo0QAFhmqJA

Other questions can be found here: https://wiki.vtk.be/Data_Mining_(H02C6A)
There is no real reason to create an account: the questions there are mostly the same and have no more detail than the ones here.
→ Just be sure to add new questions/exams to the wiki for continuity reasons :)

If a question is answered and confirmed to be correct, mark it green.

If a question is answered but not confirmed to be correct, mark it yellow.

If a question is open and has no answer yet, mark it red.

A fixed formula sheet is provided during the exam. It can also be found on Toledo (it does not contain all formulas though).


2022 July
1. Logistic regression weight update
2. PCY exercise
3. Calculate movie recommendations for a user with a latent factor model -> WE NEED AN EXAMPLE
4. 5 small questions testing your insights
5. Anomaly detection: You are given a series of graphs, one for each day (x-axis: time, y-axis: number of visitors on a website).
a. Is there anything unusual about the data? (For a specific day in the fall, the number of visitors was double at midnight.)
b. If there is anything unusual about the data, is this an anomaly or normal but unusual behaviour? (It was an anomaly due to the switch from daylight saving time to standard time, if I remember correctly.)
6. 5 small questions testing your insights.
a. One was about active learning
7. BIRCH vs CURE: Given a set of points, show how BIRCH (only ellipsoids) and CURE (can take more complex shapes) would cluster these points (2 clusters)
8. Google created a model in 2008 to predict flu outbreaks by looking at Google searches. The model was fairly accurate up until 2013; afterwards it started overestimating flu cases. Why? I think it might have to do with the rise of social media: many articles about potential flu outbreaks cause people to search more about the flu, which causes the model to overestimate. Correlation != causation.
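
Since question 3 asks for an example: below is a minimal sketch of how a latent factor model turns factor vectors into a predicted rating and a recommendation. It assumes the plain version without bias terms, where the predicted rating is the dot product of the user vector and the movie vector. All names and numbers (`user_factors`, `movie_factors`, the users and movies) are made up for illustration and are not from the exam.

```python
# Hypothetical latent factors with k = 2 dimensions (made-up numbers).
user_factors = {
    "alice": [1.0, 0.5],
    "bob": [0.2, 1.5],
}
movie_factors = {
    "matrix": [4.0, 1.0],
    "titanic": [1.0, 3.0],
}


def predict_rating(user, movie):
    """Predicted rating = dot product of the user and movie factor vectors."""
    p = user_factors[user]
    q = movie_factors[movie]
    return sum(pu * qi for pu, qi in zip(p, q))


def recommend(user, seen=()):
    """Return the not-yet-seen movie with the highest predicted rating."""
    candidates = [m for m in movie_factors if m not in seen]
    return max(candidates, key=lambda m: predict_rating(user, m))
```

With these made-up factors, `predict_rating("alice", "matrix")` is 1.0*4.0 + 0.5*1.0 = 4.5, so "matrix" is the recommendation for alice. If the course version adds global, user, and movie biases, those would simply be added to the dot product.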

2022 June
1. Logistic regression weight update

2. MaxMiner exercise
3. Bi-level projection exercise (I think?)
4. K-means vs GMM (same as 2022 Jan)
5. 5 small questions testing your insights
6. kNN for anomalies (not sure)
7. A table of vaccination rates for different age groups. What are 2 potential problems with this data? Something about Simpson's paradox



2022 Jan
1. Logistic regression, but with gradient descent. Does this mean we also have to flip the objective function (multiply L by -1)? Yes: gradient descent minimizes, so you minimize the negative log-likelihood -L instead of maximizing L.
ChatGPT: Logistic regression can be trained using various optimization algorithms, and gradient descent is one of them. Gradient descent is a common way to find the optimal parameters for logistic regression, but it is not the only option.

Logistic regression aims to model the probability of a binary outcome based on a set
of input features. The model applies a logistic (sigmoid) function to a linear
combination of the features to map the continuous input space to a probability
between 0 and 1. The parameters of the logistic regression model are estimated to
maximize the likelihood of the observed data.

Gradient descent is an iterative optimization algorithm that adjusts the model parameters in the direction of steepest descent of the loss function. In logistic regression, the loss function is typically the negative log-likelihood (maximizing the log-likelihood is the same as minimizing its negative). By taking steps proportional to the negative gradient of the loss function, gradient descent iteratively updates the parameters until they converge to the optimal solution.
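
To make the weight update concrete, here is a minimal sketch of one batch gradient-descent step on the negative log-likelihood. The function names, the bias-in-`w[0]` convention, and the learning rate `lr` are my own assumptions, not the notation of the course formula sheet. The gradient of the negative log-likelihood with respect to w_j is the sum over examples of (p_i - y_i) * x_ij, where p_i is the sigmoid output.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(w, X, y, lr):
    """One batch gradient-descent step on the negative log-likelihood.

    w  : list of weights, w[0] is the bias (assumed convention)
    X  : list of feature lists (without the constant 1 for the bias)
    y  : list of 0/1 labels
    lr : learning rate (step size)
    """
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        features = [1.0] + list(xi)  # prepend constant 1 for the bias
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, features)))
        for j, xj in enumerate(features):
            grad[j] += (p - yi) * xj  # d(-log L)/d w_j for this example
    # Move against the gradient because we minimize the loss.
    return [wj - lr * gj for wj, gj in zip(w, grad)]
```

Gradient ascent on the log-likelihood gives exactly the same update written as w_j + lr * sum((y_i - p_i) * x_ij), which is why flipping the sign of the objective does not change the resulting weights.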

2. Bi-level projection of the sequence DB: 10:<c(ad)a>, 20:<d(ac)da>, 30:<c(cd)a(ac)>
What is this? → look at the last lines of sequence mining
3. MaxMiner algorithm


4. K means vs GMM




Both with 2 clusters -> where would X1 and X2 be after 1 iteration of clustering from these starting points, given the data (for K-means and for GMM)?


How can you estimate this for the GMM case?


Does someone know what the GMM would look like?
=> EM clustering example in the slides, plot it out

=> There should be an intuitive way of doing this, no :((((? HELP
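
One way to build intuition for the GMM case is to compare a single K-means iteration with a single EM iteration on the same points. The sketch below is 1-D and only updates the means. For the GMM it assumes equal mixing weights and a shared, fixed standard deviation `sigma`, which is a simplification of full EM. The exam figure is not available, so any data you plug in is hypothetical.

```python
import math


def kmeans_step(points, centers):
    """Hard-assign each point to its nearest center, then move centers to the cluster means."""
    clusters = [[] for _ in centers]
    for x in points:
        nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        clusters[nearest].append(x)
    # An empty cluster keeps its old center.
    return [sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)]


def gmm_mean_step(points, means, sigma=1.0):
    """One EM iteration for the means only (equal weights, shared fixed sigma assumed)."""
    # E-step: responsibility of each component for each point (soft assignment).
    resp = []
    for x in points:
        dens = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: each mean becomes the responsibility-weighted average of all points.
    new_means = []
    for k in range(len(means)):
        weight = sum(r[k] for r in resp)
        new_means.append(sum(r[k] * x for r, x in zip(resp, points)) / weight)
    return new_means
```

For example, with the made-up points `[0, 1, 4, 5]` and starting points `[0, 5]`, K-means moves the centers to exactly 0.5 and 4.5, while the GMM means end up slightly closer together because every point pulls a little on both means. So the intuitive answer is: the GMM update looks like the K-means update, but shifted towards the points of the other cluster in proportion to their (small) responsibilities.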

5. Short questions (I only remember the answers, not the questions)
a. LR and overfitting
b. GBRT with a small LR
c. Run a learning algorithm on data with actively acquired labels
d. Drawback of Toivonen's algorithm

6. Question on DTW (diagram, and how to improve DTW to make it robust to noise)
Does someone know the answer to this?
There are slides on Longest Common Subsequence (LCSS) that tackle the noise problem by allowing for gaps. They include the algorithm and an example.

7. kNN for outliers (rank the points from most to least anomalous)
8. Like the table on slides p54-55 (Some Data Puzzles)
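
For question 6, here is a minimal sketch of LCSS for two numeric time series. It assumes the common definition where two values match if they differ by at most a threshold `epsilon`; the version in the slides may also limit how far apart in time matched points can be, which is left out here. Unmatched points (e.g. noise spikes) are simply skipped, which is the "allowing for gaps" part.

```python
def lcss_length(a, b, epsilon):
    """Length of the longest common subsequence of a and b.

    Two values match if they differ by at most epsilon (assumed matching rule).
    """
    n, m = len(a), len(b)
    # table[i][j] = LCSS length of the first i values of a and first j values of b
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= epsilon:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Skip a value in one of the series (this is the gap).
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_similarity(a, b, epsilon):
    """Normalize the LCSS length by the length of the shorter series."""
    return lcss_length(a, b, epsilon) / min(len(a), len(b))
```

With the made-up series `[1, 2, 3, 4]` and `[1, 2, 100, 3, 4]` and `epsilon = 0.5`, the outlier 100 is skipped and all four values still match, whereas DTW would be forced to align the 100 with some point and pay for it.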

2021-07-18

1. Exercise on the generate + prune step of apriori (single iteration)
2. Compute LCSS (Time series)
3. Predict movie ratings using collaborative filtering
4. Exercise on complete link agglomerative clustering
5. GMM: rank the points from most to least anomalous
The data is represented by a mixture of Gaussians ⇒ each example x has a probability (density) p(x) of being generated by the GMM
High p(x) → the GMM is likely to generate this sample x → no anomaly
Low p(x) → the GMM is unlikely to generate this sample x → anomaly
How can he ask this? Given alpha and the probabilities of x belonging to a cluster?
Anomaly detection -> slide 13 → this is kNN for anomalies though: there, the point with the largest distance is the most anomalous. For GMM / probabilities you want to order from low to high p(x) (low chance of being generated, hence most likely anomalous)
6. Convert the data from a training set into the proper format for logistic regression
What are we supposed to do here?
I guess this is related to the fact that logistic regression requires numerical input data, so you need to convert categorical variables into indicator variables (dummy coding).
So e.g. when you have data with labels (small, medium, large), can you convert it to (0, 1, 2)? That is an ordinal encoding, which only makes sense because these labels are ordered; dummy coding would instead create one 0/1 indicator column per category (usually leaving one category out as the reference).
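
For question 5 (GMM anomaly ranking), a minimal 1-D sketch of the reasoning above: compute p(x) as the weighted sum of the component densities and sort from low to high. The representation of a component as an `(alpha, mean, sigma)` tuple is my own assumption about what would be given; on the exam you might instead get the densities directly.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1-D Gaussian N(mean, sigma^2) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def gmm_density(x, components):
    """p(x) = sum_k alpha_k * N(x | mean_k, sigma_k).

    components: list of (alpha, mean, sigma) tuples (assumed input format).
    """
    return sum(alpha * gaussian_pdf(x, mean, sigma) for alpha, mean, sigma in components)


def rank_most_to_least_anomalous(points, components):
    """Lowest p(x) first: the GMM is least likely to have generated those points."""
    return sorted(points, key=lambda x: gmm_density(x, components))
```

For a made-up mixture of two equally weighted unit-variance Gaussians at 0 and 10, the point 20 is ranked most anomalous, then 5 (in between the clusters), then 9, and 0 is the least anomalous. Note that the point halfway between the clusters scores as anomalous even though it is not far from the data as a whole, which is a difference with a plain distance-to-center ranking.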

sepm13 Katholieke Universiteit Leuven