DATA MINING FINAL EXAM Q&A
What constitutes a good cluster? What are our two main goals during clustering? -
Answer-A good cluster has a) small intra-cluster distances (high intra-cluster
similarity) and b) large inter-cluster distances (low inter-cluster similarity). The two
main goals are therefore to minimize intra-cluster distances and to maximize
inter-cluster distances.
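The two distance goals can be checked numerically. The sketch below is only an illustration on hypothetical toy data (the points, cluster labels, and function names are assumptions, not part of the exam material); it uses average pairwise Euclidean distance as one possible way to measure intra- and inter-cluster distance.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two clusters of 2-D points.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_cluster_distance(points):
    """Average pairwise distance between points of the same cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_cluster_distance(points_a, points_b):
    """Average distance between points drawn from two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))

intra_a = mean_intra_cluster_distance(clusters["A"])
intra_b = mean_intra_cluster_distance(clusters["B"])
inter_ab = mean_inter_cluster_distance(clusters["A"], clusters["B"])

# A good clustering keeps the intra values small relative to the inter value.
print(intra_a, intra_b, inter_ab)
```

For a good clustering, both intra-cluster values should come out much smaller than the inter-cluster value.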
What are the two main clustering methods? Explain the differences between them. -
Answer-The two types of clustering are a) partitional and b) hierarchical. There are
two main differences between these clustering methods. One is that partitional
clustering produces flat, non-overlapping clusters, whereas hierarchical clustering
creates nested clusters. Another difference is that hierarchical clustering doesn't have
to assume any particular number of clusters, whereas partitional clustering needs the
number of clusters to be specified in advance.
What are the trade-offs to consider while selecting a minimum support threshold
value? - Answer-If minsup is set too high, we could miss item sets involving
interesting rare items (e.g., expensive products). Alternatively, if minsup is set too
low, mining becomes computationally expensive and the number of item sets becomes
very large.
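The trade-off can be seen by counting frequent item sets at several thresholds. This is a brute-force sketch on hypothetical transactions (the data and the `frequent_itemsets` helper are assumptions for illustration); a real miner such as Apriori would prune candidates rather than enumerate all of them.

```python
from itertools import combinations

# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def frequent_itemsets(transactions, minsup):
    """Return {itemset: support} for every item set with support >= minsup."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    # Brute force: try every possible combination of items.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                frequent[candidate] = count / n
    return frequent

# Lowering minsup makes the number of frequent item sets grow quickly.
counts = {m: len(frequent_itemsets(transactions, m)) for m in (0.8, 0.6, 0.4, 0.2)}
print(counts)
```

On this toy data, the highest threshold keeps only the two most common single items, while the lowest threshold admits every item combination that occurs at least once.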
Briefly discuss the similarities and differences between association rule mining and
collaborative filtering. - Answer-Association rule mining (ARM) focuses on frequent
item combinations whereas collaborative filtering (CF) focuses on user preferences.
ARM's data rows are single transactions and ignore the user dimension, whereas CF's
data rows are user purchases or ratings over time. ARM is used in displays (what
goes with what), whereas CF is useful for recommendations involving unusual items.
Define the Information Retrieval task in text analytics and briefly explain how a
typical Information Retrieval system works - Answer-Information Retrieval is finding
documents whose set of words most closely matches the words in a query. The system
works in three main steps: 1) taking the query string as input, 2) cross-referencing
this query with the document corpus, and 3) ranking the documents.
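The three steps can be sketched in a few lines. This is a minimal illustration, assuming plain word overlap as the matching score and a hypothetical three-document corpus; the `tokenize` and `search` names are made up for the example, and production systems typically use weighted schemes rather than raw overlap.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())

def search(query, corpus):
    """Rank documents by how many query words they contain."""
    query_words = tokenize(query)                    # step 1: take the query
    scored = []
    for doc_id, text in corpus.items():              # step 2: match against corpus
        overlap = len(query_words & tokenize(text))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # step 3: rank, highest overlap first; ties broken by document id
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical corpus for illustration only.
corpus = {
    "d1": "k-means clustering assigns points to the nearest centroid",
    "d2": "association rules use support and confidence",
    "d3": "hierarchical clustering builds nested clusters",
}

ranked = search("nested clustering", corpus)
print(ranked)
```

Documents that share no word with the query are dropped, and the rest are returned best match first.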
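Support Formula - Answer-(# of transactions containing both X and Y) / (total # of
transactions)
Confidence Formula - Answer-(# of transactions containing both X and Y) / (# of
transactions containing X)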
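Both formulas reduce to counting transactions. The sketch below applies them to hypothetical basket data for the rule bread -> milk; the transactions and function names are assumptions for illustration.

```python
# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Transactions containing both X and Y, divided by all transactions."""
    return count_containing(set(x) | set(y), transactions) / len(transactions)

def confidence(x, y, transactions):
    """Transactions containing both X and Y, divided by those containing X."""
    both = count_containing(set(x) | set(y), transactions)
    return both / count_containing(x, transactions)

rule_support = support({"bread"}, {"milk"}, transactions)        # 3 of 5
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of 4
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 3 of 5 transactions, and bread appears in 4, so support is 3/5 and confidence is 3/4.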
Network Density - Potential Connections - Answer-[n * (n-1)] / 2 (where n = number
of nodes)
Network Density - Actual connections - Answer-# of links
Network Density - Answer-Actual / Potential = # of links / ([n * (n-1)] / 2)
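As a worked example of the density formula, the sketch below uses a hypothetical 4-node network with 4 links. It assumes an undirected network with no self-links, which is what the n * (n-1) / 2 count of potential connections implies; the function name is made up for the example.

```python
def network_density(num_nodes, edges):
    """Actual connections divided by potential connections (undirected)."""
    potential = num_nodes * (num_nodes - 1) / 2   # [n * (n-1)] / 2
    actual = len(edges)                           # number of links
    return actual / potential

# Hypothetical network: 4 nodes, 4 links.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
density = network_density(4, edges)
print(density)
```

With 4 nodes there are 6 potential connections, so 4 links give a density of 4/6.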
Distance between nodes - Answer-Length of the shortest path between them
Highest degree centrality - Answer-Node with most links
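Both measures can be computed from a list of links. This is a sketch on a hypothetical undirected, unweighted network: distance is found with breadth-first search (counting links on the shortest path), and degree is the number of links touching a node. The edge list and helper names are assumptions for illustration.

```python
from collections import deque

# Hypothetical undirected network.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

def build_adjacency(edges):
    """Map each node to the set of nodes it is linked to."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def distance(adjacency, start, goal):
    """Number of links on the shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

adjacency = build_adjacency(edges)
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
most_central = max(degrees, key=degrees.get)   # node with the most links
hops_a_to_e = distance(adjacency, "A", "E")
print(most_central, hops_a_to_e)
```

In this network, C has the most links, and the shortest path from A to E runs through C and D.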
Briefly explain the initial centroid selection problem in k-means clustering and
suggest possible ways to overcome this problem - Answer-The initial centroid problem
occurs because the initial centroids are picked at random. This can affect the