DATA MINING FINAL EXAM
QUESTIONS AND ANSWERS
What are ANNs (Artificial Neural Networks)? Execution? - Answer-An attempt to
replicate the non-linear learning found in nature.
Execution:
1. prepare data
2. design network architecture
3. initialize weights
4. train using backpropagation
5. evaluate
What is an SVM (Support Vector Machine)? Execution? - Answer-A versatile algorithm
that finds the optimal hyperplane, the one that maximizes the margin between
classes. Suitable for complex classification and regression tasks.
Execution:
1. prepare/split data
2. train with suitable parameters
3. evaluate
What are Bayesian Methods? Execution? - Answer-Methods that use Bayes'
theorem to compute and update probabilities after obtaining data
Execution:
1. define prior probabilities
2. update using observed data to obtain posterior probabilities
3. perform inference or predictions
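The three steps above can be sketched in a few lines. This is a minimal illustration, not a full Bayesian workflow; the function name `posterior` and the spam numbers are assumptions made up for the example.

```python
def posterior(priors, likelihoods):
    """Return P(h | data) for each hypothesis h via Bayes' theorem."""
    # Unnormalized posterior: P(data | h) * P(h)
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(data), the normalizing constant
    return {h: joint[h] / evidence for h in joint}

# Step 1: prior probabilities (hypothetical: 1% of emails are spam).
priors = {"spam": 0.01, "ham": 0.99}
# Step 2: update with observed data (hypothetical: the word "offer"
# appears in 60% of spam and 5% of non-spam).
likelihoods = {"spam": 0.60, "ham": 0.05}
post = posterior(priors, likelihoods)
# Step 3: predict the most probable hypothesis.
prediction = max(post, key=post.get)
```

Even though "offer" is far more common in spam, the low prior keeps the posterior for spam near 11%, so the prediction stays "ham".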
What are representative ensemble methods and their main idea? - Answer-Use a
combination of models to increase accuracy over any single model.
ex. bagging, boosting, random forest
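One simple way to combine models is majority voting, which is roughly how bagging aggregates its classifiers. The sketch below assumes three already-trained classifiers; `majority_vote` and the prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical outputs of three classifiers on four samples whose
# true labels are [1, 0, 1, 1]; each model makes exactly one mistake.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
combined = majority_vote([model_a, model_b, model_c])
```

Because the models err on different samples, the combined vote recovers every true label even though no single model does.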
What is clustering? - Answer-Grouping data objects into clusters, where a cluster
is a collection of objects similar to one another within the same group and
dissimilar to objects in other groups
Good vs Bad clustering? - Answer-Good: high intra-class similarity (cohesive within
cluster), low inter-class similarity (distinctive between clusters)
Bad: shows poor separation and lack of clear structure
What is k-means? - Answer-An algorithm that partitions data into a specified
number (k) of clusters. Each cluster is represented by its center (the mean of its
points), and each data point is assigned to the cluster with the nearest center.
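A minimal sketch of the assign-then-update loop, assuming Euclidean distance and caller-supplied initial centers (real implementations usually pick the initial centers randomly or with a seeding heuristic). The function name and sample points are made up for illustration.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate the assignment and center-update steps until stable."""
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[(1.0, 1.0), (8.0, 8.0)])
```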
What is k-medoids? - Answer-Uses medoids (the most centrally located object in
each cluster) as reference points instead of the mean
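The difference from the mean shows up when a cluster contains an outlier. The sketch below only computes a medoid for one 1-D cluster; it is not a full k-medoids (PAM) implementation, and the values are hypothetical.

```python
def medoid(points):
    """Return the member with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

values = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical 1-D cluster with an outlier
mean = sum(values) / len(values)       # pulled toward the outlier
center = medoid(values)                # always an actual data object
```

Here the mean is 22.0, which is not close to any typical member, while the medoid is 3.0.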
What is AGNES (Agglomerative Nesting)? - Answer-Uses the single-linkage method and
a dissimilarity matrix: merge the nodes that have the lowest dissimilarity and
continue until all nodes are in the same cluster
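The merge loop can be sketched directly from a dissimilarity matrix. This is a naive illustration for small inputs; `agnes_single_linkage` and the matrix values are assumptions for the example.

```python
def agnes_single_linkage(dist):
    """Merge the two closest clusters until one remains; return the merge log."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = smallest pairwise dissimilarity.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Hypothetical dissimilarity matrix for four objects.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
merges = agnes_single_linkage(dist)
```

The merge log is the information a dendrogram would display: objects 0 and 1 join at dissimilarity 2, objects 2 and 3 at 4, and the two pairs at 5.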
How does BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) work?
- Answer-Designed for large datasets. Incrementally builds a hierarchical data
structure called a CF (clustering feature) tree that summarizes the data. Uses a
2-phase approach: first build the CF tree, then cluster its leaf entries to refine
the result.
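In the standard BIRCH formulation (an assumption here, since these notes do not spell it out), a clustering feature is the triple (N, LS, SS): the point count, the linear sum, and the sum of squares. The sketch below shows why that summary supports incremental building; the helper names are hypothetical and points are 1-D for brevity.

```python
def make_cf(points):
    """Clustering feature of 1-D points: (N, linear sum, sum of squares)."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge_cf(a, b):
    """CFs are additive, so merging clusters never revisits the raw points."""
    return tuple(x + y for x, y in zip(a, b))

def centroid(cf):
    """Cluster centroid recovered from the summary alone."""
    n, ls, _ = cf
    return ls / n

cf1 = make_cf([1.0, 2.0, 3.0])
cf2 = make_cf([5.0, 7.0])
merged = merge_cf(cf1, cf2)
```

The merged feature equals the feature computed from all five points at once, which is what lets the tree absorb new data in a single pass.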
DBSCAN Pros vs Cons - Answer-pros: resistant to noise; can handle clusters of
different shapes and sizes
cons: cannot handle varying densities; sensitive to parameters
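How to run DBSCAN? - Answer-1. arbitrarily select an unprocessed point p
2. retrieve all points density-reachable from p with respect to Eps and MinPts; if p
is a core point, these points form a cluster, otherwise move on to the next point
3. continue the process until all points have been processed
Only one scan is needed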
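A compact sketch of this procedure, assuming Euclidean distance and visiting points in index order rather than randomly so the result is reproducible. The neighbor search is brute force, and the function name and sample points are made up for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # not a core point: provisionally noise
            continue
        cluster += 1                 # core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts: # j is also core: keep expanding
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The isolated point at (50, 50) has too few neighbors to be a core point and is not reachable from any cluster, so it is labeled as noise.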
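How does a Kohonen network (self-organizing map) work? - Answer-1. competition:
output nodes compete to be the best match for the input
2. cooperation: the winning node's neighbors are also selected for updating
3. adaptation: the weights of the winner and its neighbors move toward the input
4. adjust the learning rate and neighborhood size as needed
5. stop when the termination criteria are met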
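Steps 1 to 3 can be sketched as a single training step on a 1-D map. This is a simplification: it assumes a hard neighborhood cutoff (`radius`) where many implementations use a smooth neighborhood function, and all names and numbers are hypothetical.

```python
import math

def som_step(weights, x, lr, radius):
    """One training step of a 1-D Kohonen map; returns (winner, new_weights)."""
    # Competition: the node whose weight vector is closest to x wins.
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    new_weights = []
    for i, w in enumerate(weights):
        # Cooperation: only nodes within `radius` grid positions of the winner learn.
        if abs(i - winner) <= radius:
            # Adaptation: move the weight vector a fraction lr toward x.
            w = tuple(wi + lr * (xi - wi) for wi, xi in zip(w, x))
        new_weights.append(w)
    return winner, new_weights

weights = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]   # hypothetical 3-node map
winner, updated = som_step(weights, x=(0.9, 1.0), lr=0.5, radius=1)
```

Step 4 would correspond to calling `som_step` repeatedly while shrinking `lr` and `radius`, and step 5 to stopping once the weights barely change or an iteration limit is reached.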
Hopkins Statistic - Answer-A measure used to assess the clustering tendency of a
dataset by quantifying the degree of clustering vs randomness in the data
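Silhouette Coefficient - Answer-Evaluates clustering quality by measuring
compactness (how close a point is to its own cluster) and separation (how far it is
from the nearest other cluster). Ranges from -1 to 1; higher is better.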
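For a single point the coefficient is commonly written as s = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. A minimal sketch for 1-D points follows; the function name and data are made up for illustration.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points (needs at least 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        same = [abs(p - q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)       # convention for singleton clusters
            continue
        a = sum(same) / len(same)    # compactness: mean distance within own cluster
        b = min(                     # separation: mean distance to nearest other cluster
            sum(abs(p - q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [1.0, 2.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])   # natural grouping
bad = silhouette(points, [0, 1, 0, 1])    # interleaved grouping
```

The natural grouping scores close to 1, while the interleaved grouping scores below 0 because each point is nearer to the other cluster than to its own.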
What is association rule mining and the motivation? - Answer-A technique used in
data mining to discover relationships or associations between variables in large
datasets.
motivation: finding inherent regularities in data
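What is frequent itemset mining? How is it done? - Answer-A task in data mining that
involves identifying sets of items that regularly occur together in a dataset. It is
typically done by counting how often each itemset appears and keeping those that
meet a minimum support threshold, for example with Apriori.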
Apriori - Answer-Efficiently identifies frequent itemsets in a dataset by pruning
any candidate that has an infrequent subset, then generates association rules from
the frequent itemsets
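A level-wise sketch of the idea, assuming support is measured as an absolute transaction count. It relies on the Apriori property that every subset of a frequent itemset must itself be frequent; the function name and the basket data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(baskets, min_support=3)
# Confidence of the rule bread -> milk: support(bread, milk) / support(bread).
confidence = frequent[frozenset({"bread", "milk"})] / frequent[frozenset({"bread"})]
```

With these baskets the only frequent pair is {bread, milk}, and the rule bread -> milk has confidence 3/4.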
What is confidence interval estimation? - Answer-A statistical technique used to
estimate a range within which a population parameter is likely to lie with a specified
level of confidence.
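As a rough illustration, an interval for a population mean can be computed with the normal approximation. This sketch assumes Python 3.8+ for `statistics.NormalDist`; for small samples a t-based interval is usually more appropriate. The sample values are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(len(sample))
    m = mean(sample)
    return m - margin, m + margin

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]   # hypothetical measurements
low, high = mean_confidence_interval(sample)
```

Raising `confidence` widens the interval, and a larger sample narrows it.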
What is hypothesis testing? - Answer-A procedure used to make inferences about a
population based on sample data
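One of the simplest such procedures is a two-sided one-sample z-test, sketched below. It assumes the population standard deviation `sigma` is known and uses a 0.05 significance level; both, along with the data, are assumptions chosen for the example.

```python
import math
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test; returns (z statistic, p-value)."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Null hypothesis: the population mean is 5.0.
sample = [5.4, 5.6, 5.5, 5.7, 5.3, 5.5, 5.6, 5.4, 5.5]   # hypothetical data
z, p = z_test(sample, mu0=5.0, sigma=0.3)
reject_null = p < 0.05
```

The sample mean of 5.5 lies five standard errors above 5.0, so the p-value is far below 0.05 and the null hypothesis is rejected.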