QMB3302 Final assist
ENSEMBLE- Which of the following statements best describes an ensemble method in
machine learning? - answer A technique that combines the results of multiple models to
improve overall predictive accuracy (There are various ensemble methods, but the two
main types are:
Bagging (Bootstrap Aggregating): In bagging, multiple instances of the same learning
algorithm are trained on different subsets of the training data. Each model in the
ensemble is trained independently, and the final prediction is often made by averaging
or taking a vote among the predictions of individual models. Random Forest is a popular
example of a bagging ensemble method.
Boosting: In boosting, models are trained sequentially, with each new model focusing
on the mistakes made by the previous ones. The predictions of individual models are
weighted and combined to form the final prediction. Examples of boosting algorithms
include AdaBoost, Gradient Boosting, and XGBoost.
Ensemble methods are effective because they can reduce overfitting, improve
generalization, and handle different aspects of the data that may be challenging for a
single model to capture.)
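For concreteness, here is a minimal scikit-learn sketch contrasting the two types on synthetic data (the dataset and parameter choices are illustrative assumptions, not course material):

```python
# A minimal sketch, not from the course: bagging vs. boosting on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: 50 models trained independently on bootstrapped samples (the
# default base estimator is a decision tree); predictions are combined by vote.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Boosting: models trained sequentially, each reweighting the mistakes of the
# previous ones; AdaBoost is one of the boosting algorithms named above.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("bagging accuracy: ", bagging.score(X_test, y_test))
print("boosting accuracy:", boosting.score(X_test, y_test))
```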
KMEANS ALGORITHM- - answer In k-means, the algorithm runs through multiple iterations. For a simple 2D problem with k = 2: after the initial centroids are chosen, the distance of each point (or record) to each centroid is measured and each point is assigned to its nearest centroid; the centroids are then updated and the steps repeat.
(Initialization:
Randomly choose K initial cluster centroids.
Assignment (Expectation) Step:
Assign each data point to the nearest centroid.
Update (Maximization) Step:
Recalculate centroids based on the mean of points in each cluster.
Convergence:
Repeat the assignment and update steps until the centroids stabilize (convergence).
Result:
Obtain K clusters with data points assigned to each.
Considerations:
Sensitive to initial centroids.
Assumes spherical clusters.
Requires pre-specification of K.
Not robust to outliers.
Remember, it's crucial to run K-means multiple times with different initializations and
choose the solution with the lowest sum of squared distances for better results.)
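A minimal sketch of this with scikit-learn, assuming synthetic 2-D data; note that the n_init parameter automates the run-it-multiple-times advice by keeping the initialization with the lowest inertia (sum of squared distances):

```python
# A minimal sketch, assuming scikit-learn and synthetic 2-D data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs, matching the "simple 2D problem, k = 2" setup.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_init=10 reruns k-means with different random centroids and keeps the best fit.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)   # final cluster centers
print("inertia:", km.inertia_)               # sum of squared distances
print("first five labels:", km.labels_[:5])  # cluster assignment per point
```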
RANDOM FOREST ALGORITHM- - answer The random forest algorithm prevents, or at least mitigates to some extent, the overfitting problems found in individual decision trees.
(TRUE)
(Bootstrap Sampling:
Create multiple random subsets (bootstrapped samples) from the original dataset.
Decision Tree Construction:
Build a decision tree for each subset with random feature selection at each split.
Voting (Classification) or Averaging (Regression):
Aggregate predictions from all trees by majority vote (classification) or averaging
(regression).
Ensemble Result:
Obtain a robust and accurate prediction by combining individual tree predictions.
Key Features:
Reduces overfitting by combining diverse trees.
Handles missing values and maintains accuracy on complex datasets.
Provides feature importance information.
Considerations:
May be computationally intensive due to multiple trees.
Tends to be less interpretable compared to individual decision trees.
Random Forest is a powerful ensemble learning method that excels in various tasks,
including classification and regression, and is known for its robustness and ability to
handle complex datasets.)
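For illustration, a minimal scikit-learn sketch on synthetic data showing the voting forest and the feature-importance output mentioned above:

```python
# A minimal sketch on synthetic data (not course data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each built on a bootstrap sample with random feature selection
# at each split; class predictions are made by majority vote across the trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
print("feature importances:", rf.feature_importances_.round(3))
```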
COMMON CASE FOR RANDOM FOREST ALGORITHM - answer Classifying data into categories based on input features
SUPERVISED LEARNING- - answer A machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels
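A tiny illustrative sketch of this idea, with made-up numbers: the model is handed inputs paired with labels and learns the mapping:

```python
# A minimal sketch of supervised learning (toy numbers, assumed example).
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [10], [11], [12]]    # inputs (features)
y = [0, 0, 0, 1, 1, 1]                   # labels supplied with the inputs
model = LogisticRegression().fit(X, y)   # learn the inputs -> labels mapping
print(model.predict([[2.5], [10.5]]))    # maps new inputs to labels: [0 1]
```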
CLUSTERING ALGORITHM- - answer Situation of use: You have a dataset containing customer data for the Cheesecake Factory, and you want to examine customer spending at the restaurant in order to find patterns among customers who share similar characteristics (as sketched in the code below).
(K-means clustering is an unsupervised machine learning algorithm designed to
partition a dataset into K distinct, non-overlapping subsets (clusters). The algorithm
aims to minimize the variance or sum of squared distances between data points and
their assigned cluster centroids.)
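A hypothetical sketch of this situation: the customer features and values below are invented for illustration, not real restaurant data:

```python
# A hypothetical sketch: made-up customer spending features, not real data.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "avg_check": np.concatenate([rng.normal(25, 5, 50), rng.normal(70, 10, 50)]),
    "visits_per_month": np.concatenate([rng.normal(1, 0.3, 50), rng.normal(4, 1, 50)]),
})

# Scale features so neither dominates the distance measure, then cluster.
X = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(customers.groupby("segment").mean().round(1))  # profile each segment
```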