Georgia Tech, Question 10.1, Questions and answers, Rated A+ 2022/2023
ISYE 6501 Week 7 HW

Question 10.1
Using the same crime data set as in Questions 8.2 and 9.1, find the best model you can using (a) a regression tree model, and (b) a random forest model. In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don't just stop when you have a good model, but interpret it too).

Regression tree model
The data set contains only 47 points, so a regression tree may not be able to produce many splits, or it may overfit, and we cannot say for sure that the model would work as well on a larger data set. For this regression tree I did not split the data into training and validation sets; I used all the data points to build the model. The initial model used "Po1", "Pop", "LF" and "NW", and its residual mean deviance was 47390. This tree had 7 terminal nodes.

In the next step I pruned this tree to successively fewer leaf nodes, from 6 down to 2, to look at the residual mean deviance, which kept increasing as I dropped nodes. It might seem that 7 leaf nodes gives the best-fitting model, but with such a small sample this tree is overfitted. To address this I applied cross-validation. Instead of computing the deviance on the full training data, cross-validation computes it on held-out data for each of the 6 successive prunings. Comparing the two sets of outputs, the cross-validated deviances are higher at every step. Pruning alone is evaluated on the training data and so under-reports the deviance; the cross-validated values are more realistic. My random cross-validation showed that even with 6 leaf nodes the RMSE is very close to that with 7.
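The gap between training deviance and cross-validated deviance can be illustrated with a toy sketch. This is not the R tree-package workflow used for the homework. It assumes a made-up six-point data set, a tree with a single split, leave-one-out cross-validation in place of whatever fold scheme was actually used, and deviance taken as the residual sum of squares. All function and variable names are hypothetical.

```python
# Toy illustration only: the data below are invented, not the crime data set.

def leaf_mean(values):
    """Prediction of a leaf: the mean response of the points in it."""
    return sum(values) / len(values)

def fit_stump(xs, ys):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = leaf_mean(left), leaf_mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (rss, threshold, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]

def predict(stump, x):
    """Send x to the left or right leaf and return that leaf's mean."""
    threshold, lm, rm = stump
    return lm if x < threshold else rm

def training_deviance(xs, ys):
    """Residual sum of squares when the tree is scored on its own training data."""
    stump = fit_stump(xs, ys)
    return sum((y - predict(stump, x)) ** 2 for x, y in zip(xs, ys))

def loo_deviance(xs, ys):
    """Residual sum of squares when each point is predicted by a tree fit without it."""
    total = 0.0
    for i in range(len(xs)):
        stump = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(stump, xs[i])) ** 2
    return total

# Invented data with one obvious step between x = 3 and x = 5.
XS = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
YS = [1.0, 2.0, 1.0, 9.0, 10.0, 9.0]

if __name__ == "__main__":
    print(f"training deviance:      {training_deviance(XS, YS):.2f}")  # 1.33
    print(f"leave-one-out deviance: {loo_deviance(XS, YS):.2f}")       # 3.00
```

Even on this tiny example the held-out deviance is more than twice the training deviance, which is the same direction of bias described above for the pruned trees.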
So I chose to prune the tree to 6 leaf nodes and then calculated the R² of both the unpruned and pruned models, which were very close to each other, within the 0.70 to 0.72 range. If the cross-validation sampling were done differently, the minimum RMSE could occur at some other number of leaf nodes, and a regression tree built on such "limited" training data may still be overfitted.

Takeaway: Po1 is the variable on which the first split happens, and LF is possibly the least important one, as it is dropped first in the pruned tree. NW also appears to be more important than Pop: within the same branch, pruning removed Pop but kept NW.

Random forest model
To choose the nodesize and mtry of the random forest model, I looped over node sizes from 2 to 15 and mtry values from 1 to 10 and charted the R² values to find the optimal settings. mtry = 3 and nodesize = 3 gave the highest R² = 0. I then fitted the model with these values and looked at the importance of the variables in the model.

Takeaway: The random forest used more variables than the regression tree but did not produce better R² values. Possibly this is because we do not have enough data for this method, so most of the trees were very similar to each other. The charts suggest that increasing the number of variables sampled at each split actually decreases the accuracy of this model.

Question 10.2
Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.

When sending out targeted emails with offers, our marketing team at a leading automotive company would use a logistic regression model to determine which types of email offers particular groups of customers would respond to.
The predictors that could be used are: customer age group, type of car owned, age of the car, frequency of services used at the dealership, past offer redemption types, etc. Based on these, customer segments are created and the emails are formatted accordingly through Salesforce. Once the recipients click through and we get back the sales and service data from dealerships, these feed back into the model for further adjustments.

Question 10.3
1. Using the GermanCredit data set from
Written for
- Institution: Georgia Tech
- Course: Georgia Tech
Document information
- Uploaded on: April 25, 2023
- Number of pages: 44
- Written in: 2022/2023
- Type: Exam (elaborations)
- Contains: Questions & answers