
Georgia Institute of Technology, ISYE 6501: Week 10 Homework Solutions, Spring 2021


WEEK 10 HOMEWORK – SAMPLE SOLUTIONS

IMPORTANT NOTE

These homework solutions show multiple approaches and some optional extensions for most of the questions in the assignment. You don’t need to ... submit all this in your assignments; they’re included here just to help you learn more, because the main goal of the homework assignments, and of the entire course, is to help you learn as much as you can and develop your analytics skills as much as possible!

Question 14.1

The breast cancer data set breast-cancer-wisconsin.data.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ (description at http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 ) has missing values.

1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.
4. (Optional) Compare the results and quality of classification models (e.g., SVM, KNN) built using (1) the data sets from questions 1, 2, and 3; (2) the data that remains after data points with missing values are removed; and (3) the data set when a binary variable is introduced to indicate missing values.

Here’s one possible solution. Please note that a good solution doesn’t have to try all of the possibilities in the code; they’re shown to help you learn, but they’re not necessary.

The file solution 14.1.R shows one possible solution. In it, missing data is identified (only variable V7 has any, and it is only a small amount). Five different data sets are created to deal with the missing data:

(1) Replacing missing values with the mode. This could have gone either way (mode or mean). The data is categorical, but it takes integer values from 1 to 10, and as we’ll see later the values seem to have some relative meaning, so they’re also somewhat continuous.
(2) Using regression to estimate missing values. Here too, the choice could have gone either way (see above), but since we didn’t cover multinomial logistic regression in this course, the solutions treat the data as continuous for this part. Once the missing values are estimated, the estimates are rounded (because the original values are all integers) and values larger or smaller than the extremes are shrunk to the extremes.
(3) Using regression plus perturbation.
(4) Removing rows with missing data.
(5) Adding a binary variable to indicate when data is missing, and adding the necessary interaction variables as well.

Once the data sets have been created, we use KNN (for k=1,2,3,4,5) and SVM (C=0.0001,0.001,0.01,0.1,1,10) to create classification models, and measure their quality ...
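The first three imputation approaches described above can be sketched in a few lines. This is only an illustration, not the contents of solution 14.1.R: it is written in Python rather than R, it uses a single hypothetical predictor column instead of the other variables in the data set, the toy numbers are made up, and all function names are invented for this example. Drawing the perturbation from a normal distribution with the residual standard deviation is also an assumption, since the text does not say how the perturbation is generated.

```python
import random
import statistics


def mode_impute(values):
    """Replace missing entries (None) with the most common observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; also returns the residual std. dev."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    resid_sd = (sse / (len(xs) - 2)) ** 0.5
    return a, b, resid_sd


def clamp_round(value, low=1, high=10):
    """Round to an integer and shrink out-of-range values to the extremes."""
    return max(low, min(high, round(value)))


def regression_impute(predictor, target, perturb=False, rng=None):
    """Fill missing target entries from a regression on one predictor.

    With perturb=True, normally distributed noise (assumed here to have the
    residual standard deviation) is added before rounding and clamping.
    """
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    a, b, resid_sd = fit_simple_regression(
        [x for x, _ in pairs], [y for _, y in pairs]
    )
    rng = rng or random.Random(0)
    filled = []
    for x, y in zip(predictor, target):
        if y is None:
            estimate = a + b * x
            if perturb:
                estimate += rng.gauss(0, resid_sd)
            y = clamp_round(estimate)
        filled.append(y)
    return filled


# Made-up toy columns on the 1..10 scale; None marks a missing value.
predictor = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
target = [1, 1, None, 1, 4, 7, None, 8, 10, 10]

print(mode_impute(target))
print(regression_impute(predictor, target))
print(regression_impute(predictor, target, perturb=True, rng=random.Random(42)))
```

Mode imputation gives every missing entry the same value. Regression imputation lets the estimate follow the predictor, and perturbation adds noise back so the imputed values are not all placed exactly on the fitted line.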

