ISYE 6501 Week 5 HW Latest Update


Question 8.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.

While working at Canada’s biggest retail hardware store chain and building sales analytics from scratch, we had trouble gathering store transaction data for various legal reasons. Close to 30% of our 1100+ stores across the country were initially skeptical and reluctant to share their data because of the dealer-owner cooperative business model. We were receiving sales data from around 750 stores, and on that basis we were building dashboards and making business decisions around various financial, merchandising, and marketing goals. But we lacked complete insight, as a big chunk of the data was still unknown to us. We could speculate, but that was not reliable enough. At this point, we decided to build a regression model to predict what the total sales ($) of those ‘unknown’ stores might be, based on criteria such as:

I. Store area (in sq ft): expected to have a positive correlation with sales.
II. Primary LOB (hardware, building center, furniture, etc.): needs to be converted to numeric values; stores with building centers usually have larger sales.
III. Monthly average temperature of the area code (as sales could be seasonal): helps when we are estimating monthly sales for our monthly BI reports, or dealing with seasonality.

Question 8.2

Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:

M = 14.0
So = 0
Ed = 10.0
Po1 = 12.0
Po2 = 15.5
LF = 0.640
M.F = 94.0
Pop = 150
NW = 1.1
U1 = 0.120
U2 = 3.6
Wealth = 3200
Ineq = 20.1
Prob = 0.04
Time = 39.0

Show your model (factors used and their coefficients), the software output, and the quality of fit. Note that because there are only 47 data points and 15 predictors, you’ll probably notice some overfitting. We’ll see ways of dealing with this sort of problem later in the course.

Ans:

The uscrime dataset records the number of offenses per 100,000 population, a continuous response, together with a set of possible “predictors”:

#Variable  Description
#M  percentage of males aged 14–24 in total state population
#So  indicator variable for a southern state
#Ed  mean years of schooling of the population aged 25 years or over
#Po1  per capita expenditure on police protection in 1960
#Po2  per capita expenditure on police protection in 1959
#LF  labor force participation rate of civilian urban males in the age-group 14-24
#M.F  number of males per 100 females
#Pop  state population in 1960 in hundred thousand
#NW  percentage of nonwhites in the population
#U1  unemployment rate of urban males 14–24
#U2  unemployment rate of urban males 35–39
#Wealth  wealth: median value of transferable assets or family income
#Ineq  income inequality: percentage of families earning below half the median income
#Prob  probability of imprisonment: ratio of number of commitments to number of offenses
#Time  average time in months served by offenders in state prisons before their first release
#Crime  crime rate: number of offenses per 100,000 population in 1960

To understand more about the data, after loading it into a table I looked at the data summary and at a box plot to check for possible outliers. I did not remove any data points for this assignment; the check was mostly for discovery. The Crime values 1969, 1674, and 1993 showed up as the three highest values, outside the whiskers of the box plot. Grubbs' test could be used to decide whether to remove these outliers, but I skipped this step.

Next, I looked at the correlation matrix to check whether any pairs of variables are correlated with each other. I found a strong linear correlation between Po1 and Po2, with a correlation coefficient of 0.99. Wealth and Ineq are also strongly negatively correlated, with a correlation coefficient of -0.88. I also checked scatter plots of the predictors against Crime to get a visual idea of the correlations, which suggested that not all of them are significant for our model.

• lm: In the next step, I first fit a linear regression model using all the attributes in the dataset to create a baseline. The summary of this model shows that only 6 attributes have a p-value <= 0.1, so they are the only ones that are possibly significant. In real-life applications with a bigger volume of data, this threshold would usually be 0.05 or lower, but because we have only 47 data points and 15 predictors, I am using a looser threshold. For this model, R-squared = 0.8031.
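The R-squared of 0.8031 for the full model can be put in context with the adjusted R-squared, which penalizes the number of predictors and is therefore a rough check on the overfitting that the question warns about. The sketch below is written in Python rather than the R used for the assignment. The function name `adjusted_r_squared` is a hypothetical helper, not part of any library. It uses only numbers stated in the text (47 data points, 15 predictors, R-squared = 0.8031) and the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Return adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Hypothetical helper for illustration; n_predictors excludes the intercept.
    """
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Values taken from the text: 47 rows, 15 predictors, R^2 = 0.8031.
adj = adjusted_r_squared(0.8031, n_obs=47, n_predictors=15)
print(f"adjusted R^2 = {adj:.4f}")  # prints "adjusted R^2 = 0.7078"
```

The gap between 0.8031 and roughly 0.71 reflects how few residual degrees of freedom (31) remain once 15 predictors and an intercept are fit to 47 rows.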

