
Georgia Tech ISYE 6501: Full Course Notes (2022 update)


Contents

● Week 1: Why Analytics?; Data Vocabulary; Classification; Support Vector Machines; Scaling and Standardization; k-Nearest Neighbor (KNN)
● Week 2: Model Validation; Validation and Test Sets; Splitting the Data; Cross-Validation; Clustering; Supervised vs. Unsupervised Learning
● Week 3: Data Preparation; Introduction to Outliers; Change Detection
● Week 4: Time Series Data; AutoRegressive Integrated Moving Average (ARIMA); Generalized Autoregressive Conditional Heteroskedasticity (GARCH)
● Week 5: Regression; Regression Coefficients; Causation vs. Correlation; Important Indicators in the Output
● Week 6: De-Trending; Principal Component Analysis (PCA)
● Week 7: Intro to CART; How to Branch; Random Forests; Logistic Regression; Confusion Matrices
● Week 8: Intro to Variable Selection; Models for Variable Selection; Choosing a Variable Selection Model
● Week 9: Intro to Design of Experiments; Factorial Design; Multi-Armed Bandits; Intro to Advanced Probability Distributions; Bernoulli, Binomial, and Geometric Distributions; Poisson, Exponential, and Weibull Distributions; Q-Q Plot; Queuing; Simulation Basics; Prescriptive Simulation; Markov Chains
● Week 10: Intro to Missing Data; Dealing with Missing Data; Imputation Methods; Intro to Optimization; Elements of Optimization Models; Modeling with Binary Variables
● Week 11: Optimization for Statistical Models; Classification of Optimization Models; Stochastic Optimization; Basic Optimization Algorithms; Non-Parametric Models; Bayesian Modeling; Communities in Graphs; Neural Networks and Deep Learning; Competitive Models

Week 1

Life is full of mysteries. Although that can feel a bit overwhelming at times, the interesting thing is that we can use math to explain a lot of what we see as the "unknown." In fact, that's the goal of the field of analytics. Rather than looking at our businesses or organizations and wondering what will work and what won't, we can use analytics to sift through our data to explain why something happened, or why one idea will work while another won't. If you're interested in learning more about how that works, you're in the right place. In this post, I go through the content in week 1 of ISYE 6501 to make sense of what analytics is and how we can use simple machine learning models to make better decisions.

Why Analytics?

We can use analytics to answer important questions, and we can break those questions down into three types:

● Descriptive questions (What happened?): What effect does spin rate have on how hard someone hits the ball? Which teachers in the school produce the best exam results?
● Predictive questions (What will happen?): How much will the global temperature increase in the next 100 years? Which product will be most popular?
● Prescriptive questions (What actions would be best?): When and where should firefighters be placed? How many delivery drivers should the pizza shop have on hand on particular days and times?

In short, we can use analytics to make sense of the world around us and to make better decisions in a complex world. And we do this through something called modeling.

Modeling

Modeling is a way to mathematically describe a real-world situation so that we can understand why something happened (or will happen) and what we can do about it. People often use the word "modeling" to mean three different things:

● Expressing a real-life situation as math
● Analyzing the math
● Translating the mathematical analysis back into a real-life solution

To see how the word "model" can be used at different levels of detail, note that all of the following are models:

● Regression
● Regression based on size, weight, and distance
● Regression estimate = 37 + 81 x Size + 76 x Weight + 4 x Distance

Later in this post, we'll look at some of the more popular machine learning models.

Data Vocabulary

Data Table: A display of information in a grid-like format of rows and columns.
Row: Contains one record, with a value for each column.
Column: Contains the name, data type, and any other attributes of one piece of data recorded for every row.
Structured Data: Data that can be stored in a structured format, like a data table.
Unstructured Data: Data not easily stored or described (e.g., text from social media).
Quantitative Data: Numbers with quantitative meaning (e.g., 3 baseballs).
Categorical Data: Numbers or labels without quantitative meaning (e.g., an area code or country of origin).
Binary Data: Data that takes one of two values (e.g., yes or no).
Unrelated Data: Data points with no relationship between them (e.g., players on different teams).
Time Series Data: The same data recorded over time (e.g., an athlete's performance over time).
Scaling Data: Transforming your data so that features fall within a specific range (e.g., 0 to 1).
Standardizing Data: Transforming your observations so they can be described by a normal distribution with mean 0 and standard deviation 1.
Validation: Verifying that models are performing as intended.

Classification

Classification is just what it sounds like: putting things into categories. In the real world, this might look like an email service classifying an email as spam or not spam, or an artificial intelligence umpire classifying a pitch as a ball or a strike. The simplest classification is something like "yes" or "no," but you can have several categories as well. For example, you might want to break a population down by household income:

● $59,999 or less per year
● $60,000-$99,999
● $100,000 and up

To use classification as a prediction, you need other data points. In the example above, we could collect data on things like education level, gender, race, age, or a range of other attributes, and use them to predict which category people will fall into. When using classification, we want to choose a good classifier, meaning one that minimizes our errors. However, choosing a classifier isn't always simple, because our data won't always split cleanly into two well-separated groups; sometimes the groups overlap. In cases like that, we need to figure out which classifier minimizes our errors so that we still get something productive. So, when using classification, you have two different classifier options:

● Hard classifiers: separate the groups perfectly
● Soft classifiers: give as good a separation as possible when perfect separation isn't achievable

How you decide where to draw your classifying line depends a lot on the cost of each kind of mistake. If you're classifying whether something is explosive or not, for example, you may want to err on the side of classifying a non-explosive object as explosive rather than the other way around. One special case worth noting is when your classifier line is completely horizontal or vertical: if the line is vertical, only the variable on the x-axis matters for classification, and if it is horizontal, only the variable on the y-axis matters.

Support Vector Machines

Support vector machines (SVMs) are supervised machine learning models used for classification. The name comes from the idea that a line touching the edge of a group of points "supports" that group; the points sitting on those parallel boundary lines are the support vectors, and the "machine" part refers to the algorithm that finds them automatically. The goal is to maximize (optimize) the space between the support vectors on either side, which minimizes errors between the classes. The notation:

● n data points
● m attributes
● xij is the ith attribute of the jth data point (e.g., i = 1 is the credit score of person j, i = 2 is the income of person j)
● yj is the response for data point j (for example, +1 or -1 depending on which class it belongs to)

The classification line, where a0 is the intercept and ai is the coefficient of attribute i (if ai is close to 0, that attribute is not relevant for classification), is

a1*x1j + a2*x2j + ... + am*xmj + a0 = 0

and a point is classified according to which side of the line it falls on (the sign of the left-hand side). The goal of the SVM is to maximize the margin (the distance between the two parallel lines through the support vectors) while keeping everything classified correctly.

In soft classification, you trade off maximizing the margin against minimizing the error. The objective is to minimize

sum over j of max{0, 1 - yj*(a1*x1j + ... + am*xmj + a0)} + λ*(a1^2 + ... + am^2)

where the first term measures the total classification error and the second term is related to the margin (smaller coefficients mean a wider margin). Lambda (λ) controls the trade-off: as λ grows, the margin outweighs any error, and as λ approaches zero, minimizing mistakes becomes much more important. We can also add a multiplier mj to each error term to weight the errors, with a larger multiplier marking a more costly mistake than a smaller one.

Although the example here is just two dimensions, you can use SVMs on data with as many dimensions as you want. It's also important to note that the classifier doesn't need to be a straight line. Finally, you don't always have to classify into hard buckets, because you can use SVMs to give recommendations as probabilities. For example, you could build your SVM and then say there's an 87% chance that person A falls into the $100,000+ income category.
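To make this concrete, here is a minimal sketch (not part of the original notes) of fitting a linear soft-margin SVM in Python with scikit-learn. The credit-score/income numbers are made up for illustration, and scikit-learn's C parameter plays roughly the role of 1/λ in the objective above.

```python
# Minimal sketch: fitting a linear soft-margin SVM with scikit-learn.
# The data is made up for illustration (credit score and income of 6 people).
import numpy as np
from sklearn.svm import SVC

X = np.array([[620, 40000],
              [710, 65000],
              [550, 30000],
              [760, 90000],
              [680, 52000],
              [590, 36000]], dtype=float)
y = np.array([-1, 1, -1, 1, 1, -1])   # response: -1 = deny, +1 = approve

# C controls the error/margin trade-off (roughly 1/lambda):
# large C punishes misclassification more, small C favors a wide margin.
model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

print("coefficients a_i:", model.coef_)     # slope of the separating line
print("intercept a_0:  ", model.intercept_)
print("prediction:", model.predict([[650, 48000]]))
```

Note that in practice you would scale these two features first, since income is on a much larger scale than credit score; that is exactly what the next section is about.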
Scaling and Standardization

One issue with SVM models is that the model may be thrown off if features vary widely in range. Remember that the SVM's goal is to maximize the distance between the separating plane and the support vectors. If one feature is on a much bigger scale than another (e.g., x1 ranges from 0.3 to 0.6 while x2 ranges from 1,000 to 2,000), the feature with the large range will dominate the model and throw off the results. So, you need to scale (or normalize) your data so that all features fall, for example, between 0 and 1 (the most common scaling). Alternatively, you can standardize: transform the data so that each feature has a mean of 0 and a standard deviation of 1, as in a standard normal distribution. So, when do you use each one? You use scaling (normalizing) when you're working with data in a bounded range, like:

● Batting average (.000-1.000)
● SAT scores (200-800)

On the other hand, you use standardization with certain models like:

● Principal component analysis (PCA)
● Clustering

Sometimes, you just have to try both to see which one works better.
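As a quick illustration (the numbers are made up; scikit-learn's MinMaxScaler and StandardScaler do the same job), here is how scaling and standardizing look in plain numpy:

```python
# Minimal sketch: scaling to [0, 1] vs. standardizing to mean 0 / sd 1.
import numpy as np

x = np.array([1000.0, 1200.0, 1500.0, 1800.0, 2000.0])

# Scaling (min-max normalization) to the 0-1 range
scaled = (x - x.min()) / (x.max() - x.min())

# Standardizing: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

print(scaled)        # values between 0 and 1
print(standardized)  # mean ~0, standard deviation ~1
```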
k-Nearest Neighbor (KNN)

The k-nearest neighbor (KNN) algorithm is another algorithm used to classify data. Rather than using a line to separate the data into classes, KNN classifies a data point by looking at its "nearest neighbors": the point is assigned to whichever class is most common among the k points closest to it. The number of neighbors we use is referred to as k. There's no set value of k to use (it could be 5, 7, etc.); it comes down to testing and validating to see what returns the best results. You can also use this model to classify data into more than two classes. There are three important things to note about KNN that will affect your analysis:

● There's more than one way to measure distance (straight-line distance is the most common, but there are others as well)
● Some attributes might be more important than others in classification
● Unimportant attributes can be removed

You nail these choices down in the validation stage.
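Here is a minimal KNN sketch using scikit-learn's KNeighborsClassifier; the six points and the choice of k = 3 are arbitrary examples, not anything prescribed by the course.

```python
# Minimal sketch of k-nearest-neighbor classification with scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.2], [1.5, 1.8], [1.1, 0.9],   # class 0
              [5.0, 5.5], [5.2, 4.8], [4.7, 5.1]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 neighbors
knn.fit(X, y)

# A new point is classified by majority vote among its 3 nearest neighbors.
print(knn.predict([[4.5, 4.9]]))  # -> [1]
```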
To sum up week 1: we can use analytics to make sense of the world around us, and one common way analysts do this is with classification models. Support vector machines and k-nearest neighbors are two of the most common classification models, and we can use them to classify everything from whether an email is spam to whether a foreign object is explosive. They will be highly useful models for you moving forward.

Week 2

This week, we cover several concepts. We start with model validation, then work through validation and test sets, splitting the data, cross-validation, clustering, and the difference between supervised and unsupervised machine learning models. Let's dive in.

Model Validation

When you take your model to whoever is overseeing your project (or whoever you're trying to convince of some hypothesis), the first thing they'll likely ask is, "How good is your model?" In other words, how accurate is it? You answer that question by validating your model (validation is determining how well your model performs). This could be:

● How well it predicts who will win a tournament
● How well it predicts spam
● How well it predicts a successful application

...or any other problem we're trying to solve with a model. When validating a model, it's important to know that our data contains two types of patterns:

● Real effects: real relationships between attributes and the response
● Random effects: patterns that are random, but look like real effects

If we fit a model to our training set and measure it on that same set, we're rewarding it for fitting both the real effects and the random effects. To solve this problem, we need to measure the model on different data, because only the real effects will carry over (the random effects in the new data will be different). For example, if I poll a small group of people at a bar about their eye color, whatever pattern I find may just be random noise specific to that group. Or, if I took a set of NBA players and built a height-prediction model from their attributes, it might predict their heights well, but if I then applied that model to the general population, it likely wouldn't be as accurate, because NBA players are outliers in terms of size. That's why we can't measure a model's effectiveness on the data it was trained on. Fortunately, there's a solution.

Validation and Test Sets

To get around this problem, we need two sets of data: a larger set to fit the model, and a second set to measure the model's effectiveness. That's why we split the dataset into a training set and a validation set. To further weed out any randomness in the chosen model, we hold out a third dataset called the test set. In short:

1. The training set is for building the models
2. The validation set is for picking the best model
3. The test set is for estimating the performance of the model we picked

Next, we'll discuss how to split the data.

Splitting the Data

While there are no hard rules for how big each dataset should be, there are some guidelines. When working with one model (so you only need a training and test set), the rule of thumb is:

● 70-90% training
● 10-30% test

When comparing models (so you need training, validation, and test sets), the rule of thumb is a little different:

● 50-70% training
● Split the rest evenly between the validation and test sets

In terms of how to actually split up the data, there are different methods. Assume we have 1,000 data points and want a 60% training set, a 20% validation set, and a 20% test set. The first method is the random method (a code sketch follows this section):

● Randomly choose 600 data points for the training set
● Randomly choose 200 of the remaining data points for validation
● Use the remaining 200 data points for the test set

The second method is the rotation method:

● Take turns assigning points: training -> validation -> training -> test -> training, and so on

One advantage of the rotation method is that the data points are spread more evenly, so we don't clump similar data together. For example, with 20 years of data, the random method might put a disproportionately large share of years 1-5 into the training set. However, rotation can introduce its own bias. If you use a 5-point rotation on Monday-Friday data, all of the Monday data ends up in one set, all of the Tuesday data in another, and so on. A solution to that problem is to combine the two methods: randomly assign 50% of the Monday data to the training set, 50% of the Tuesday data, and so on.
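Here is one possible way to do the random 60/20/20 split described above, using scikit-learn's train_test_split on a made-up 1,000-point dataset (two passes: first peel off the test set, then split what's left):

```python
# Minimal sketch of a random 60/20/20 training/validation/test split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)              # 1,000 made-up data points, 5 attributes
y = np.random.randint(0, 2, size=1000)

# First peel off 20% for the test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# ...then take 25% of the remaining 80% (i.e., 20% of the original) for validation.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```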
Cross-Validation

One worry is that important data points may get left out of the training data purely by chance. Cross-validation is a way to work around that issue. There are several variations, but the most common is k-fold cross-validation. In k-fold cross-validation, you split the (non-test) data into k equal parts, or folds. Then, for each fold in turn, you train the model on the other k - 1 folds and evaluate it on the fold that was held out. After going through this process, every data point will have been used to train the model, so no data is left out. You can then average the k evaluations to estimate the model's quality. Although there's no standard number to use for k, 10 is common.
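Below is a minimal sketch of 10-fold cross-validation with scikit-learn's cross_val_score; the KNN model and random data are placeholders. The point is simply that you get one score per fold and average them.

```python
# Minimal sketch of 10-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 4)                 # made-up data
y = np.random.randint(0, 2, size=200)

model = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(model, X, y, cv=10)  # train/evaluate across 10 folds

print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of the model's quality
```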
Clustering

Clustering in analytics is similar to what we mean by it in everyday life: taking a set of data points and dividing them into groups so that each group contains points that are similar to each other. One reason to use clustering is to segment a market so you can improve your messaging (e.g., email marketing). For example, one cluster of people might buy a course because it increases their income, while another cluster might buy it because they simply like to learn. Clustering lets you identify these groups and serve each one the appropriate message.

To cluster, we need a way to measure the distance between points. The p-norm distance between two points x and y with m coordinates, where p is the power in the equation, is

d(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xm - ym|^p)^(1/p)

Here p = 2 gives the usual straight-line (Euclidean) distance, p = 1 gives the rectilinear distance, and the infinity-norm distance is the biggest coordinate-wise difference between the two points.

One example of a clustering algorithm is k-means clustering. With k-means clustering, the goal is to partition n observations (however many data points you have) into k clusters (however many clusters you want to create), where each observation belongs to the cluster with the nearest mean. Picture, for example, n = 15 points grouped into k = 3 clusters. K-means is a "heuristic" algorithm, which means it may not always find the best possible solution, but it finds good clusterings, and it finds them relatively quickly. It's also an example of an expectation-maximization (EM) algorithm.
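Here is a small k-means sketch along the lines of the n = 15, k = 3 example, using scikit-learn's KMeans on made-up data:

```python
# Minimal sketch of k-means clustering with scikit-learn (15 points, k = 3).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 15 made-up points drawn around three different centers
X = np.vstack([rng.normal([0, 0], 0.5, size=(5, 2)),
               rng.normal([5, 5], 0.5, size=(5, 2)),
               rng.normal([0, 5], 0.5, size=(5, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each of the 15 points
print(kmeans.cluster_centers_)  # the mean of each cluster
```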
