It's been a while since my last sab-R-metrics post, and I have not gotten to the real fun stuff yet. I apologize for the long layoff, and it's likely that these will be sparse for the next couple weeks. I have had some consulting opportunities come up, I've got 6 (possibly 7) presentations or co-authored presentations coming up this summer, I had to finish up a dissertation proposal and final exam, and apparently I have to start thinking about getting on the job market. Yeesh!

Today, I'll continue with logistic regression, but first be sure to check out the last few posts on Simple OLS, Multiple Regression, and Scatterplot Matrices. We'll be using some of the lessons from these posts for today's tutorial.

Logistic regression is a generalization of linear regression that allows us to model probabilities of binomial events. Normally in OLS we want to have a dependent variable that is continuous and--optimally--normally distributed. Without getting too in-depth with continuous vs. discrete measures (technically anything we measure is discrete once we measure it and round it even to the umptillionth decimal), it's pretty easy to tell that a dependent variable with 0 representing a "No" and a 1 representing a "Yes" response is not continuous or normally distributed. In this sort of application, we are often interested in probability and how certain variables affect that probability of the Yes or No answer. Since probability must be between 0 and 1, we need to bound our data at these limits. Otherwise you'll get probability predictions above 1 and below 0.

For the most part, using OLS as a quick and dirty way to estimate a binomial variable is reasonable. And for observations that are not near 0 or 1, it can work just fine. But with the available computer power that comes free in R, there is little reason not to fit a Generalized Linear Model to the binomial response (unless of course you have a very large data set). So today, we'll again use the data "hallhitters2.csv" which you can download here. What is more fun than trying to predict Hall of Fame induction? We'll keep things simple today, but it is important to note that this isn't any sort of new idea. Cyril Morong, among others, have used logistic regression for Hall of Fame prediction in the past.

For this exercise, I am going to assume that BBWAA writers only pay attention to traditional statistics. That means we'll only include Home Runs, Runs, Runs Batted In, Hits, Batting Average and Stolen Bases in our regression. Obviously this will be limited, and doesn't account for fielding, 'integrity' or whatever else the BBWAA claims to take into account. But it should be plenty good enough to get the point across.

Let's jump right in. Go ahead and load up the data. I'm going to name mine 'hall':

##set working directory and load data

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

hall <- read.csv(file="hallhitters2.csv", h=T)

head(hall)

Alright, let's first run a standard OLS model on the data, followed by the logistic regression to compare. I have included a variable called "BBWAA" in the data so that we can use this as our response variable. If a player is voted into the Hall by the BBWAA, they are given a '1', otherwise the variable will be '0'. For logistic regression, we'll use a new R function called "glm", which of course stands for Generalized Linear Models. This function can be used for a number of linear model generalizations and link functions (such as Poisson regression or using the Probit link function for binomial data). We'll stick with the binomial version and logistic regression here.

##start with an OLS version and predict hall of fame probability

fit.ols <- lm(BBWAA ~ H + HR + RBI + SB + R + BA, data=hall)

summary(fit.ols)

##fit logistic regression

fit.logit <- glm(BBWAA ~ H + HR + RBI + SB + R + BA, data=hall, family=binomial(link="logit"))

summary(fit.logit)

Notice the slight difference in coding, as we have to tell R that the GLM we want to use is for binomial data with a 'logit link function'. We'll revisit the OLS model later on, but for now let's focus on the logistic regression coefficients. These are a bit different from the coefficients of our linear models from before. Just reading the coefficients off the table as the increase in probability of induction won't work. The coefficients must be transformed. The logistic regression uses a transformation of the predictors, 'x', in order to model the probability on a 0 to 1 scale. The transformation used here is:

InverseLogit(x) = e^x/(1+e^x)

or

Logit(y) = BXi

Here, B are the coefficients, while Xi represents our predictor variables. So when we look at the coefficients, we have to do something with them in order to make sense out of what we're seeing in the regression output. To get the effect, we can use the following transformation, including the intercept from the regression. Let's check out the association of Home Runs and the probability of induction into the Hall of Fame in as simple a way as possible:

##exponentiate HR coefficient to get probability of induction increase per HR

exp(0.002153)

Which comes out to about 0.215% increase in induction odds per home run. So, for a coefficient like this, we pretty much can read it directly off the table for those players around the average home run total. However, it is important to understand this isn't a coefficient like in OLS. If we simply multiply this change by Hank Aaron's 755 Home Runs, we get a probability of induction for him at about 162%. Remember that in a model like this, the influence of the HR total decreases as the probability of the player reaching the Hall gets closer and closer to 1. Otherwise, we'd end up with probabilities above 1 and below 0. For a good primer on interpreting coefficients from a logistic regression model, check out this site through the UCLA stats department.

Now, it's interesting to look at coefficients, but because of the collinearity issues, they may not be all that useful for interpretation. Multicollinearity is the likely culprit for the coefficients on Hits being negative (and recall our scatter plot matrix). It would probably be optimal to create some sort of orthogonal set of predictors with Principal Components Analysis (PCA) or something of the sort (which I plan to get to in a later sab-R-metrics post). But for now, we'll just use the logistic regression for prediction.

However, another valuable part of logistic regression is its prediction of a probability of a 'success' (here, induction into the Hall of Fame by the BBWAA). Because the coefficients are in log-odds form, we need to use "response" when predicting to get actual probabilities (0 to 1). Go ahead and predict the probability of induction using both the OLS and the Logistic model and attach this to the end of your data set with the following code:

##predict values and append to data

hall$ols.pred <- predict(fit.ols, hall, type="response")

##predict the probability of induction using the logit model

hall$log.pred <- predict(fit.logit, hall, type="response")

head(hall)

From here, it can be easier to look at the output if we create a new table in Excel using this data. Of course, you could copy and paste into Excel if you want, but there's a nice easy option in R to create a table. Keeping with the CSV format I like to work with, we can create a new table in our working directory from the newly updated data set with Hall of Fame probabilities with the following code:

##create table for looking at in Excel

write.table(hall, file="hallhittersPRED.csv", row.names=F, sep=",")

This rather convenient function comes in handy for me very often. In the function above, you tell R that you want to write a new file using the object "hall". Then you name the file. Here, be sure to put the .csv extension on the end. You could also do .txt, and there are of course other options. I use the 'row.names=F' to tell R that I don't need the row names in the new file, but there are cases when this makes sense to allow (i.e. when you are writing a table from the output of the 'tapply()' function). Finally, we tell R what to use as a separator with "sep=". Be sure to put this in quotes. You can use anything you want, but the safest bet is to use commas (assuming your vectors don't have commas in them, but they should have double quotes protecting the fields anyway). I usually use write.table so I have full control over the file; however, you can also use the simple code below and get the same result:

###using write.csv now

write.csv(hall, file="hallhittersPRED.csv", row.names=F)

Go to your working directory and open up the CSV file in Excel. Now, inspect the predicted probabilities of the logit model first. They'll be nice and easy to sort in Excel.

The logit probabilities seem perfectly reasonable, with the likes of Rickey Henderson, Cal Ripken and Tony Gwynn about as near sure candidates as you can get. On the other hand, we see that Kirby Puckett doesn't get as high of a probability. We know why though: Puckett's career was cut short. With our knowledge of baseball at hand, these probabilities mostly seem to make sense.

However, looking at the OLS model, there are some strange things going on. The model doesn't particularly like Ripken or Gwynn. If you go ahead and sort your data by the predicted OLS probabilities, you'll see that there are a number of negative induction probabilities for guys like Greg Pryor and Paul Casanova. We can just round these to zero, but optimally the logit model gives us a better look at what is going on in the data.

As before, it's sometimes useful to look at a plot of our data to see how our predictions fit the true outcomes. However, since we will be comparing 0-1 binomial data to a continuous probability prediction, we'll have to employ some new things in R for this visualization. Luckily, I recently found a very handy function in the "popbio" library for this. The first thing you'll need to do is install this package (I went over how to do this before). Next, use the following code to create a plot with the "logi.hist.plot" function:

##create plot of logit predictions

logi.hist.plot(hall$log.pred, hall$BBWAA, boxp=F, type="hist", col="gray", rug=T, xlab="Hall of Fame Status", mainlabel="Plot of Logistic Regression for BBWAA Inductions")

##add the names of mis-classified HOFers

low.prob <- subset(hall, hall$log.pred < .20 & hall$BBWAA==1)

text(low.prob$BBWAA, low.prob$log.pred, low.prob$last, col="darkred", cex=.8)

As you can see, the second portion of the code adds the names of the Hall of Famers who were predicted to have a very low probability of induction. You can see that the probability of making the hall of fame increases as we get closer to the "1" classification. In the middle, the probability of induction changes at the highest rate, while it tails off at each end in order to bound it between 0 and 1.

We can inspect these and for the most part understand why our model misses these guys. Jackie Robinson and Roy Campanella had shorter careers due to segregation and catastrophic injury, respectively. We didn't include fielding data in our model, so it's no surprise that Ozzie Smith was predicted so low. Guys like Kiner, Bordreau, Carter and Aparicio are less clear. But perhaps these were just inconsistencies in the voting by BBWAA members. Many would argue that these players don't really belong in the HOF anyway.

Now that we have this nice looking picture, how do we really evaluate the usefulness of our model? The most common way to evaluate something predicting binomial data is an ROC curve (and it's AUC-"Area Under Curve"). This benchmarks the performance of your model in predicting the correct class (i.e. 0 or 1). If you perfectly predict, the area under the curve will be 1. If you randomly chose 0 or 1, then you're looking at 0.5 (i.e. a 45 degree line in the plot seen below). Usually, we'll see some sort of curve between 0.5 and 1, but the goal is to get as close to 1 as possible (without over-fitting, of course).

For this, go ahead and install the package 'ROCR' onto your computer. We'll make use of this. Using the code below, you can create the ROC curve and also the AUC

##############ROC Curve

library(ROCR)

preds <- prediction(as.numeric(hall$log.pred), as.numeric(hall$BBWAA))

perf <- performance(preds, "tpr", "fpr")

perf2 <- performance(preds, "auc")

perf2

plot(perf, main="ROC Curve for Hall of Fame Predictions", lwd=2, col="darkgreen")

text(0.5, 0.5, "AUC = 0.9855472", cex=2.2, col="darkred")

The first portion of this code sets up the predictions and the true classes. The second line creates performance vectors using the "tpr" and "fpr" commands, which refer to "True Positive Rate" and "False Positive Rate", respectively. Then we plot them against one another and get the following:

Finally, the "perf2" object calculates the AUC for our model, which we find is a solid 0.986 or so. Of course, this is enhanced by the fact that there are so many easy decisions on non-Hall of Famers. If we included only "borderline candidates" for our AUC calculation, we would get a much lower number. Whether or not you believe it is appropriate to report such a high AUC produced by this fact is another question, but keep in mind that the 'accuracy' may be a bit misleading in this case. We don't need a predictive model to tell us that Joe Schmoe isn't going to be inducted into the Hall of Fame.

Finally, if we've decided that this model is accurate enough, we can apply this model to current players. For this, we'll use the file "currenthitters.csv" from my website here. For this, we want to be sure we have the same variable names for the same variables so that the new code knows what to call from the original logistic regression model. Here's the code below to load in the data, create the predictions, attach them to the data, and write a new table with the added predicted induction probabilities.

##get current hitter data and check variable names

current <- read.csv(file="currenthitters.csv", h=T)

head(current)

##predict probabilities

current$hall_prob <- predict(fit.logit, current, type="response")

head(current)

##write a new file with the induction probabilities

write.table(current, file="currentPRED.csv", row.names=F, sep=",")

Then go ahead and open up the file for easier inspection (or just snoop around in R if you want).

If you want to order them by induction probability in R directly, then use the following code:

###order by most likely to get into hall of fame

current <- current[order(-current$hall_prob),]

current[1:35,]

You can use the ordering code for any variable you want, and I use the "-" sign within the parentheses in order to tell R that I want things in descending order. The default is ascending order. The second line of code just calls the first 35 rows of the data.

You can see that Alex Rodriguez is the most likely in the sample of active players, with Ken Griffey, Jr. just behind him. Most of these make sense. Some are likely not right, like Luis Gonzalez and Bobby Abreu. Others, we know will have problems getting into the Hall based on steroid accusations, which we did not include in our model.

Some players are surprises here, but this is likely because they just don't have enough at bats yet to hit the milestones for the Hall of Fame. Keep in mind that the data are only through 2009 that I've provided, so Ichiro has less than 2,000 hits. Perhaps if we modeled each HOF players' career at each given point in time, we could use this to predict guys that haven't been around very long. But that's quite a long and complicated process, and the current model is as far as I'll take today's lesson. Finally, keep in mind that we are predicting induction probabilities based on past BBWAA inductions, NOT whether or not the player deserves to be in the Hall of Fame. This is a discussion left for another day.

So there you have it: predicting Hall of Fame induction and logistic regression in R. Remember that the 'glm()' function can be extended to other models like Probit and Poisson. Below is the code from Pretty R as usual.

CODE: (HTML IS BEING FUNKY, I WILL TRY TO FIX THIS ISSUE SOON!)

Interesting. I have not read it all very carefully, but someone pointed out to me that 3000 hits pretty much gives us a probability of 1. That might mess up this kind of analysis. Exactly how, I am not sure

ReplyDeleteHey Cy,

ReplyDeleteBecause I don't have any sort of 'benchmark' indicators, the regression just uses an increment in probability for each hit. Ripken, Henderson and Gwynn are at something like 99.98%. This certainly isn't as extensive as the model(s) you have put forth in the past.

Most likely, the probability jump at 3,000 hits is more of a discontinuity (i.e. you go from No to Yes instantly once you hit #3,000). I imagine that's where things change a bit here. I don't account for that perception of milestones.

The high correlation between batting average and hits (and many other things) pretty much throws the coefficients out of whack. But I felt this to be a fun example for logistic regression.

Finally. I thought u had given up on this.

ReplyDeleteOkay, thanks. I understand it better now.

ReplyDeleteSorry JAIME. I've been meaning to get back to this, but wanted to make sure I was all in for posts from here on. They're a bit more involved and I don't want to put anything up here that is completely incorrect.

ReplyDeleteReally did not want to lay off this long, but after working 9 to 11 hours a day with sports and statistics, it was tough to get on and write about R. Things will come (hopefully) once a week for the summer months.

Fantasticly useful. Please do post a follow-up on PCA soon!

ReplyDelete