Wednesday, February 23, 2011

sab-R-metrics: Basic Applied Regression (OLS)

Today, I'll again be using a new data set that can be found here at my website (called 'leagueoutcomes.csv'). The data set includes the standings results of the 2009 season for MLB along with average game attendance by team. I'll use this to go over some basic regression techniques and tools in R. Hopefully this tutorial will help those with some more statistical background. Those looking to use fun data to learn things like Logistic Regression, Probits, and Non-Parametric smoothing methods should use this to get acquainted with the R fitting procedure and come back later for those tutorials.

Before doing any analysis, it's always a good idea to look at the data and make sure what we're doing makes sense. For this, we can create histograms and summarize the data in order to look at the distribution (standard regression assumes normal data, but for our purposes I won't worry too much about this here). Let's load the data and check out some properties of it. All of the code should be familiar from previous sab-R-metrics installments:

##set working directory, load data, and snoop around

setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
league <- read.csv(file="leagueoutcomes.csv", h=T)

head(league)


summary(league)

hist(league$MLB.attend, col="steelblue", main="Game Attendance")

hist(league$MLB.win, col="pink", main="Win Percent")




plot(league$MLB.attend ~ league$MLB.win, main="Team Attendance and Winning", xlab="W%", ylab="Attendance Per Game", col="steelblue", pch=16)


text(league$MLB.win, league$MLB.attend, league$MLB.team, pos = 1, col="darkgrey")



As you can see in the scatter plot, the teams with better performance have more fans in attendance. Nothing surprising, with the Yankees at the top right and Pirates at the bottom-left. We can also see that the data tend to be a bit skewed for something like attendance (see the histograms), which is not surprising considering it is bounded at 0. A log transform may be appropriate here, but we'll ignore this for now to focus on the functionality of R as a regression tool. In addition, there is also some censoring going on. This is especially true in the NFL where sellouts are a regular occurrence. I won't be getting into much about assumptions behind regressions. There won't be any linear algebra or calculus here and I assume that those reading this have some idea of what regression is and when to use it. Of course, I'll throw in some assumptions of my own (and some "watch outs" like always). Again, I will ignore this for now, and perhaps I'll visit censored and truncated regression on another day (which I find easier to fit in STATA anyway).

Let's begin with a basic regression using a single predictor (OLS). In R, we can run a regression using the command "lm()", which of course stands for "linear model". Just like when plotting a Y variable against an X variable, we write an equation within this function. I'm interested in the relationship between winning and attendance per game. So, first let's calculate a correlation, then use those variables in a regression of Attendance on Win Percent for MLB.

##correlation and regression of attendance and winning

cor(league$MLB.attend, league$MLB.win)


fit.mlb <- lm(league$MLB.attend ~ league$MLB.win)

summary(fit.mlb)

As you can see, winning seems to be a significant predictor of attendance in MLB...no surprise there. The correlation is about 0.68, while the regression coefficient suggests that going from a win percent of 0 to 1.000 would result in about 83,436 additional fans. Of course, the intercept doesn't make much sense in our example (and neither does 83,000 fans with most stadiums having a max capacity around 40,000 fans).

This is a result of going beyond the data. No team had either a 0.000 W% or attendance of 0. Extrapolating beyond the data is bad juju for the most part. When there really are sellouts, I'd suggest toying around with a Tobit model, too. You can always force a regression to go through the origin or transform the data in some way, but there are some issues to deal with there.

So that the regression coefficient makes a little more sense, let's put Win Percent on a scale of 0 to 1000, rather than 0 to 1. Just do a quick transform of your data and rerun your regression with the transformed predictor variable:

##transform win percent

league$MLB.win.new <- league$MLB.win*1000

fit.mlb.2 <- lm(league$MLB.attend ~ league$MLB.win.new)
summary(fit.mlb.2)


Now, each point in win percent (0.001) is associated with an increase of about 83 fans per game. Let's go a little further with interpretation. We know there are 162 games in a season. Therefore, each win is worth about 0.006 in win percent. Therefore, it seems that on average, an MLB team can expect about 512 extra fans per game for each win.

But be careful! This does not mean that winning causes attendance to increase. In the long run, it very well may be the case that attendance and demand for winning results in these teams having a higher win percent. I believe I talked about this last time, but the relationship gets murky. If you're interested in academic work trying to tease out the causal effect, check out the VAR Analysis by Davis in IJSF (2008). But that's just the start of it. There are plenty of omitted variables in our simple regression--not to mention aggregating across cities with very different demographics--so making any strong inferences is not a very good idea. And don't forget that things change over time. This analysis is only from the 2009 season.

Okay...enough with the standard "yeah, buts". They could go on forever.

One of the nice features of R is that each of the fitted models can be saved as an object. Notice that I assigned my regression to the name "fit.mlb" and "fit.mlb.2". To summarize (i.e. give us a regression table like you would see in SPSS), we just use the "summary()" function with the object name. You can also use "print()" or the function "display()", which Gelman and Hill recommend in their Multilevel Models book (much recommended for a surface level review of regression in the early chapters and of course a great applied text for multilevel modeling). Really, it's up to you and what you think provides you with the information you want. We can also use this object-oriented language to our advantage when we want to plot our regression. But first, let's predict our attendance data.

Sometimes, we want to use regression to not only estimate a coefficient on some predictor variable, but also to predict future or unobserved data. Now, forecasting with regression is an entirely separate topic (talk about lots of "yeah buts"), but it's usually instructive to get our predicted values for the sake of understanding the error distribution from our standard OLS model. To do this, we use the appropriately named "predict()" function, which can be used for a number of other models (such as classification trees) built with R. To easily understand which prediction is which, appending them to the end of the data set can be useful. Also, we can use this to calculate errors simply by subtracting the actual value from the predicted (or the predicted from the actual value). For those that like shortcuts, you can just use the "resid()" function, where the model object is within parentheses.

I'll just use the original "fit.mlb" for this so that the plots are more intuitive later on (on the scale we usually think of Win % in). Lastly, I'll show some code to plot the errors to check for systematic patterns in our errors (which we do not want). If you're familiar with regression, you know that you want to see the errors in the last plot be generally random around 0 across the range of the x-values (W% here).

##predict and append fitted values and calculate errors

league$MLB.pred <- predict(fit.mlb, league, type="response")
league$MLB.err <- league$MLB.pred - league$MLB.attend

##or just simply use "resid()" function residuals.mlb <- resid(fit.mlb)

##plot errors to check for problems
plot(league$MLB.err ~ league$MLB.win, main="Error Distribution (MLB)", xlab="Team W%", ylab="Residual", pch=16, col="steelblue")

abline(h=0, lty="dashed", col="darkblue")


text(league$MLB.win, league$MLB.err, league$MLB.team, pos=1, col="darkgrey", cex=.7)


In the above plots, we can see that the Marlins prediction based on the team quality is way above the actual attendance (in other words, they had crappy attendance considering their relatively decent win percent), while the Mets are predicted well below their actual attendance (they have many more fans than they should, given their terrible quality). Of course, there are some omitted variables in this regression. Overall, there doesn't seem to be too much of a pattern in the residuals. The only problem I see is that the teams at the edge of the win percent distribution tend to be systematically under-predicted. It's really tough to tell a pattern with such a small data set, though. We'll assume everything is just dandy for now, despite the fact that it may not be.

There are a number of other diagnostic options in R for before and after regression, including Quantile-Quantile Plots (using "qqplot()"). I could go on forever about this stuff, but I'll leave it up to you to decide what is useful for you to learn, given your data and interests.

Now that we have fitted values and a regression model object for each league, let's plot our regression line in the scatterplot that we created early on for MLB. This is quite easy using the functions that we've already learned, and for adding a regression line, all that is needed is the "abline()" function. With multiple regression, this won't necessarily work, though, since we have to tell R which predictor variable to plot the line for. I'll cover multiple regression next time. Below I have the scatter plot with the added regression line, along with some simple coding:

###adding regression line to scatter plot

plot(league$MLB.attend ~ league$MLB.win, main="MLB Attendance and Winning",
xlab="W%", ylab="Attendance Per Game", col="steelblue", pch=16, cex=1.4, xlim=c(.35, .65), ylim=c(15000, 50000))

text(league$MLB.win, league$MLB.attend, league$MLB.team, pos=1, col="darkgrey", cex=.7)


abline(fit.mlb, lty="solid", col="black")



******SIDE TRACK******

Before I get to plotting the regression line, I'm going to side track a bit and talk about installing packages. The R network has a package for just about anything you want to do. Thus far, we have only used the base version of R, but there are plenty of free add-ons. For our plots, I want you to go ahead and download a package that will easily plot prediction intervals from your regression.

To begin, go up to the menu at the top of your R window and click:

Packages ==> Install Package(s)

From there, choose a location near you and click OK. Next, you'll be prompted with a big list of packages. Usually, you want to know what each one is before downloading, and you can find this at R's main website. However, just go ahead and download the "HH" package.

R should automatically find a place for this package on your computer so that it can source it later on. In your R script, type:

library(HH)

This will have the package ready to go in R. Be sure to use this command whenever you want to use a function not in the base package for R. Now, let's put this to use.

******END SIDE TRACK******

Here, we'll use the "ci.plot()" function, which is relatively straight forward. Use the "help()" function for any added flexibility you desire. I just discovered this function, and it's an easy way to plot confidence intervals and prediction intervals for your data. If you want flexibility in your plot, you'll want to do this manually (again, I'd recommend Gelmand and Hill's book for learning the basics of calculating these intervals yourself). I'm not going to provide the code here for doing it manually, as the function below works relatively well for quick plotting purposes. In general, I don't like the formatting of the legend, etc. However, it's so quick and easy!

##using ci.plot for regression line plot with 95% confidence and prediction intervals

library(HH)


ci.plot(fit.mlb, xlab="Win %", ylab="Game Attendance", main="ci.plot() Version of OLS Model Plot",
col=c("steelblue"), pch=16, cex=1, xlim=c(.35,.65), ylim=c(15000,50000))



You can see that there is more uncertainty near the edges of the range of our X-values. That is expected, as there are less data at the extremes here. Notice the upward trend line as well. This confirms our intuition, our correlational analysis and our simple regression model coefficient for win percent: Winning and Attendance are positively related.

The "lm()" function works for multiple regression as well. Unfortunately, plots become much more complicated. For this, I'll introduce a scatterplot matrix in my next post, along with a multiple regression that includes an interaction terms and a factor variable.

I'll cover logistic and probit regression in R using "glm()" following the multiple regression example and then finally get to loess smoothing using the "loess()" function and its variants like "gam()" (all very handy tools in sabermetric analysis). For those unfamiliar with the basics of statistics and regression analysis, check out the many tools online for free or R books available at Amazon and other places. I'll again assume that those reading this have some idea of why you would use these tools.

As usual, I have my pretty R code below:

Pretty-R is offline right now. I will post it here when it's back up and running.

3 comments:

  1. Can I just say on behalf of Marlin fans that it's just too hot in Miami to go to these games. Maybe in 2012 when we get the new stadium.:-)
    If i get you average temp and humidity levels by team will you update your regression?

    ReplyDelete
  2. Lol. I believe it. I don't even like going to mid-summer games at Comerica that much. I think adding that in for multiple regression would be fun.

    ReplyDelete
  3. Muy buenos tus aportes, los repetire en español en mi blog.

    ReplyDelete