Wednesday, May 11, 2011

sab-R-metrics: Basics of LOESS Regression

Last week, I left you off at logistic regression. This week, I'll be pushing the limits of regression analysis a bit more with a smoothing technique called LOESS regression. There are a number of smoothing methods that can be used, such as Smoothing Splines or simple Local Linear Regression; however, I'm going to cover LOESS (loess) here because it is very flexible and easy to implement in R. Remember that here, I'm not going to cover too much of the quantitative portion of the methods. That means that if you plan on using loess in your own work, you should probably read up on what it is actually doing. I'll begin with a brief, non-mathematical description.

When we ran regressions using OLS procedures, there is an assumption that the relationship between the X and Y variable is monotonic and constant across the domain and range of each variable (i.e. that as X increases, Y also increases--at the same rate for all X and Y). Of course in the real world, this is not always the case. You can use polynomials in linear regression to address the issue, but sometimes other methods may be necessary. This is where smoothing comes in.

Using smoothers, there is no restriction on the functional form between X and Y with respect to intensity of the relationship, or direction (positive or negative). Of course, this means our fits are a bit more computationally intensive. And if not careful, it is very easy to overfit the data by trying to include every wiggle we see. But if done properly, one may be able to glean some extra information from the data by using a smoother instead of a restrictive linear model.

So what is loess? Well, as I said there are a number of smoothers out there. The advantage of loess (with its predecessor 'LOWESS') is that it allows a bit more flexibility than some other smoothers. The name 'loess' stands for Locally Weighted Least Squares Regression. So, it uses more local data to estimate our Y variable. But it is also known as a variable bandwidth smoother, in that it uses a 'nearest neighbors' method to smooth. If you are interested in the guts of LOESS, a Google search should do you just fine.

As usual, there is a nice easy function for loess in R. The first thing you'll need to do is download a new data set from my site called "guthrie.csv". This is a Pitch F/X data set from Joe Lefkowitz's site including all pitches by Jeremy Guthrie from 2008 through 2011 (as of May 11, 2011). If you are a Baseball Prospectus reader and ran across Mike Fast's most recent article, you'll understand why I think this is a nice data set for implementing loess...or at least you will by the end of this tutorial.

Once you have the data, go ahead and set your working directory and load it in. I'm naming my initial version of the data "guth":

####set working directory and load data

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

guth <- read.csv(file="guthrie.csv", h=T) head(guth)

Now because I plan on working with pitch velocity data today, I want to make sure we're including pitches of a certain type. For this reason, I want to go ahead and subset the data into only fastball variants thrown by Guthrie over this time period. That way, the data aren't contaminated with change-ups and curveballs of lower velocity. We want to look specifically at arm strength. The following code should subset the data correctly. Remember that the "|" means "OR" in R.

##subset to just fastballs and fastball variants

guthfast <- subset(guth, guth$pitch_type=="FA" | guth$pitch_type=="FF" | guth$pitch_type=="FC" | guth$pitch_type=="FT" | guth$pitch_type=="SI")

Now, because we want this to be an ordered series of pitches across time, we'll have to create a new variable to represent this sequence. For this, we'll make use of a new function that comes very much in handy when working with and visualizing smoothing analyses. It is called "seq()", and it creates a sequence of numbers. The first command below "from=" indicates the starting point of your sequence, and "to=" represents the endpoint. Finally, "by=" tells R the space between your points. You can put any number into these that you want. The smaller the "by=", the more points you will have. I'm going to keep it simple and use "by=1", so that we'll have a count of the pitches.

##create a time sequence for the pitches

guthfast$pitch_num <- seq(from=1, to=length(guthfast[,1]), by=1)

head(data)

You'll notice in the "seq()" function above, I tell R to count up to the length of the dataset. By typing "length(guthfast[,1])" I am indicating that I want the number of rows (i.e. the 'length' of the first column in our data set). This way, if we count by 1, we know that every pitch will have a sequential integer-valued number in the Pitch Number variable we appended onto the data set.

Now that we have this set up, let's take a quick look at Guthrie's pitch velocity over time. The code below should plot each pitch's speed as a function of the pitch number:

##plot all fastball pitch velocity by the pitch number variable

plot(guthfast$start_speed ~ guthfast$pitch_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Fastball Count (2008-2011)", main="Pitch Speed by Sequential Fastball")



Looking at the plot above we can see some semblance of a pattern, but the data seem to be too noisy to really see what is going on. Perhaps if we use the average for each game we'll be able to see something more useful. For this, we'll make use of the "tapply()" function again. If you don't remember this function, head back to the earlier sab-R-metrics posts and check it out. This function allows us to quickly take the average fastball velocity by game using the following code:

##get mean fastball velocity by game

aggdat <- tapply(guthfast$start_speed, guthfast$gid, mean)

aggdat <- as.data.frame(aggdat) head(aggdat)

This function spits out a vector with the game id's as the row names. However, we convert it to a data frame using the function "as.data.frame()" so that we can use our standard object calls and variable names. Unfortunately, you'll see that we don't have the right variable name. It's just called "aggdat" for the average velocity for each game. We can use an easy function in R to fix this up. But first, let's append a count of the game numbers just like we did with the pitch numbers so that we have them sequentially over time:

##create game numbers

aggdat$game_num <- seq(from=1, to=length(aggdat[,1]), 1)


##change column names

colnames(aggdat) <- c("start_speed", "game_num") head(aggdat)

As you can see, our column names are what they should be now. Using this new data set, let's again plot the fastball velocity over time. Note that we reduced the data to only 101 data points (games):

###plot average velocity by game across time

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Game Number (2008-2011)", main="Pitch Speed by Sequential Game")


Here, we see that there are many fewer data points than before. But it's still a bit tough to understand any pattern going on. We could start with an OLS model to fit the data linearly. Using the code below, you should get output that tells you Guthrie's fastball velocity is decreasing over time and that this is significant at the 1% level. But is that really the case? Once the model is fitted, go ahead and plot the regression line using the second bit of code. We'll save the standard errors from the model for later on.


##fit a linear model to the data

fit.ols <- lm(aggdat$start_speed ~ aggdat$game_num) summary(fit.ols)

pred.ols <- predict(fit.ols, aggdat, se=TRUE)


##plot the regression line on the data

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Game Number (2008-2011)", main="Pitch Speed by Sequential Game")

lines(pred.ols$fit, lty="solid", col="darkgreen", lwd=3)


We see the negative slope in the plot above, but do you think this is the best representation of the data? Using a loess regression, we may be able to improve on this. Unfortunately, the drawback from the loess is that there isn't really a clean functional form like we get from OLS. That means no real 'coefficients' in a nice Y = mX + b form that we learned in algebra class. For the most part, the best way to use loess is to look at it.

So, to fit a loess regression, we'll go ahead and stick with the game average data for now. Using the "loess()" function (doesn't get much easier than that!), we can apply the new analytical tool to our data. Let's try some basic code first. Below, I estimate the loess using a default smoothing parameter and then predict values and plot it just like with the OLS model. Only this time, it looks a bit different.

##loess default estimation

fitd <- loess(aggdat$start_speed ~ aggdat$game_num)

my.count <- seq(from=1, to=101, by=1)

predd <- predict(fitd, my.count, se=TRUE) plot(aggdat$start_speed ~ aggdat$game_num, pch=16, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Pitch Count", main="Pitch Speed by Pitch Count")

lines(predd$fit, lty="solid", col="darkred", lwd=3)


In the code above, I kept things pretty basic. Usually, we want to use the argument "span=" in order to tell R how much smoothing we want. The larger the span, the more points that are included in the weighted estimation, and the smoother the plot will look. Something to be careful with here, though, is making the span too small. We don't want to over fit the data, but want it to give us some idea of the pattern that we see. If we fit every point on its own, we may as well just look at a scatterplot. On the other hand, if we smooth too much, we may as well just estimate an OLS regression. The idea is to find a balance between the two using the smoothing parameter.

You can also identify a polynomial, which allows for more 'wigglyness' in your loess. For our purposes, I'm not going to bother with this. However, if you are fitting something that you believe needs some serious wiggles, go ahead and fiddle around with the "degree=" argument in the "loess()" function. For the most part, I would not recommend going over 3 for the polynomial as you'll likely be bordering on over-fitting--but some data might well need further polynomials. The default in R for this function is "degree=2", and you can change it to "degree=1" if you like and you'll see your wigglyness--for the same span--decrease a bit. It all depends on your data. For the purposes of this post, we'll just 'eyeball it'. However, there are other ways to optimize smoothing parameters in loess and other smoothing (and density estimation--see next week) methods. These include "rules of thumb", cross-validation methods, and so on. The default span in this function in R is 0.75, but it should really depend on your data.

Now go ahead and add a parameter in your loess code. For this, just include the additional argument "span=" within the "loess()" function. Play around with it and see what happens. I'd say just work with values between 0.1 and 1.0. The code might look something like this:

##fiddling with the span

fit3 <- loess(aggdat$start_speed ~ aggdat$game_num, span=0.3)

pred3 <- predict(fit3, my.count, se=TRUE) plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Sequential Game (2008-2011)", main="Pitch Speed by Pitch Count (Span=0.3)")

lines(pred3$fit, lty="solid", col="darkred", lwd=3)


Below I've embedded a quick video that shows how the loess line changes when we increase the span by 0.1 each time. I won't provide the code for the movie (it's just a repeat of the same code over and over, and I made the movie in Windows Movie Maker with still images). In the future, I'll be sure to get into 'for loops' to generate multiple versions of the same plot while incrementally changing a given parameter, but that's advanced for this post. Watch through the video and try to pick out what you think is the best representation of the data that does not over fit (over-wiggle).



You might be saying to yourself, "Well, it looks like the span of 0.6 smooths the best." If you said that, then I'd agree. Of course, if you disagree, that doesn't make you wrong. The default looks pretty good as well (with a span of 0.75). Remember we want to get to a balance of fit and smooth and we're just eyeballing it. While it's somewhat subjective in this case, I imagine that we would all come somewhere near a consensus on the range of acceptable smoothing.

You may also be saying to yourself, "How do I get those cool intervals?" If that's the case, then you're in luck. When we used the 'predict()' function earlier, I made sure to tell R to keep the standard errors from the loess. This allows us to plot a standard 95% Interval on our plot. We'll have to make use of some new functions here. First, let's create two sequenced data sets for our predictions and interval construction:

##interval construction stuff

my.count <- seq(from=1, to=101, by=1)
my.count.rev <- order(my.count, decreasing=TRUE)

The second line simply reverses the order of the first. So each of these are vectors of the x-variable (game number) in increasing and decreasing order, respectively. From here, we can go ahead and re-plot our span=0.6 version of the loess and add dashed lines for the confidence intervals using some basic math. After this, we'll add some fill color and apply what we learned about the RGB color scheme in the Bubble Plots tutorial.

##fit the span=0.6 model

fit6 <- loess(aggdat$start_speed ~ aggdat$game_num, span=0.6)

pred6 <- predict(fit6, my.count, se=TRUE)


##now plot it


plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Sequential Game (2008-2011)", main="Pitch Speed by Pitch Count (Span=0.6)")

lines(pred6$fit, lty="solid", col="darkred", lwd=3)



##now add the confidence interval lines


lines(pred6$fit-1.96*pred6$se.fit, lty="dashed", col="blue", lwd=1)
lines(pred6$fit+1.96*pred6$se.fit, lty="dashed", col="blue", lwd=1)



You can see here that we use the 1.96 as an approximation of the 95% interval. In the plot above, we see the interval represented by the blue dashed lines. However, I really like the filled interval look. For this, we'll need to use the "polygon()" function and the code below.

The first portion of the code below tells R that we want to create an outline of a polygon on the y-axis with the confidence bounds at each point along the two vectors we created above. The second line just recreates the x-axis, but in increasing then decreasing form to get full coverage of the shape. Finally, we create a shape using the confidence bounds and fill it with a transparent color (otherwise it will cover up everything). The #00009933 indicates that we want it completely blue (99 in digits 5 and 6), with some transparency (33 in digits 7 and 8). As long as your plot is still up, this code will simply add

###create polygon bounds

y.polygon.6 <- c((pred6$fit+1.96*pred6$se.fit)[my.count], (pred6$fit-1.96*pred6$se.fit)[my.count.rev])

x.polygon <- c(my.count, my.count.rev)


##add this to the plot

polygon(x.polygon, y.polygon.6, col="#00009933", border=NA)

Notice how the interval bands flare out at the end. This is because there is less data at the endpoints (i.e. beyond the enpoints) of the data. Therefore, there is less certainty about the prediction here. This is a common problem in any regression (including linear regression), but are exacerbated in smoothing because of the ability for a single point at the edge of the distribution having too much influence on the direction of the endpoints of the smoothed line. Just something to be aware of.

Lastly, if you're feeling lazy, you can always just use the "scatter.smooth()" function. This will automatically plot your loess, and takes the same arguments as "loess()". However, here you simultaneously provide the plotting parameters. See the code and plot below:

###show how scatter smooth just does the plot automatically

scatter.smooth(aggdat$start_speed ~ aggdat$game_num, degree=2, span=0.6, col="red", main="Scatter Smooth Version (Span=0.6)", xlab="Sequential Game Number", ylab="Starting Speed (Fastballs, MPH)")


The above plot isn't as pleasing to me as the ones I made manually. In general, by doing things manually you will have more control over the look of things, but if you're looking for something quick, "smoothscatter()" does just fine. One thing to remember, however, is that it has a different default for the polynomial than the "loess()" function, so if you want the same fit, you'll have to tell R that "degree=2".

There are other options in loess that I haven't covered today, including the ability to parametrically estimate some variables, while applying the loess function to others. This can come in handy if you think only some variables are non-linear, while others are linear in nature. If this interests you, definitely check into it. Other packages, like mgcv, allow for similar model types using slightly different smoothers and an extension of the smoothing function to Generalized Linear Models (binomial response, etc.). Hopefully this will give you a nice base to work with loess regression in your own work, but keep in mind that these tutorials are not a replacement for understanding the underlying mathematics that create the pretty pictures. Loess can be terribly misused in the wrong hands (especially with pitch-location smoothing), so it is important to understand WHY you are doing certain things, not just HOW to do it in R.

I don't currently have Pretty-R code up and running, as the Blogger HTML really effs with embedding it in here. All of the necessary code is included above (and remember if you want to save plots and pictures, use a graphics device like "png()").

10 comments:

  1. NOTE: Some of the plots say "Pitch Speed by Pitch Count", these aren't titled correctly. However, the AXES are titled correctly. My bad!

    ReplyDelete
  2. Millsy -- great stuff

    ReplyDelete
  3. This has been a very instructive post, and i look forward to gong through all your R posts. One thing i can't figure out is where the assumed 1.96 value for the approximate 95% confidence interval comes from. any guidance would be appreciated.

    ReplyDelete
  4. Thanks for stopping by, Mike.

    The 1.96 value is just for calculating the 95% confidence interval from the Standard Error for non-tiny data (really, you can just use 2). It comes from the z-table for the standard normal distribution.

    Here is a source that does a similar calculation:
    https://stat.ethz.ch/pipermail/r-help/2008-August/170011.html

    ReplyDelete
  5. Thanks for the post. I'm not very familiar with loess. I was wondering why you did not use the loess.predict function in your example.

    ReplyDelete
  6. From what I can gather, "predict" is a general function for models. "loess.predict" is simply the specific reference to the use of "predict" when implementing "loess".

    I don't know that R actually recognizes "loess.predict" itself as a function.

    ReplyDelete
  7. Sorry if my question is too silly.
    I have only 10 replicates and I want to do a regression, but my data is not linear. Do you think I can do a loess regression with this very reduced n?
    I also don't know how to select the best model for my data, any hint?
    Thanks for your post, amazing!

    ReplyDelete
  8. I would not run a loess regression on 10 points, but technically you *should* be able to "do" it. My guess is that you would be over-fitting your data.

    In selecting a "best model" it depends on what you are trying to do. If you mean by the optimal smoothing, you can use the mgcv package which uses cross-validation and smoothing splines (or loess if you like). For model selection outside of non-parametric models I would suggest checking out your many available regression books (Gelman and Hill's Regression and Multilevel Modeling book comes to mind, specifically for R).

    ReplyDelete
  9. Any thoughts on doing this analysis for batting average? My thought is that you need a logistic form of LOESS. Is that a thing or am I crazy?

    ReplyDelete
  10. Hi Evan, I would use generalized additive modeling for this. I do it for strike calls. See this post:

    http://princeofslides.blogspot.com/2013/07/advanced-sab-r-metrics-parallelization.html

    ReplyDelete