Sunday, December 5, 2010

Rethinking 'loess' for Binomial-Response Pitch F/X Strike Zone Maps

So after a long hiatus, I'm back for today. I've been crazy busy with a number of different things--including getting engaged and helping plan out wedding dates and things of that sort--and unfortunately have not kept up here on this blog (or on Fantasy Ball Junkie, though we're waiting for the Fantasy Baseball Preseason to start up before posting much more). Teaching doesn't help, and I've been working on a few papers just submitted for review. Add in taking a class, doing some consulting work, and planning trips to Greece, Napa Valley, Canada, home to Maryland for X-mas and Thanksgiving, and possibly one to San Diego, The Prince of Slides has taken a backseat. Oh yeah, and I guess I should get writing that dissertation thing, but I'll procrastinate here for a bit...

Now that I'm done complaining, here's a relatively non-game-related post (it's more on assumptions behind our baseball data)...

I work a lot in R. If you've checked out previous posts here and at FBJ, almost all the analysis is exclusively done in R. I've been contemplating shifting my focus to how-to's for Sabermetrics with R. Unfortunately for you guys, I'm not Dave Allen, nor do I have as much time to devote to bringing you saber-type pieces. I've enjoyed the R-Bloggers site that feeds a number of these open-source types of blogs, and I think it's a great resource (hint, hint: go check it out). That's still up in the air, but either way, I think for any R-related posts, I am going to try and have the RSS Feed go there if they think I'm worthy of it.

Anyway, I had been working on improving my heat maps using a loess, rather than a 2-dimensional kernel density smoother. The plot scaling for smoothScatter does not work right and they are limited to density only--see my previous post or my IIATMS contribution for those plots. With loess , I could add a third variable to the mix represented by the heat colors (thanks to Dave Allen for his presentation from the Pitch F/X summit a couple years ago, as his Powerpoint slides sent me in the right direction to actually get 'filled.contour' to work in R--I still can't fathom as to why there isn't a simple 'plot.loess.2d' function in R that does this without the extra code and 'predict' data matrix creation).

I've been working with umpire data recently for strike zone maps (probability of a strike call), but these are applicable to pretty much any variable you want to plot with respect to location in the zone. But there's a problem/thought I came across (and I'll tell you what the true problem is a little later).

If you're not familiar with 'loess', it's just a regression, but it is not constrained to be a straight line or plane through the data. In other words, it allows wiggles to fit the data more closely. Of course, when using non-parametric methods, some things can be left up to the person running the analysis, one of them being the 'span' used for the loess smoothing.

Span is the width of the--smoothing or number of points/distance away depending on the type of function you're using--you can also use splines and nearest neighbor type smoothing methods--included in the smoother at each prediction point. In other words, if the span is 1 (in the context of how R defines it), then it will probably be smoothed too much. If the span is closer to zero, it becomes more (too) granular and doesn't give a smoothed representation of the strike zone. It's purely up to the researcher what the span will be and really depends on what you want to get out of the graph (though, there are some suggested rules of thumb and other ways to determine it 'optimally'). Essentially, you want to balance the fit with simplicity in presentation here.

What do I mean by 'smoothed too much'? Well, if you use the entire data set to calculate the average probability for every point, irrespective of it's location, the whole thing will be one color. On the other hand, if you use only each single data point to describe whether or not that pitch will be called a strike, you may as well just make a scatter plot. The idea is to fit the data well, but be able to summarize it more clearly and concisely.

I began using Dave Allen's code with the parameters he had chosen. If you look below, I have the strike zone maps for Bruce Froemming in 2007. This uses a standard loess smoother in R with a span on 0.6. But look closely at the borders of the Heat Map.

I have a hard time believing that the strike zone for an umpire looks like it does in these plots: higher probabilities of points well outside the strike zone than those closer to it. There seems to be a bit too much smoothing going on with my current span, so I fiddled around with it for a while and something hit me: for Strike Zone Maps, we don't have a normal distribution, but rather a binomial distribution. The pitch is either a strike (1) or a ball (0). While this isn't a huge problem in general--we can run linear probability models instead of probits/logits after all--it can be misleading when looking at strike zone maps (and you also have to go back and rescale your variable before you plot it, or you'll get probabilities above 1 and below 0).

(**SIDE NOTE: Dave's plots don't show this, from what I've seen, so the specifications he has in the presentation are almost certainly different than he uses for the strike zone maps I've seen. And I think his work is awesome. Finally, there are alternative solutions to fixing up the plots while still using a standard loess, but this post is simply to inform you of some other options).

Another option would be to reduce the span/increase the span (but then we'd either get a dot in the upper right OR make the strike zone in general look too large/smoothed out).

Finally, we could change the degree of the polynomial we use. The default is to use a second degree polynomial, which results in allowing more 'wiggle' in the loess. If we constrain it to 1, the plots look at little bit better. This is good to know, and it seems that generally, the loess smoothing works well if the parameters are chosen correctly.

Anyway, with a standard loess in R, here is what we see (span = 0.6, scaled for the ratio of the height vs. width, so the span for horizontal location is a little smaller):



Now, there are obviously a few outliers or problems in the data (see the one point in the top right below int he scatter plot), and that's likely why we get a little green at the bottom-left and top-right of the plots.





But remember we fit the original plots without bothering to think about the distribution of our data. Loess generally doesn't cover binomial data. Again, it may not be a huge deal, but it can affect some things we look at, and we'll have to bound/recode our data from 0 to 1 after we predict it (or rescaling it to 0, 1 which are actually different things--the latter would probably be preferred, but my code below does the former).

This means that we may not be getting a very good idea of the data points within the strike zone. If the ones right down the middle have a predicted probability of 1.40 using the loess, while the ones closer to the edges are at 1.10, we might want to know that these are not the same thing. But we artificially bound the data. Another option is to simply bound the color key on the right, but then points above 1 and below 0 are not included in our color coding! What to do?

Luckily, R has a package called 'gam' (Generalized Additive Models) that allows us to fit a loess regression using the binomial family and a logit link function similar to the glm package. What happens when we do this (using the same span of 0.6 for height and scaled accordingly for width)?

Here's what we get:





















This seems like a much better-behaved strike zone map than the one using a standard loess regression. While in general, we see fairly similar results, there could be some misleading observations based on the family of distributions that we decide to model the data. Ultimately, the choice of your model and the parameters you stick in there will be important for your visualization, but I'm not sure it's make or break. I guess the thesis of this post is to make sure you think about your data first.

However, I'm curious what others think of using the 'gam' package for this. It looks like it makes the plot a perfect mirror image above and below the 2.5 foot line (draw a horizontal line there and flip it). I'm not certain why it does this, which makes me want to go back to the drawing board. I may try to simulate my own extremely non-symmetric data and see what happens along the horizontal axis. The last I heard, the 'gam' package was somewhat experimental, so perhaps using this is a bit out of date (anybody know). Feel free to take my code and fiddle with it! Change the degree, change the span, change the colors, then tell me mine are stupid and yours are way better!



I would love to hear suggestions about this. My first thought is that I'm missing something with the 'gam' package that results in the 'mirror image' look. So what do you think? Does the 'gam' package or the standard 'loess' give a better view of what's really going on in that scatter plot above? I'm happy if you want to tell me I'm totally wrong on this as well or of other packages that can do similar things for binomial data. For more continuous data, loess should work just fine (like run values, though that's technically not a continuous variable if we're using some category of run values like "Man on 1st, Two Outs, Single to Center Field" type of data...but that's getting nit-picky and I don't think it matters much). Like I said before, messing with the Span and changing the polynomial degree in the loess function generally makes things better. But we still want to model the data correctly, right?

Another possibility I'd like to play with is using smoothing splines to see if there are any differences. Both 'gam' and the 'mgcv' package, I think, allow you to do splines rather than loess. I imagine the results will be pretty much the same. And since we're not toooooo worried about testing hypotheses with this plot, I'm not extremely worried about the error distribution. The main worry is at the edges of our plot, where things may be overestimated without bounding the data with the logit function. Just something to keep in mind.

I have provided the code below for producing the maps using All of the data, keeping in mind that my data is structured from Mike Fast's code and I created a numeric representation of the 'type' variable called "call_type" where Strike=1 and Ball=1. You can always select the data specifically for RHB, or LHB, or pitch-type, or whatever you want! My data are pre-filtered in SPSS to only include Bruce Froemming's 2007 data for CALLED pitches.

Another request if anyone bothers to read this blog: I was wondering if there is an easy way to append catcher ids to the pitch database structure developed by Mike Fast (Mike, if you're out there, please help!). I looked through the XML files from the Gameday database, and it looks like they have Starting Catcher in the player file, but figuring out who is catching Pitch-by-Pitch is not as straight forward as for the pitcher and batters that are included int he inning files.

To do it from the XML, it looks like there would have to be a search in the play-by-play to see when a catcher was switched in or out, and I'm not sure it's even possible to set up some sort of macro to do this, because "Defensive Replacement" has a generic code, so only identifying catchers would be tedious. BUT, I know that there are people out there with catchers appended to their Pitch FX data, so maybe it's not as difficult as I'm making it out to be.



Code (please excuse the crappy Blogger formatting, I'm not an HTML guy so I don't have much control over this and the standard text editor does not let me indent code (ARGH!); this is for the general plots, but the RHB/LHB work just the same, you just need to subset the data):

Correction: found a way to insert colored R code using "Pretty R" html. Just copy and paste, and you're good to go. Still not perfect below (the html tool left out a couple parenthesis, so I'm sorry about that), but it should be helpful when trying to read it!


ump <- read.csv(file="umpire_10.csv", h=T)

head(ump)
attach(ump)

#use loess and filled.contour

sz.width <- 0.708335 - (-0.708335)
sz.height <- mean(ump$sz_top)- mean(ump$sz_bot)
aspect.ratio <- (max(px)-min(px))/(max(pz)-min(pz))

fit <- loess(call_type ~ px + pz, span=c(0.5*aspect.ratio, 0.5), degree=1)
myx <- matrix(data=seq(from=-2, to=2, length=30), nrow=30, ncol=30)
myz <- t(matrix(data=seq(from=0,to=5, length=30), nrow=30, ncol=30))
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))
mypredict <- predict(fit, fitdata)
mypredict <- ifelse(mypredict > 1, 1, mypredict)
mypredict <- ifelse(mypredict <0, 0, mypredict)
mypredict <- matrix(mypredict,nrow=c(30,30))

png(file="FroemmingAll.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=30), y=seq(from=0, to=5, length=30), z=mypredict, zlim=c(0,1), nlevels=50,color=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")), main="Bruce Froemming Strike Zone Map", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.708335, mean(ump$sz_bot), 0.708335, mean(ump$sz_top), border="black",
lty="dashed", lwd=2)
},
key.axes={
ylim=c(0,1.0)
axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0,
col="black")
})
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()

############trying to use the 'gam' package instead of loess because data is binary

library(gam)

####all batters

attach(ump)

fit.gam <- gam(call_type ~ lo(px, span=.5*aspect.ratio, degree=1) + lo(pz, span=.5, degree=1), family=binomial(link="logit"))
myx.gam <- matrix(data=seq(from=-2, to=2, length=30), nrow=30, ncol=30)
myz.gam <- t(matrix(data=seq(from=0,to=5, length=30), nrow=30, ncol=30))
fitdata.gam <- data.frame(px=as.vector(myx.gam), pz=as.vector(myz.gam))
mypredict.gam <- predict(fit.gam, fitdata.gam, type="response")
mypredict.gam <- matrix(mypredict.gam,nrow=c(30,30))
png(file="FroemmingAllGAM.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=30), y=seq(from=0, to=5, length=30), z=mypredict.gam, axes=T, zlim=c(0,1), nlevels=50,
color=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")),
main="Bruce Froemming Strike Zone Map (GAM Package)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.708335, mean(ump$sz_bot), 0.708335, mean(ump$sz_top), border="black",
lty="dashed", lwd=2)
},
key.axes={
ylim=c(0,1.0)
axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0,
col="black")
})
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()

6 comments:

  1. sweet post
    like how you displayed all the code you used

    ReplyDelete
  2. Thanks, rz. Check out the 'Pretty R Tool' link on the sidebar. It works pretty well, but left a character out here and there (and apparently, I should not have named my data 'ump', as it is an R function!).

    ReplyDelete
  3. Hi there,

    Is there a way to get your umpire_10.csv file so I can reproduce this example? I'm personally interested in detecting / dealing with outliers and these data could provide a very cool example.

    Thanks,

    Matias
    msalibian@yahoo.ca

    ReplyDelete
  4. Matias,

    I went ahead and sent you the data. If you have any suggestions, I would love to hear them...especially given your expertise.

    Any thoughts on moving from 'gam' to your 'rgam' package (is it available yet on CRAN?).

    ReplyDelete
  5. Hello Sir, I'm a fan of MLB and Pitch f/x data.
    I looked many sites but your heatmap is one of the best I could follow.
    But like Mr.Matias up there, I'm very curious on variable "Call_type".

    It would be very grateful for me to get a umpire_10.csv or any csv file that contains call_type.

    Thank you,

    Minkoo Song
    admin@pitchfx.kr

    ReplyDelete
  6. Hi Minkoo,

    I seem to have misplaced this data file somewhere. However, you can find all the information about downloading the data at Mike Fast's old blog Fastballs. You can also get data downloads from other sites around the web.

    ReplyDelete