Tuesday, March 22, 2011

Umpire Strike Zones in 2010

I've been working on a new R program that grabs batter-pitcher-umpire level data and creates heat maps for given parameters. My ultimate goal is to create my own function and tool to grab any heat map I'm interested in with a single line of code (after sourcing the script, of course). This can be done pretty easily, and below I've presented my first attempt at using the function, in a movie format.

For the heat maps presented here, I used the 'mgcv' package in R, which fits a binomial GAM and picks the smoothing parameter with a cross-validation-type criterion. This is an important inclusion in writing a program to automate the creation of heat maps, as the variability, range of values, and sample size for pitches differ depending on the player or umpire being modeled. Using cross-validation, we get a smoothing parameter that is optimal, in some sense, for the data at hand for each individual umpire. This version of the GAM actually smooths with penalized regression splines, rather than a loess function. The ultimate result is pretty much the same though.
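As a rough sketch of the kind of fit described above (not the actual script behind these videos), the following runs a binomial GAM on simulated pitches and predicts called-strike probability over a grid. The column names `px`, `pz`, and `strike`, the made-up zone shape, and the grid size are all assumptions for illustration.

```r
# Sketch only: simulated pitches stand in for real pitch-location data.
library(mgcv)

set.seed(1)
n <- 3000
pitches <- data.frame(
  px = runif(n, -2, 2),     # horizontal location in feet (hypothetical column)
  pz = runif(n, 0.5, 4.5)   # vertical location in feet (hypothetical column)
)
# Made-up "true" zone: strike probability falls off away from (0, 2.5).
p_true <- plogis(4 - 4 * (pitches$px / 0.9)^2 - 4 * (pitches$pz - 2.5)^2)
pitches$strike <- rbinom(n, 1, p_true)

# Binomial GAM with one two-dimensional smooth of pitch location.
# method = "GCV.Cp" selects the smoothing parameter by GCV, or by UBRE/Cp
# when the scale parameter is known, as it is for the binomial family.
fit <- gam(strike ~ s(px, pz), family = binomial, data = pitches,
           method = "GCV.Cp")

# Predict strike probability on a regular grid for the heat map.
xs <- seq(-2, 2, length.out = 50)
zs <- seq(0.5, 4.5, length.out = 50)
grid <- expand.grid(px = xs, pz = zs)
grid$p_strike <- as.numeric(predict(fit, newdata = grid, type = "response"))

# expand.grid varies px fastest, so rows of the matrix index px, columns pz.
zmat <- matrix(grid$p_strike, nrow = length(xs))

# Draw only in an interactive session; wrap in png()/dev.off() to save a file.
if (interactive()) {
  image(xs, zs, zmat, zlim = c(0, 1),
        xlab = "Horizontal location (ft)", ylab = "Vertical location (ft)")
  contour(xs, zs, zmat, levels = 0.5, add = TRUE)  # 50% called-strike contour
}
```

The selected smoothing parameter is stored in `fit$sp`, which is the piece that changes from one umpire's data set to the next.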

Anyway, check out the videos below. My next step is working with swing rates, run values, swinging strike rates, home run rates, ball-in-play rates, etc. for players. These are a little trickier given the smaller sample sizes for players, and hence I will likely need to use a standard Gaussian loess function even for binomial data, as there are some serious problems with a GAM and small samples. I've done this already by umpire, by count. I'm not happy with the result of the loess for binomial strike zone calls, as the smoothing stretches way too far and the sample sizes are very small even for this method. They give the general idea of the relative strike zone changes by count (as J-Doug has been writing about at Beyond the Boxscore), but the visual is just misleading with respect to the actual strike zone size.
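To make the loess alternative concrete, here is a minimal sketch of a Gaussian loess fit to 0/1 outcomes on a small simulated sample. The column names, the sample size of 300, and the hand-picked `span = 0.75` are assumptions, not values from my own runs. Because the fit treats the 0/1 outcome as Gaussian, fitted values can fall outside [0, 1], so they are clamped before plotting.

```r
# Sketch only: a small simulated sample stands in for one player's pitches.
set.seed(2)
n <- 300
small <- data.frame(
  px = runif(n, -2, 2),     # horizontal location in feet (hypothetical column)
  pz = runif(n, 0.5, 4.5)   # vertical location in feet (hypothetical column)
)
# Made-up swing probability, highest near the middle of the zone.
small$swing <- rbinom(n, 1, plogis(1.5 - 1.5 * small$px^2 -
                                   1.5 * (small$pz - 2.5)^2))

# Standard (Gaussian) loess on the 0/1 outcome; the span is chosen by hand here.
lo_fit <- loess(swing ~ px * pz, data = small, span = 0.75, degree = 2)

# Keep the grid inside the observed data, since loess does not extrapolate
# by default and returns NA outside the range it was fitted on.
grid <- expand.grid(px = seq(-1.8, 1.8, length.out = 40),
                    pz = seq(0.7, 4.3, length.out = 40))
raw <- as.numeric(predict(lo_fit, newdata = grid))

# Clamp to [0, 1] so the surface can be read as a rate.
grid$p_swing <- pmin(pmax(raw, 0), 1)
```

A larger span smooths further across the zone, which is exactly the stretching problem described above, so the span here should be treated as a knob to tune rather than a recommendation.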

I've got a few ideas for this stuff, which I may advertise a bit later because I'll need some help to implement any of them. For now, enjoy the little slide shows below. Sorry I didn't provide each PNG file for your own inspection, but there are 78 umpires included in the data set (I removed some with extremely small sample sizes from 2010). Of course, I'm always happy to contribute some visuals to your website if you are interested in these.

In the videos below, the order of the umpires should be the same. Therefore, if you quickly click each one right after the other, they should start at about the same time and you can view RHB and LHB zones for the same umpire at the same time as it scrolls through.

I apologize for the crappy resolution in the videos. Apparently when it was converted it really messed with the quality of the images.

ANOTHER UPDATE: Thanks to the ability to embed a video from Facebook, I was able to improve on the resolution. Hooray for Facebook!


  1. Wow, that's really impressive. I'm actually trying to do something similar but what you have here is way better. If at all possible, can you explain the general process that you used? Did you create these videos within R?

    Again, great stuff

  2. Just Windows Movie Maker after writing a loop to create a PNG for each umpire and each batter handedness in R. It's really just a slide show. I'm a little pissed at Blogger for the video formatting, as they look nice and crisp on my computer. So I might load them up on my personal site.

    You can do animation in R with a simple loop, but when sticking it into an external file it's generally in GIF format or something, which ends up being pretty crappy (and I can't get the function to work anyway). Movie Maker literally takes 2 minutes with click and drag once the files are created. I'll have some more of these up here and possibly at Fantasy Ball Junkie as well.

    The real issue is the bandwidth and sample size. I'm sure you've run into these problems working with the smoothing in your own work (like the Cano article, btw). You can always just eyeball it and create them one by one on your own subsets of Pitch F/X data. But if you're changing bandwidths by eyeball and manually outputting each graph, you can run into biased visuals or misrepresent the data, and oftentimes the heat maps end up not being comparable in these cases either. That's where the MGCV package comes in, cross-validating the smoothing parameter dependent on each data set for each umpire.

    The down side is that GAM models need a lot more data, so breaking things down to samples of less than 1,000 ends up being a problem. I'm working on finding a good way to optimize span for standard loess. Oh, and be sure not to use the 'gam' package anymore. It doesn't seem to interface correctly with the newest version of R. The gam function in the 'mgcv' package seems to work fine though.

  3. Gotcha. I was able to do the loop and automate the process of making a million PNGs. I was just hoping there was a way to combine all the PNGs into a video or .gif in R in an efficient manner.

    And it's good to know about the MGCV package. Sounds useful. I know I'm guilty of using the eyeballing technique to check smoothing.
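
As a sketch of the per-umpire smoothing discussed in the comments above, the loop below fits a separate binomial GAM to each umpire's data and collects the smoothing parameter that 'mgcv' selects for each. The two simulated umpires, their sample sizes, and the column names are hypothetical.

```r
# Sketch only: two simulated umpires with different sample sizes and zone widths.
library(mgcv)

set.seed(3)
sim_umpire <- function(n, width) {
  d <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0.5, 4.5))
  d$strike <- rbinom(n, 1, plogis(4 - 4 * (d$px / width)^2 -
                                  4 * (d$pz - 2.5)^2))
  d
}
umpires <- list(ump_a = sim_umpire(1500, 0.9),
                ump_b = sim_umpire(4000, 1.1))

# One fit per umpire; the smoothing parameter is chosen separately each time.
fits <- lapply(umpires, function(d) {
  gam(strike ~ s(px, pz), family = binomial, data = d, method = "GCV.Cp")
})

# Selected smoothing parameter for each umpire's data set.
sp_by_umpire <- sapply(fits, function(f) unname(f$sp))
print(sp_by_umpire)
```

Opening a `png()` device before plotting each fit inside the same loop, and calling `dev.off()` afterward, would write one image per umpire for the slide show.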