Thursday, July 14, 2011

Sam Fuld, Bob Carpenter, and Statistical Inference Blog

Here is a quick post responding to a request by Bob Carpenter at one of my favorite nerd blogs: Statistical Modeling, Causal Inference and Social Science. While a lot of the Bayesian theory is out of my league, Dr. Gelman really makes you think about some applied statistical problems in social science.

Anyway, the request was for a quick scatter plot (I'm not going to go nuts and pull out Bugs code for some Bayesian Hierarchical Model or anything like that here!) of batter performance and ability to foul balls off in given counts (I could also do base-out states, but I'll keep it simple for now).

Luckily, I had R up and running with my Pitch F/X database already in. Of course, a full analysis would require understanding where the pitches are thrown that are being fouled off (along with velocity and pitch type), but then it gets a bit complicated. Anyway, here we go. I'll start with a quick table of averages for percentage of pitches fouled off in each count (please excuse the awful table formatting here).


Count   Foul %
0-1     17.61%
0-2     19.20%
1-1     20.46%
1-2     22.44%
2-1     23.30%
2-2     26.00%
3-1     21.48%
3-2     29.91%
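For anyone who wants to reproduce a table like this, here is a minimal sketch of the crosstab in base R. The data frame and its column names (`balls`, `strikes`, `foul`) are made up for illustration and are not my actual Pitch F/X schema; the numbers it produces have nothing to do with the table above.

```r
# Hypothetical pitch-level data: one row per pitch.
# Column names are assumptions, not the real database schema.
pitches <- data.frame(
  balls   = c(0, 0, 1, 1, 3, 3, 3, 0),
  strikes = c(1, 1, 1, 2, 2, 2, 2, 2),
  foul    = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Label each pitch with the count it was thrown in, e.g. "3-2"
pitches$count <- paste(pitches$balls, pitches$strikes, sep = "-")

# Share of pitches fouled off within each count, as a percentage
foul_pct <- 100 * tapply(pitches$foul, pitches$count, mean)
round(foul_pct, 2)
```

Because `foul` is logical, its mean within each count is simply the fraction of pitches fouled off.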

From this, we can glean that guys don't foul the ball much in 3-0 counts. This could be because they see easier pitches to hit and/or they're taking the pitch very often. Probably a combination of both. Keep in mind that these numbers are also biased. We don't see the same batters the same number of times in these different counts. Now for foul percent plotted against wOBA:

If anything, there's a slight downward trend here (as found before at Baseball Analysts, linked at the previous link). And finally, foul percentage plotted against wOBA for each count. Here, I removed outliers (well, outliers defined as more than 2 standard deviations above the average foul rate), as they should make up most of the players who did not get nearly enough at-bats for the foul rates to matter. This didn't work perfectly and there are some obvious anomalies likely due to low plate appearances, but I think we get a decent look at things. Also, the lower censoring (at 0) makes it more difficult to pick up a pattern in the plots. In addition, the plot includes player-seasons, not just players. So someone like Pujols will be in here 4 times (2007 through 2010):
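The outlier rule described above can be sketched roughly as follows. The player-season values are invented for illustration, and the names `player_seasons`, `foul_rate`, and `cutoff` are placeholders rather than anything from my actual code.

```r
# Made-up player-season foul rates (illustrative only).
# Batter "J" stands in for a low-sample player with an extreme rate.
player_seasons <- data.frame(
  batter    = LETTERS[1:10],
  foul_rate = c(0.18, 0.20, 0.22, 0.19, 0.21, 0.20, 0.23, 0.17, 0.21, 0.90)
)

# Cutoff: 2 standard deviations above the average foul rate
cutoff <- mean(player_seasons$foul_rate) + 2 * sd(player_seasons$foul_rate)

# Keep only the player-seasons at or below the cutoff
kept <- player_seasons[player_seasons$foul_rate <= cutoff, ]
```

Note that this only trims the high side. A direct filter on plate appearances would probably be a cleaner way to drop low-sample players, but that is not what was done here.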

It might be instructive to look at these same plots only for pitches swung at (so players aren't penalized for being selective at the plate) and/or only on pitches near the edges of the strike zone (so we're just looking at pitches that the players are fighting off). The analysis here doesn't show too much going on, but that doesn't mean there's nothing there.

Below, I've done the latter, with the same plots from above. I define the edge as 8 inches from the center of the plate and/or below 1.8 feet or above 3.3 feet vertically. Of course, you can define the edge in a number of ways. This is rough, quick code and I didn't have time to get into too much detail today:
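As a rough sketch of that edge definition, assuming `px` is the horizontal location in feet from the center of the plate and `pz` is the height in feet, and reading "8 inches from the center" as more than 8 inches to either side. The helper `is_edge` is hypothetical and written just for this illustration:

```r
# Flag pitches on the "edge": more than 8 inches off the center of the
# plate horizontally, and/or below 1.8 ft or above 3.3 ft vertically.
# px and pz are assumed to be in feet; the thresholds come from the text.
is_edge <- function(px, pz, horiz_in = 8, low_ft = 1.8, high_ft = 3.3) {
  abs(px) > horiz_in / 12 | pz < low_ft | pz > high_ft
}

# Four example pitches: middle-middle, wide, low, high
is_edge(px = c(0, 0.9, 0.2, -0.1), pz = c(2.5, 2.5, 1.5, 3.5))
```

Keeping the thresholds as arguments makes it easy to try other definitions of the edge.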

Keep in mind this is only for Pitch F/X data. That means some of 2007, and all of the 2008 through 2010 regular seasons. I try to wait until the end of the season to update my database each year. I imagine this would be more interesting with even more years of data (like from Retrosheet, as mentioned in the linked blog post). I think Dan Turkenkopf is going to try this out, as he says in the comments. Perhaps I'll extend this later to swings only as well.

Finally, one other thing to look at is whether pitchers really do get frustrated after a long string of foul balls and get burned throwing a pitch down the middle. There is probably a skill somewhere between fouling pitches off and flat out missing those pitches, just because a better batter likely makes contact more often. But in terms of purposefully trying to foul a pitch off (at least from my own experience playing baseball), I have doubts that guys go up there looking to 'spoil' pitches. To foul a pitch off, you have to make sure it doesn't hit the bat squarely, otherwise it would go into play. Hard to believe that in and of itself would be a repeatable skill. If you try to just edge the bat to the ball, you've got a good chance of missing it, too.

This is by no means a deep analysis, and I didn't do any sort of fantastic job at cleaning it up beforehand. Just some fun crosstabs and scatter plots.

Any thoughts from those of you reading this????


  1. Since you asked...

    (0) The lack of a common y axis scale makes comparing the multiple scatterplots nearly impossible. Faceting (ggplot2 or lattice) would be a huge improvement.

    (1) With such a large number of points, detecting trends by eye is very difficult. Adding a trend line to each panel (of some form) would aid immensely. Indeed, it might even make sense to _only_ plot the trend lines, without the points themselves...

  2. Thanks for the suggestions, jme2.

    One reason I had not plotted the trend lines was that this was a 'quick and dirty' exposition as I said before. The trend lines would likely be affected by some significant outliers with very few at bats. I'll take the jabs on any laziness there, and I certainly agree with your point.

    As for the scale of the y-axis, I suppose I could have put them on the same scale. I was simply trying to ensure that we could see the full range of variability in each. The variability of foul percent values is very different across counts, but perhaps I'll repost with your suggestions.

  3. I actually think that the table is clearer than the graphs--maybe because of the y-scale issue.

    While there's no obvious trend for wOBA, the counts are clearly different. Some of the differences probably represent conventional baseball wisdom: taking the pitch on a 3-0 count, and probably taking more first pitches too.

    Others might be suggestive of when hitters feel pressured to protect the plate and when pitchers feel pressured to throw strikes. We can't be sure from these data, but it would be interesting to see if hitters are swinging and missing more often on 0-2 counts than on 3-2 counts, for example, because the pitcher doesn't feel as much pressure to throw something that's a strike if the hitter doesn't swing.

    Have you given any thought to the application side of this? I wonder what other displays might be informative. Your very first question about whether these are biased makes me wonder if we might see a relationship just between counts and wOBA.

    Fun stuff, that's for sure!

  4. Millsy -

    No jabs intended! Making the plots faceted and adding trend lines (even using robust fitting methods) is only 2-3 lines max using ggplot2, if you're familiar with that package. That's the only reason I pointed it out...

    If you're unfamiliar with ggplot2 and willing to share the data (or a subset of it) I'd be happy to help...

  5. Thanks! This is just what I (Bob Carpenter -- and thanks for the citation) wanted to create.

    jme2's right that adding trend lines is really simple with ggplot2. If x and y are the data to scatterplot, it's just


    You can also control the smoothing method -- the default is loess. We've yet to be successful in getting Andrew Gelman to switch. Even after visits from Hadley Wickham (the ggplot2 developer).

    Those outliers are just what a hierarchical model would help clean up. Andrew and Jennifer Hill's book on multilevel regression is a great overview of these models that doesn't assume that much math. I found having the code for all the models really helped. After reading that, Gelman et al.'s Bayesian Data Analysis is much easier to understand.

    Speaking of Andrew, he brought up the point that it was considered bad sportsmanship to try to tire out the pitcher by consistently hitting foul balls. I don't know enough about baseball ethics to know if he's right.

  6. Thanks guys and gals (wasn't intending that they were jabs, I was giving one to myself). I have not yet ventured into ggplot, mainly because the functions I use elsewhere tend to create pretty plots like you see here and are flexible for my current R abilities:

    There's more code involved, but I do like the flexibility. It has been on my list to play around with for a while, and I like the sleeker look.

    I actually have both the books you mention, but have not ventured into the Bugs code at this point. I've contemplated using it on one of my current projects, though, and have recommended the book to some fellow students. I think the Gelman and Hill book is one of the best straightforward regression books for an applied scientist out there (in fact, I've also recommended it here).

    I'd really be surprised at a specific skill to foul balls off. That would require some pretty insane hand-eye coordination. As someone who used to pitch in college, I would certainly get frustrated when someone kept fouling them off, but certainly not bad sportsmanship in my eyes.
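
The faceting and trend-line suggestions from the comments could look something like the sketch below. It uses simulated data, not the Pitch F/X numbers, and the data frame `sim` and its columns are made up for illustration. `facet_wrap()` uses a common y-axis scale by default, which addresses the comparison problem raised in the first comment, and `geom_smooth()` adds a loess trend line to each panel.

```r
library(ggplot2)

# Simulated player-season data (illustrative only, not real results)
set.seed(1)
counts <- c("0-1", "0-2", "1-1", "1-2")
sim <- data.frame(
  count    = rep(counts, each = 50),
  woba     = rnorm(200, mean = 0.320, sd = 0.040),
  foul_pct = runif(200, min = 5, max = 40)
)

# One panel per count, shared axes, with a loess smoother in each panel
p <- ggplot(sim, aes(x = woba, y = foul_pct)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x) +
  facet_wrap(~ count)

# print(p) draws the figure
```

Dropping the `geom_point()` layer would leave only the trend lines, which was the other suggestion.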