Monday, January 31, 2011

Fangraphs Heat Maps Using R?????

Fangraphs has a new capability with Pitch F/X data that I was hoping they would provide at some point. It looks like they're using the R function smoothScatter, which I worked with here at my blog a while back. I also presented these at IIATMS. Others have highlighted the use of the function for Pitch F/X analysis (including Dave Allen in a Fangraphs post and Harry Pavlidis in a THT post). Some notes on this function and the Fangraphs page:

1. I do like the color scheme on the plots.

2. As noted in the comments, there need to be some axis labels.

3. One of the things I disliked about the function was the difficulty of setting my axes correctly. Others have emailed me about the same problem when trying to use my R code.

4. The smoothing doesn't seem to work well for outlying points in the function. You end up with individual dots on the outer edges, which defeats the purpose of a heat map display.
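
For points 2 through 4, here is a minimal sketch in base R using simulated pitch locations. The variable names (`px`, `pz`), the axis ranges and the strike zone bounds are placeholders of mine, not Fangraphs' data or code. `smoothScatter` takes `xlab`/`ylab` and `xlim`/`ylim` directly, and its `nrpoints` argument sets how many points from the lowest-density regions get drawn on top of the image (the default is 100), which is probably where the stray dots on the outer edges come from:

```r
# Simulated pitch locations in feet (hypothetical data, not real Pitch F/X)
set.seed(42)
px <- rnorm(2000, mean = 0,   sd = 0.8)  # horizontal location
pz <- rnorm(2000, mean = 2.5, sd = 0.8)  # height

# Fix the plotting window so different maps share the same axes
x_range <- c(-3, 3)
z_range <- c(0, 5)

smoothScatter(px, pz,
              nrpoints = 0,  # do not overplot points from low-density regions
              xlim = x_range, ylim = z_range,
              xlab = "Horizontal location (ft)",
              ylab = "Height (ft)")
title(main = "Pitch location density")

# Rough strike zone for reference (approximate bounds, an assumption)
rect(-0.83, 1.5, 0.83, 3.5, border = "black", lwd = 2)
```

Setting `nrpoints = 0` only hides the overplotted points. It does not change how the density itself is estimated.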

I was wondering if Fangraphs would implement this with Dave Allen and Albert Lyu on the staff there. I'm not sure if they used some of the basic code from my site, but it would be kind of cool if they did. One thing I wondered about when I started sharing my ideas, releasing R code and building tutorials here for free was whether it would reduce any competitive edge I had from knowing R and being able to reproduce things that Dave and Albert (and Jeremy Greenhouse) do at Baseball Analysts and Fangraphs. Interestingly, it has done the opposite, and I've had some fantastic inquiries about the things I post here. I look forward to any improvements in the heat maps.

I think the Fangraphs feature is a great tool, but I am monitoring a conversation on Twitter that indicates some hardcore analysts are worried about the repercussions of non-experts using the maps. There is a lot of work to be done with respect to bias in the data based on a number of factors. Mike Fast has highlighted this in the past. The maps are great to look at, but I agree that making too many inferences becomes dangerous. And these are simply location maps, which keeps the possible inferences to a minimum. Hopefully those looking at them understand this.

Addendum: Dave Allen tells me that R may be a bit slow for feeding these things through at Fangraphs. So I'm actually not sure now if R is being used.


  1. Completely agree, Millsy. I'll enjoy using these for quick lookups, but I do agree with Colin, Dan Turk, etc. that people may take graphs and misuse them in their posts. But I think that's OK, because people will get corrected, which will inspire posts like Mike Fast's THT article. And for the most part, people who slap on their own graph and misinterpret it will be found out.

    Even besides the smoothScatter function, there are plenty of reasons to be wary about making heat maps where the smoothing parameter can exaggerate hot spots... the more I use heat maps with R for my articles, the more I realize how hard it is to represent the data accurately.
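
To see how much the smoothing parameter can change the picture, here is a rough sketch that plots the same small simulated sample with two different `bandwidth` values in `smoothScatter`. The data, the sample size and the two bandwidths are arbitrary choices for illustration, not values anyone in this thread has recommended:

```r
# A small simulated sample of pitch locations (hypothetical data)
set.seed(7)
px <- rnorm(150, mean = 0,   sd = 0.8)
pz <- rnorm(150, mean = 2.5, sd = 0.8)

# Two bandwidths to compare: a narrow one and a wide one (arbitrary values)
bandwidths <- c(0.1, 0.5)

op <- par(mfrow = c(1, 2))  # two panels side by side
for (bw in bandwidths) {
  smoothScatter(px, pz,
                bandwidth = c(bw, bw),  # same bandwidth in both directions
                nrpoints = 0,
                xlab = "Horizontal location (ft)",
                ylab = "Height (ft)")
  title(main = paste("bandwidth =", bw))
}
par(op)  # restore the previous plotting layout
```

With the narrow bandwidth, small clusters of pitches tend to show up as separate hot spots. With the wide one, they blur into a single blob. Neither picture is obviously the "right" one for 150 pitches.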

  2. Albert,

    I've had a lot of trouble implementing the heat maps for data sets with fewer than about 5,000 pitches or so. This is especially true for the 'gam' models for binary data, such as modeling the probability of putting a ball in play. If a player made contact with a single pitch 2 feet outside the strike zone, the function goes to sh*t in the visualization and it's all red way outside, while all blue in the middle of the plate (because it models a 100% contact rate in that one area).

    In sum, I've found the smoothing to be not as straightforward as we'd like with smaller data sets.

    I actually received some inquiries about using my data and R-code from a statistician working on a version of GAM ('gam' package in R) that is robust to outliers. Matias Salibian-Barrera at UBC has code for an outlier-robust GAM model here:

    Unfortunately, the cross-validation (CV) for the bandwidth doesn't work in two dimensions yet, and when I tried to use the function my computer went haywire.
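
The small-sample problem described in this comment can be poked at with a sketch like the one below. It uses the ordinary `gam` function from the `mgcv` package, which is an assumption on my part: it is not the robust GAM code mentioned above, and not necessarily the setup used for the plots on this blog. The data are simulated, with contact unrelated to location plus one outlying pitch that was put in play:

```r
library(mgcv)

# Simulated pitches: contact does not depend on location (hypothetical data)
set.seed(1)
n <- 300
pitches <- data.frame(
  px      = rnorm(n, mean = 0,   sd = 0.5),
  pz      = rnorm(n, mean = 2.5, sd = 0.5),
  contact = rbinom(n, size = 1, prob = 0.3)
)

# Add a single pitch far outside the zone that was put in play
pitches <- rbind(pitches, data.frame(px = 3, pz = 2.5, contact = 1))

# Smooth surface for the probability of contact over location
fit <- gam(contact ~ s(px, pz), family = binomial, data = pitches)

# Compare the fitted probability at the middle of the plate
# with the fitted probability at the outlying pitch
check <- data.frame(px = c(0, 3), pz = c(2.5, 2.5))
p <- predict(fit, newdata = check, type = "response")
print(p)
```

How far the fitted probability at the outlying location drifts from the overall rate depends on the data and on how heavily the smooth is penalized, so it is worth rerunning with different seeds and sample sizes before trusting any map built this way.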