A recent post by Mike Fast got me thinking a bit more about color. For most people, color is an afterthought. But I am here to tell you it should be a primary concern in your statistical presentations. This is especially true when analyzing the strike zone.
Before you begin reading this, please read Mike's excellent post over at Baseball Prospectus. Then, go ahead and read this article at Praiseball Bospectus (linked not because of its title--I am very glad Dave Allen was born--but because it really does highlight some issues with things you'll find around the net).
Okay, now that you have read that, here are my additional comments. First, heat maps should be approached with caution. This is true whether or not you are smoothing or simply breaking up the zone into smaller areas. Mike covers this well, but I will take it a little further with smoothing.
When you use a smoothing technique, you really need to understand what it is doing. I'm not going to fully describe loess techniques (or smoothing splines, or kernel density functions, etc.). There are plenty of resources online. Oftentimes the degree of smoothing is up to the researcher. However, it is almost always the case in baseball analysis that we want to compare one smoothed representation or heat map to another one. This is where things get tricky. You'll need to make sure you are not undersmoothing (too wiggly) or oversmoothing (not wiggly enough).
Sample size is of course the first issue. If you are going to present BABIP by pitch location for a single batter or pitcher, you are likely going to need to regress the data a lot. Pitch data are extremely noisy. Second, you really need to account for the fact that the batter CHOSE to swing at those pitches. The pitches that a batter swings at are a subset of all pitches thrown, with a distribution of their own. The pitches he makes contact with are, in turn, a subset of those. Sometimes, this is no big deal. Other times, it is an extremely big deal.
Let's ignore the second issue above and focus first on sample size and comparisons. Next, let's restrict ourselves to evaluating the likelihood of a hit on contacted pitches. Let's say one batter has made contact with about 1600 pitches, while another has made contact with about 250 pitches in our sample. Let's just ignore the 'regression to the mean' issue here as well. You know what, let's make it the same batter in both cases, with the second being a random sample of the first bunch. Let's use exactly the same smoothing parameters for each, with no restriction for the distribution being binomial (which technically it should be--more thoughts on this issue here). The result is extremely rough and somewhat ugly, and keep in mind I am not regressing here, just showing what happens with the different sample sizes:
Because I have not restricted the data to be between 0 and 1, just assume the white splotches are where the probability of a hit on a ball in play is 0% (i.e. white==really really cold zone--I will leave aside the VERY IMPORTANT issue of ensuring the same color scaling on the sidebar for another post!). You can see above that, even though we're looking at the same player, the maps are very different. There are likely many problems here, as we would not expect pitches low and down the middle (remember, this is all within the strike zone) to have almost a 0% chance of going for a hit. Why? Well, the player above is Albert Pujols. Plus, when we look at the full data on balls in play, we see that the probability is closer to what we would expect (though, according to these data, still a cold zone).
You can also see that one plot shows a hot zone on the outer half, while the subset shows hot zones up and in as well as at the bottom of the zone. This is a result of having very little data in these areas, so the few pitches that are there end up overweighted with the given smoothing parameter. If Pujols gets one hit out of two pitches at the knees, the map reports his BABIP there to be .500 if we do not smooth enough or weight it along with other player data. Of course, we wouldn't expect him to have a .500 BABIP in the future on these pitches. Throw him 1000 pitches there to swing at, and he is really not likely to get 500 hits.
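To put rough numbers on the one-for-two example, here is a minimal Python sketch of weighting an observed rate along with other data. The .300 baseline and the weight of 100 "phantom" balls in play are assumptions I made up for illustration, not fitted values, and `shrunk_rate` is a hypothetical helper rather than anything from a package.

```python
def shrunk_rate(hits, n, prior_mean=0.300, prior_weight=100):
    """Pull an observed hit rate toward a baseline.

    prior_mean and prior_weight are illustrative assumptions:
    the baseline BABIP and how many balls in play it counts for.
    """
    return (hits + prior_mean * prior_weight) / (n + prior_weight)


raw = 1 / 2                    # one hit on two balls in play
regressed = shrunk_rate(1, 2)  # the same two balls in play, regressed

print(f"raw: {raw:.3f}, regressed: {regressed:.3f}")
```

With these made-up settings, the one-for-two cell drops from .500 to about .304, while a cell with 500 hits in 1000 balls in play would barely move. That is the behavior we want: small samples lean on the baseline, large samples speak for themselves.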
So, with the same smoothing parameters, these plots really are not comparable to one another.
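A quick back-of-the-envelope check shows why. The sketch below (Python) computes the binomial standard error of a hit rate in one cell of a grid, assuming a true hit probability of .300, a 5x5 grid, and contacts spread evenly across the cells. All three are simplifying assumptions of mine, not properties of the Pujols data.

```python
import math


def binomial_se(p, n):
    """Standard error of an observed proportion from n trials."""
    return math.sqrt(p * (1 - p) / n)


P_HIT = 0.30  # assumed true probability of a hit on contact
CELLS = 25    # assumed 5x5 grid over the strike zone

for contacts in (1600, 250):
    per_cell = contacts / CELLS  # assumes contacts are spread evenly
    se = binomial_se(P_HIT, per_cell)
    print(f"{contacts} contacts: {per_cell:.0f} per cell, SE about {se:.3f}")
```

Under these assumptions the per-cell standard error is roughly .057 with 1600 contacts and roughly .145 with 250, so the same amount of smoothing is being asked to clean up about two and a half times as much noise in the smaller sample.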
Now, we could reconsider the smoothing parameter for the smaller data set (probably a good idea!). However, the problem is that we don't know at what point of smoothing we're overfitting or underfitting. You can imagine the problem is much more difficult when we are comparing two players against one another.
One way to attack this issue is with a generalized cross-validation technique (this can be done with the "mgcv" package in R). Using this method, I have found that we need a large sample size of pitches. The method really breaks down for the small subset; however, it not only allows for a binomial representation of the data (rather than smoothing it with an assumed Gaussian distribution), but also optimizes the smoothing parameter, which lets us compare across different sample sizes and distributions of pitches and BABIP.
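To give a feel for letting the data pick the amount of smoothing, here is a small Python sketch. To be clear, this is not what mgcv does: mgcv fits penalized regression splines and scores them with generalized cross-validation, while this is a much simpler stand-in that picks the bandwidth of a one-dimensional Gaussian kernel smoother by leave-one-out cross-validation, scored with binomial log loss. The data are simulated, and every function name here is my own.

```python
import math
import random


def kernel_estimate(x0, xs, ys, bandwidth, skip=None):
    """Gaussian-kernel weighted hit rate at x0, optionally leaving one point out."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        num += w * y
        den += w
    return num / den if den > 0 else 0.5


def loo_log_loss(xs, ys, bandwidth):
    """Average leave-one-out binomial log loss for a given bandwidth."""
    eps = 1e-6
    total = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        p = kernel_estimate(x, xs, ys, bandwidth, skip=i)
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def choose_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the lowest leave-one-out loss."""
    return min(candidates, key=lambda h: loo_log_loss(xs, ys, h))


random.seed(1)
# Simulated contacts: horizontal location in [-1, 1], with a hit
# probability that rises gently from .20 to .40 across the plate.
xs = [random.uniform(-1, 1) for _ in range(400)]
ys = [1 if random.random() < 0.30 + 0.10 * x else 0 for x in xs]

candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
full_h = choose_bandwidth(xs, ys, candidates)
small_h = choose_bandwidth(xs[:60], ys[:60], candidates)
print("full sample:", full_h, "| first 60 contacts:", small_h)
```

The point of the exercise is that the full sample and the small subset each get the bandwidth their own data can support, rather than one hand-picked value for both. A real analysis would smooth in two dimensions and use something like mgcv with a binomial family.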
Okay, I could go on for a loooooong time and get really "mathy" with the considerations I mention above. However, I'll just point everyone toward the book on Generalized Additive Modeling by Simon Wood (2006). It is honestly one of the best resources I have ever come across in statistics, but to implement this with Pitch F/X you generally need a pretty large data set. One needs to be careful and be sure to fully understand all of the options that can be implemented. Using this method with the right type of data, you can ultimately create something like this (a strike zone map):
Before I get too far off on a tangent, let's return to the initial point of this post: COLOR. For this, I'll stick with strike zone maps.
The first question is: Why in the hell would we want to use color anyway?
The answer is: It can help to communicate muddy scatterplots more easily.
For example, below we have three scatterplots: Called Balls, Called Strikes, and the two combined. It is easy to tell where the definite strikes and definite balls are, but when we overlap the two plots, the strike call likelihood becomes nearly uninterpretable at the edges.
Another consideration, noted by J-Doug, is color blindness. The Green-to-Red plots for BABIP are likely a poor choice (as are the scatter plots shown above!). Many people (about 8% of males) have trouble telling greens from reds. So using these within the same image is a bad idea. One way to evaluate your colors is to see if they are interpretable in black and white. Let's check out the strike zone plot in black and white:
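The crudest way to turn overlapping dots into a number you can color is to bin the pitches and compute the share called strikes in each bin. Here is a minimal Python sketch of that idea. The six pitches are made up purely for illustration, the half-foot cell size is arbitrary, and `strike_rate_by_cell` is a hypothetical helper.

```python
import math
from collections import defaultdict


def strike_rate_by_cell(pitches, cell_size=0.5):
    """Share of called pitches ruled strikes in each grid cell.

    pitches: iterable of (x, z, is_strike), with x and z in feet.
    Returns {(column, row): strike fraction}.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [strikes, total]
    for x, z, is_strike in pitches:
        cell = (math.floor(x / cell_size), math.floor(z / cell_size))
        counts[cell][0] += int(is_strike)
        counts[cell][1] += 1
    return {cell: strikes / total for cell, (strikes, total) in counts.items()}


# Made-up called pitches, for illustration only.
pitches = [
    (0.1, 2.5, True), (0.2, 2.6, True),                      # middle of the zone
    (0.9, 2.5, True), (0.8, 2.7, False), (0.95, 2.6, False),  # the edge
    (1.6, 2.5, False),                                        # well outside
]

for cell, rate in sorted(strike_rate_by_cell(pitches).items()):
    print(cell, round(rate, 2))
```

The middle cell comes out all strikes, the outside cell all balls, and the edge cell lands in between, which is exactly the region the overlapped scatterplot cannot show. A smoothed map does the same job without the hard bin boundaries.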
With the colors I use, it seems that for someone with complete color blindness, I have failed this test. However, with some knowledge of a strike zone, this person would be able to understand that the dark within the zone is high strike probability, while the dark outside the zone is low strike probability. They are also able to find that spot where the strike probabilities are changing the most (but this likely isn't satisfactory). Which brings me to my next consideration...
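You can run the black-and-white test on a palette before you ever draw the plot. The Python sketch below converts a two-ended (diverging) color ramp to gray using the standard Rec. 601 luma weights. The blue-white-red ramp is an assumed stand-in, not the exact palette from my plot, and the helper names are my own.

```python
def to_gray(rgb):
    """Approximate brightness of an (R, G, B) color on a 0-255 scale (Rec. 601 luma)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def lerp_color(c0, c1, t):
    """Linear blend between two RGB colors, t in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(c0, c1))


def diverging_ramp(low, mid, high, steps):
    """Colors running from low through mid to high."""
    ramp = []
    for i in range(steps):
        t = i / (steps - 1)
        if t <= 0.5:
            ramp.append(lerp_color(low, mid, t * 2))
        else:
            ramp.append(lerp_color(mid, high, (t - 0.5) * 2))
    return ramp


# Assumed stand-in palette: blue for low, white in the middle, red for high.
BLUE, WHITE, RED = (0, 0, 255), (255, 255, 255), (255, 0, 0)

grays = [round(to_gray(c)) for c in diverging_ramp(BLUE, WHITE, RED, 5)]
print(grays)
```

The gray values climb to the white midpoint and then fall again, so a low value and a high value can land on nearly the same shade of gray. That is the failure described above: without color, dark means "very likely a strike" in one place and "very likely a ball" in another.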
How you use color should depend on what you want to highlight for the reader. In the first heat maps, we may want to be able to read the smoothing across the strike zone at a very granular level. However, for the strike zone map above, we may be more interested in the place where the likelihood of a pitch being called a ball becomes higher than the likelihood of the pitch being called a strike (here, the yellowish-whitish band).
When considering interest in the gradual changes across a heatmap, I find it a good idea to use a single color. This way, there is not this "breakpoint" from red-to-blue or from green-to-yellow. The same color gets lighter and lighter as you go. Below I have an example of using the same color for determining densities of called strike locations (i.e. where they are thrown least or most). Here, I use a "red-to-white" palette and then switch it to "white-to-red".
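For what it's worth, a single-color ramp is easy to build by hand. The Python sketch below blends one color toward white in equal steps, flips it for the reversed version, and checks that brightness (Rec. 601 luma) only ever moves in one direction. Pure red as the base color and five steps are arbitrary choices of mine, and `single_hue_ramp` is a hypothetical helper.

```python
def to_gray(rgb):
    """Approximate brightness of an (R, G, B) color on a 0-255 scale (Rec. 601 luma)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def single_hue_ramp(color, steps, reverse=False):
    """Colors from the given color to white in equal steps (or white to color)."""
    white = (255, 255, 255)
    ramp = [
        tuple(c + (w - c) * i / (steps - 1) for c, w in zip(color, white))
        for i in range(steps)
    ]
    return ramp[::-1] if reverse else ramp


RED = (255, 0, 0)  # arbitrary base color
red_to_white = single_hue_ramp(RED, 5)
white_to_red = single_hue_ramp(RED, 5, reverse=True)

grays = [to_gray(c) for c in red_to_white]
is_monotonic = all(a < b for a, b in zip(grays, grays[1:]))
print("brightness always increasing:", is_monotonic)
```

Because brightness never doubles back, every density maps to its own shade, and the plot still reads in black and white. Whether "more" should be the dark end or the light end is the red-to-white versus white-to-red choice.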
I would love to get comments on others' opinions about and experiences with using color. I have not gone too in-depth, but I hope to follow up with a number of examples of color use for the same image and how this can allow highlighting certain areas of a visual. Also, with feedback, we can try and develop a consensus on what the optimal choices are for the majority.