Saturday, October 29, 2011

Maximizing Sabermetric Visual Content: Smooth Comparisons and Leveraging Color

A recent post by Mike Fast got me thinking a bit more about color. For most people, color tends to be an afterthought. But I am here to tell you it should be a primary concern in your statistical presentations. This is especially true when analyzing the strike zone.

Before you begin reading this, please read Mike's excellent post over at Baseball Prospectus. Then, go ahead and read this article at Praiseball Bospectus (linked not because of its title--I am very glad Dave Allen was born--but because it really does highlight some issues with things you'll find around the net).

Okay, now that you have read that, here are my additional comments. First, heat maps should be approached with caution. This is true whether or not you are smoothing or simply breaking up the zone into smaller areas. Mike covers this well, but I will take it a little further with smoothing.

When you use a smoothing technique, you really need to understand what it is doing. I'm not going to fully describe loess techniques (or smoothing splines, or kernel density functions, etc.). There are plenty of resources online. Often the degree of smoothing is up to the researcher. However, it is almost always the case in baseball analysis that we want to compare one smoothed representation or heat map to another one. This is where things get tricky. You'll need to make sure you are not undersmoothing (too wiggly) or oversmoothing (not wiggly enough).

Sample size is of course the first issue. If you are going to present BABIP by pitch location for a single batter or pitcher, you are likely going to need to regress the data a lot. Pitch data are extremely noisy. Secondly, you really need to account for the fact that the batter CHOSE to swing at those pitches. The pitches that a batter swings at are a subset nested within all pitches thrown. Then, the pitches he makes contact with are yet another subset of that distribution. Sometimes, this is no big deal. Other times, it is an extremely big deal.

Let's ignore the second issue above and focus first on sample size and comparisons. Next, let's restrict ourselves to evaluating the likelihood of a hit on contacted pitches. Let's say one batter has made contact with about 1600 pitches, while another has made contact with about 250 pitches in our sample. Let's just ignore the 'regression to the mean' issue here as well. You know what, let's make it the same batter in both cases, with the second being a random sample of the first bunch. Suppose we use exactly the same smoothing parameters for each, without restricting the distribution to be binomial (which technically it should be--more thoughts on this issue here). We then get the comparison below, which is extremely rough and somewhat ugly; keep in mind I am not regressing here, just showing what happens with the different sample sizes:

Because I have not restricted the data to be between 0 and 1, just assume the white splotches are where the probability of a hit on a ball in play is 0% (i.e. white==really really cold zone--I will leave aside the VERY IMPORTANT issue of ensuring the same color scaling on the sidebar for another post!). You can see above that, even though we're looking at the same player, the maps are very different. There are likely many problems here, as we would not expect pitches low and down the middle (remember, this is all within the strike zone) to have almost a 0% chance of becoming a hit. Why? Well, the player above is Albert Pujols. Plus, when we look at the full data on balls in play, we see that the probability is closer to what we would expect (though, according to these data, still a cold zone).

You can also see that one plot shows a hot zone on the outer half, while the subset shows hot zones up and in as well as at the bottom of the zone. This is a result of having very little data in these areas, and it is ultimately overweighted with the given smoothing parameter. If Pujols gets one hit out of two pitches at the knees, it reports his BABIP to be .500 if we do not smooth enough or weight it along with other player data. Of course, we wouldn't expect him to have a .500 BABIP in the future on these pitches. Throw him 1000 pitches there to swing at, and he is really not likely to get 500 hits.
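To put rough numbers on that, here is a minimal Python sketch of regressing a tiny sample toward a baseline. The .300 baseline, the prior weight of 200 balls in play, and the function name are all illustrative assumptions on my part, not values estimated from the data in this post.

```python
def shrunk_babip(hits, balls_in_play, baseline=0.300, prior_weight=200):
    """Pull a raw hit rate toward a baseline rate.

    `baseline` and `prior_weight` are illustrative assumptions: the prior
    acts like `prior_weight` extra balls in play hit at the baseline rate.
    """
    return (hits + baseline * prior_weight) / (balls_in_play + prior_weight)


raw = 1 / 2                      # one hit on two balls in play
small = shrunk_babip(1, 2)       # barely moves off the baseline
large = shrunk_babip(500, 1000)  # a large sample is mostly believed
print(f"raw={raw:.3f} small-sample={small:.3f} large-sample={large:.3f}")
```

Under those assumptions the one-for-two sample comes out at about .302 rather than .500, while an actual 500-for-1000 sample would still be estimated at about .467. How hard to pull toward the baseline is a modeling choice, which is exactly why it needs thought.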

So, with the same smoothing parameters, these plots really are not comparable to one another.
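A rough way to see why: with a fixed smoothing parameter, the estimate at any location leans on far less data in the small sample. Below is a minimal Python sketch using a Gaussian kernel smoother on made-up data. It is not the smoother I used for the plots above; the flat 0.30 hit rate, the 0.25 ft bandwidth, the zone dimensions, and the function names are all assumptions for illustration.

```python
import math
import random


def kernel_weights(x0, z0, points, bandwidth):
    """Gaussian kernel weight of every point relative to location (x0, z0)."""
    return [
        math.exp(-((x - x0) ** 2 + (z - z0) ** 2) / (2 * bandwidth ** 2))
        for x, z, _ in points
    ]


def smooth_rate(x0, z0, points, bandwidth):
    """Kernel-weighted hit rate at (x0, z0) plus the total weight behind it."""
    w = kernel_weights(x0, z0, points, bandwidth)
    total = sum(w)
    rate = sum(wi * hit for wi, (_, _, hit) in zip(w, points)) / total
    return rate, total


random.seed(1)
# Synthetic balls in play: (horizontal ft, vertical ft, hit 0/1), true rate 0.30 everywhere.
full = [(random.uniform(-1, 1), random.uniform(1.5, 3.5), int(random.random() < 0.30))
        for _ in range(1600)]
subset = random.sample(full, 250)

bandwidth = 0.25  # the same smoothing parameter for both samples
for name, data in (("full", full), ("subset", subset)):
    rate, support = smooth_rate(0.0, 1.7, data, bandwidth)
    print(f"{name:6s} rate low-middle = {rate:.3f}, kernel weight behind it = {support:.1f}")
```

The "kernel weight behind it" is a crude effective sample size at that spot. Because the subset is drawn from the full data, its weight is always smaller, so the same bandwidth gives a much noisier estimate there.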

Now, we could reconsider the smoothing parameter for the smaller data set (probably a good idea!). However, the problem is that we don't know at what point of smoothing we're overfitting or underfitting. You can imagine the problem is much more difficult when we are comparing two players against one another.

One way to attack this issue is to use generalized cross-validation (this can be done with the "mgcv" package in R). Using this method, I have found that we need a large sample of pitches, and it really breaks down for the small subset. However, it allows not only a binomial representation of the data (rather than smoothing with an assumed Gaussian distribution), but also optimization of the smoothing parameter, so that we can compare across different sample sizes and distributions of pitches and BABIP.
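To give a feel for the idea of letting the data choose the amount of smoothing, here is a minimal Python sketch. To be clear, this is NOT what mgcv does: it swaps in plain leave-one-out cross-validation with a binomial log-loss on a Gaussian kernel smoother, which is a simpler relative of generalized cross-validation on a penalized spline. The synthetic data, the candidate bandwidths, and the function names are all assumptions.

```python
import math
import random


def loo_log_loss(points, bandwidth):
    """Mean leave-one-out binomial log-loss of a Gaussian kernel smoother."""
    eps = 1e-6  # keep predicted probabilities strictly inside (0, 1)
    loss = 0.0
    for i, (xi, zi, hit) in enumerate(points):
        num = den = 0.0
        for j, (xj, zj, hj) in enumerate(points):
            if i == j:
                continue  # leave the scored observation out of its own fit
            w = math.exp(-((xi - xj) ** 2 + (zi - zj) ** 2) / (2 * bandwidth ** 2))
            num += w * hj
            den += w
        p = min(max(num / den, eps), 1 - eps) if den > 0 else 0.5
        loss -= hit * math.log(p) + (1 - hit) * math.log(1 - p)
    return loss / len(points)


def pick_bandwidth(points, candidates):
    """Return the candidate bandwidth with the lowest leave-one-out log-loss."""
    scores = {h: loo_log_loss(points, h) for h in candidates}
    return min(scores, key=scores.get), scores


random.seed(2)
# Synthetic contact data: hit probability rises from the left to the right edge.
data = []
for _ in range(300):
    x, z = random.uniform(-1, 1), random.uniform(1.5, 3.5)
    data.append((x, z, int(random.random() < 0.2 + 0.2 * (x + 1) / 2)))

best, scores = pick_bandwidth(data, [0.05, 0.2, 0.5, 1.0])
for h, s in scores.items():
    print(f"bandwidth {h:4.2f}: leave-one-out log-loss {s:.4f}")
print("chosen bandwidth:", best)
```

Scoring with log-loss rather than squared error is what respects the hit/no-hit nature of the data. With a small sample the scores for different bandwidths tend to be close together and noisy, which is one way to see why the approach struggles on the small subset.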

Okay, I could go on for a loooooong time and get really "mathy" with the considerations I mention above. However, I'll just point everyone toward the book on Generalized Additive Modeling by Simon Wood (2006). It is honestly one of the best resources I have ever come across in statistics, but to implement this with Pitch F/X you generally need a pretty large data set. One needs to be careful and be sure to fully understand all of the options that can be implemented. Using this method with the right type of data, you can ultimately create something like this (a strike zone map):

Before I get too far off on a tangent, let's return to the initial point of this post: COLOR. For this, I'll stick with strike zone maps.

The first question is: Why in the hell would we want to use color anyway?
The answer is: It can help to communicate muddy scatterplots more easily.

For example, below we have three scatterplots: Called Balls, Called Strikes, and the two combined. It is easy to tell where the definite strikes and definite balls are, but when we overlap the two plots, the strike call likelihood becomes nearly uninterpretable at the edges.

Another consideration--noted by J-Doug--is color blindness. The Green-to-Red plots for BABIP are likely a poor choice (as are the scatter plots shown above!). Many people (about 8% of males) have trouble telling greens from reds. So using these within the same image is a bad idea. One way to evaluate your colors is to see if they are interpretable in black and white. Let's check out the strike zone plot in black and white:

With the colors I use, it seems that for someone with complete color blindness, I have failed this test. However, with some knowledge of a strike zone, this person would be able to understand that the dark within the zone is high strike probability, while the dark outside the zone is low strike probability. They are also able to find that spot where the strike probabilities are changing the most (but this likely isn't satisfactory). Which brings me to my next consideration...
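The black-and-white check can be roughed out numerically. The Python sketch below converts palette colors to an approximate gray level using the Rec. 601 luma weights and asks whether brightness moves in one direction along the palette. The RGB values are hypothetical stand-ins, not the exact colors in my plots.

```python
def luminance(rgb):
    """Approximate perceived brightness (0-255) using Rec. 601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def is_monotonic(values):
    """True if the sequence only rises or only falls."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Hypothetical palettes, listed from low to high value.
palettes = {
    "blue-yellow-red": [(0, 0, 255), (255, 255, 0), (255, 0, 0)],
    "white-to-red": [(255, 255, 255), (255, 128, 128), (255, 0, 0)],
}

for name, colors in palettes.items():
    grays = [round(luminance(c)) for c in colors]
    print(name, grays, "survives grayscale" if is_monotonic(grays) else "ambiguous in grayscale")
```

For these stand-in colors the blue-yellow-red palette goes dark, light, dark (29, 226, 76), so both extremes look dark in gray, while white-to-red only gets darker (255, 166, 76).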

Color is an important factor of your visual depending on what you want to highlight to the reader. In the first heat maps, we may want to be able to read the smoothing across the strike zone at a very granular level. However, for the strike zone map above, we may be more interested in the place where the likelihood of a pitch being called a ball becomes higher than the likelihood of the pitch being called a strike (here, the yellowish-whitish band).
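If the 50% line is the thing to highlight, one option is to pin a distinct middle color to exactly 0.5. Here is a minimal Python sketch of that idea. The three anchor colors and which end is red are assumptions for illustration, and the function names are made up.

```python
def lerp(a, b, t):
    """Linear interpolation between two RGB tuples, rounded to integers."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


def strike_color(p, low=(0, 0, 255), mid=(255, 255, 0), high=(255, 0, 0)):
    """Color for a called-strike probability p, with the midpoint pinned at 0.5."""
    p = min(max(p, 0.0), 1.0)
    if p < 0.5:
        return lerp(low, mid, p / 0.5)
    return lerp(mid, high, (p - 0.5) / 0.5)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, strike_color(p))
```

The trade-off is the one you would expect: the band at 50% pops out, but the blend on either side of it is less smooth than a single-color ramp.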

When considering interest in the gradual changes across a heatmap, I find it a good idea to use a single color. This way, there is not this "breakpoint" from red-to-blue or from green-to-yellow. The same color gets lighter and lighter as you go. Below I have an example of using the same color for determining densities of called strike locations (i.e. where they are thrown least or most). Here, I use a "red-to-white" palette and then switch it to "white-to-red".
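A single-color ramp is simple to build by hand. The Python sketch below is one way to do it, assuming values already scaled to the 0-1 range; the function name and the choice of pure red are mine for illustration.

```python
def white_to_red(value, reverse=False):
    """Map a value in [0, 1] to an (r, g, b) tuple on a single-hue ramp.

    value 0 -> white and value 1 -> full red; reverse=True flips the ramp.
    """
    t = min(max(value, 0.0), 1.0)
    if reverse:
        t = 1.0 - t
    fade = round(255 * (1.0 - t))  # green and blue fall together, so the hue never changes
    return (255, fade, fade)


densities = [0.0, 0.25, 0.5, 0.75, 1.0]
print([white_to_red(d) for d in densities])
print([white_to_red(d, reverse=True) for d in densities])
```

Because only the lightness changes, there is no point along the scale where the eye sees a jump from one color family to another.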

I would love to get comments on others' opinions about and experiences with using color. I have not gone too in-depth, but I hope to follow up with a number of examples of color use for the same image and how this can allow highlighting certain areas of a visual. Also, with feedback, we can try and develop a consensus on what the optimal choices are for the majority.


  1. Great job, Millsy.

    I'd like to add that most people who are color blind can still see plenty of color, but we have some trouble with different shades (usually, but not always, red and green).

    There's a neat site called Vischeck that allows you to upload an image, or scan a whole website, and see what it looks like to people with different sorts of color vision impairment:

  2. Cool website. To clarify, I did not mean to imply that color blindness necessarily meant black and white vision. I think it can simply be used as a "worst case scenario" check. Thanks for the input and the link. I'll have to play around with that website!

  3. Yeah, I knew what you were getting at. B/W is definitely a worst case check, and most R/G confusion can be spotted by using it.

    Another easy check is, if you have graphic editing software, slowly reducing the saturation and seeing if this takes away information from your graphic.

  4. I would rather interpret an MRI or a CT scan.

  5. Anonymous,

    I assume you are referring to the first two plots (Pujols data). I certainly don't disagree with you. Those plots are just really not all that useful.

    As for the umpire visuals, I think they're relatively straightforward.

  6. Just reading this now. Thanks for taking the time to talk about this. I did some terrible, terrible things with heatmaps when I first started making them a long time ago, but thanks to you I'm [hopefully] past that.

    What is your opinion on the default ggplot2 color palette? I like the default colors for heatmaps/geom_tile (red and blue), but I don't think they visualize small differences well, so I often change the palette to either red + white + blue or black and white.

  7. Hey Josh,

    I actually don't know much about ggplot2. I have stuck to base graphics in R, much to the dismay of many R users I have met. I'm somewhat agnostic on ggplot2. I hear it's easier to use but don't really know the code. The graphs look nice, and I think the default colors are okay. Do you have a specific example of it? My answer would probably be that it depends on what they're being used for.

  8. I love ggplot2. Some things are definitely much easier than base or lattice. For example, I made the first plot in this post with just 3 or 4 lines of code. ggplot2 has its own two-dimensional density estimation method built into a plotting function (stat_density2d), so there is no need to do kde2d + filled.contour. I guess you can do a one-liner with smoothScatter, but I don't think the results look as good, and it has fewer options.

    You can also easily make one plot conditioned on multiple variables. Just add facet_wrap(~variable to facet by) and you have the same graph made for multiple subsets of data. So yeah, I love ggplot2 and I'd highly recommend it. I think it looks great and it makes more sense to me. I'd use it for everything if I could (can't do 3d though).

    Sorry about that ggplot2 rant. Back to the color question:

    The first plot here has the default colors:

    I think it looks nice, but color separation looks limited to me. Medium density locations are hard to pick out.

    Here I changed the default to a red + white + blue for a pitch density plot:

    I think it shows the separation better, but this graph also has some smoothing issues because it's the difference between two kde2d estimates.

    Is one of these color palettes better? I'm not sure. I also might use the black and white one more.

  9. Ah. Yes, in the Colon plot I really like those colors. I think there could be a bit more light-to-dark contrast, though. It's always good to use colors that blend with one another (i.e. not complementary colors). Two primary colors is usually a good bet. That way, you either have no red OR no green in your plot.

    Like I said above, I use the Yellow between the red and blue in my plots in order to highlight the 'breakpoint' at which strike percent goes from above 50% to below 50%. Unfortunately, this does limit the blending smoothness throughout.

    I think some middle-ground between the Colon and Pujols plots would be ideal from those examples. The Pujols plot is almost too light, while the other--as you say--makes it difficult to interpret the middling values in the plot.

    But of course, there's also plenty of opinion in those statements. Keep up the good work. I've enjoyed watching you expand the use of mgcv and other packages to look at area of swing/umpire/contact zones.
