Saturday, October 29, 2011

Maximizing Sabermetric Visual Content: Smooth Comparisons and Leveraging Color

A recent post by Mike Fast got me thinking a bit more about color. For most people, color is a secondary concern at best, but I am here to tell you it should be a primary one in your statistical presentations. This is especially true when analyzing the strike zone.

Before you begin reading this, please read Mike's excellent post over at Baseball Prospectus. Then, go ahead and read this article at Praiseball Bospectus (linked not because of its title--I am very glad Dave Allen was born--but because it really does highlight some issues with things you'll find around the net).

Okay, now that you have read those, here are my additional comments. First, heat maps should be approached with caution. This is true whether you are smoothing or simply breaking the zone up into smaller discrete areas. Mike covers this well, but I will take it a little further with respect to smoothing.

When you use a smoothing technique, you really need to understand what it is doing. I'm not going to fully describe loess techniques (or smoothing splines, kernel density estimators, etc.); there are plenty of resources online. Oftentimes, the degree of smoothing is up to the researcher. However, in baseball analysis it is almost always the case that we want to compare one smoothed representation or heat map to another, and this is where things get tricky. You'll need to make sure you are not undersmoothing (too wiggly) or oversmoothing (not wiggly enough).
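To make the wiggliness point concrete, here is a minimal sketch in R using simulated pitch locations (the names px, pz, and hit are made up for illustration, not real Pitch F/X data); the only thing to take from it is how the span argument in loess() controls the degree of smoothing:

```r
## Simulated stand-ins for pitch location and a hit/out indicator -- not real data.
set.seed(1)
n   <- 1600
px  <- runif(n, -1, 1)      # horizontal location (ft from the middle of the plate)
pz  <- runif(n, 1.5, 3.5)   # vertical location (ft off the ground)
hit <- rbinom(n, 1, plogis(-1 + 1.5 * px))   # toy hit probability

## Same data, two different smoothing parameters:
## a small span gives a wigglier surface, a large span a smoother one.
fit_wiggly <- loess(hit ~ px * pz, span = 0.2)
fit_smooth <- loess(hit ~ px * pz, span = 0.9)

## Predict both surfaces over a grid so they can be mapped and compared.
grid <- expand.grid(px = seq(-1, 1, length = 50),
                    pz = seq(1.5, 3.5, length = 50))
surf_wiggly <- predict(fit_wiggly, grid)
surf_smooth <- predict(fit_smooth, grid)
```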

Sample size is of course the first issue. If you are going to present BABIP by pitch location for a single batter or pitcher, you are likely going to need to regress the data a lot: pitch data are extremely noisy. Second, you really need to account for the fact that the batter CHOSE to swing at those pitches. The pitches a batter swings at are a distribution nested within all pitches thrown, and the pitches he actually makes contact with are yet another subset of that distribution. Sometimes this is no big deal. Other times, it is an extremely big deal.

Let's ignore the second issue above and focus first on sample size and comparisons, restricting ourselves to evaluating the likelihood of a hit on contacted pitches. Say one batter has made contact with about 1,600 pitches in our sample, while another has made contact with about 250. Let's also ignore the 'regression to the mean' issue here. In fact, let's make it the same batter in both cases, with the second set being a random sample of the first. If we use exactly the same smoothing parameters for each (with no restriction forcing the distribution to be binomial, which technically it should be--more thoughts on this issue here), we get the following comparison (extremely rough, and somewhat ugly--keep in mind I am not regressing here, just showing what happens with the different sample sizes):

[Figures: smoothed BABIP-on-contact maps for the full sample (~1,600 contacted pitches) and for the 250-pitch subsample, fit with identical smoothing parameters.]

Because I have not restricted the data to be between 0 and 1, just assume the white splotches are where the probability of a hit on a ball in play is 0% (i.e., white == really, really cold zone--I will leave aside the VERY IMPORTANT issue of ensuring the same color scaling on the sidebar for another post!). You can see above that, even though we're looking at the same player, the two maps are very different. There are likely many problems here, as we would not expect pitches low and down the middle (remember, this is all within the strike zone) to have almost a 0% chance of falling for a hit. Why? Well, the player above is Albert Pujols. Plus, when we look at the full data on balls in play, we see that the probability is closer to what we would expect (though, according to these data, still a cold zone).

You can also see that one plot shows a hot zone on the outer half, while the subset shows hot zones up and in as well as at the bottom of the zone. This is a result of having very little data in these areas, which ends up overweighted with the given smoothing parameter. If Pujols gets one hit on two pitches at the knees, the map will report his BABIP there as .500 unless we smooth enough or weight it along with other player data. Of course, we wouldn't expect him to post a .500 BABIP on those pitches going forward: throw him 1,000 pitches there to swing at, and he is really not likely to get 500 hits.

So, with the same smoothing parameters, these plots really are not comparable to one another.
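For what it's worth, here is a rough sketch of that comparison with simulated stand-in data (again, not real pitch data): the exact same span applied to a full sample and to a 250-pitch random subsample of it.

```r
## Simulated stand-in for ~1,600 contacted pitches.
set.seed(2)
full <- data.frame(px = runif(1600, -1, 1), pz = runif(1600, 1.5, 3.5))
full$hit <- rbinom(1600, 1, plogis(-1 + full$px))

## A 250-pitch random subsample of the same data.
sub <- full[sample(nrow(full), 250), ]

## Identical smoothing parameters for both samples.
fit_full <- loess(hit ~ px * pz, data = full, span = 0.5)
fit_sub  <- loess(hit ~ px * pz, data = sub,  span = 0.5)

## The subsample's fitted surface chases noise in thinly populated areas of the
## zone, which is why the two maps end up looking so different.
```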

Now, we could reconsider the smoothing parameter for the smaller data set (probably a good idea!). The problem, however, is that we don't know at what point we are over- or under-smoothing. You can imagine the problem is much more difficult when we are comparing two different players against one another.

One way to attack this issue is to use a generalized cross-validation technique (this can be done with the "mgcv" package in R). Using this method, I have found that we still need a large sample of pitches--it really breaks down for the small subset. However, it allows not only for a binomial representation of the data (rather than smoothing under an assumed Gaussian distribution), but also for optimizing the smoothing parameter so that we can compare across different sample sizes and distributions of pitches and BABIP.
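As a rough illustration of what that looks like in code (on simulated stand-in data, with variable names of my own choosing), a binomial GAM in mgcv with GCV-chosen smoothing parameters might be fit something like this:

```r
library(mgcv)

## Simulated stand-in for balls-in-play data -- not real Pitch F/X.
set.seed(3)
n   <- 5000
bip <- data.frame(px = runif(n, -1, 1), pz = runif(n, 1.5, 3.5))
bip$hit <- rbinom(n, 1, plogis(-0.5 + bip$px - 0.5 * (bip$pz - 2.5)^2))

## Binomial GAM with a tensor-product smooth over pitch location;
## the degree of smoothing is chosen by generalized cross-validation.
fit <- gam(hit ~ te(px, pz), family = binomial, method = "GCV.Cp", data = bip)

summary(fit)   # the smooth's effective degrees of freedom show how wiggly the surface is
```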

Okay, I could go on for a loooooong time and get really "mathy" with the considerations I mention above. Instead, I'll just point everyone toward the book on Generalized Additive Modeling by Simon Wood (2006). It is honestly one of the best resources I have ever come across in statistics, but to implement this with Pitch F/X you generally need a pretty large data set, and you need to be careful to fully understand all of the options that can be implemented. Using this method with the right type of data, you can ultimately create something like this (a strike zone map):

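(Continuing the hypothetical fit from the sketch above, turning its predictions into a map of this sort might look like the following; the palette and layout here are arbitrary and do not match my actual figure.)

```r
## Predict the fitted probability surface on a regular grid over the zone.
grid <- expand.grid(px = seq(-1, 1, length = 100),
                    pz = seq(1.5, 3.5, length = 100))
grid$p <- predict(fit, newdata = grid, type = "response")

## Draw the surface as a heat map with contour lines on top.
image(x = unique(grid$px), y = unique(grid$pz),
      z = matrix(grid$p, nrow = 100),
      col = heat.colors(50),
      xlab = "Horizontal location (ft)", ylab = "Height (ft)")
contour(x = unique(grid$px), y = unique(grid$pz),
        z = matrix(grid$p, nrow = 100), add = TRUE)
```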
Before I get too far off on a tangent, let's return to the initial point of this post: COLOR. For this, I'll stick with strike zone maps.

The first question is: Why in the hell would we want to use color anyway?
The answer is: It can communicate patterns that a muddy scatterplot obscures.

For example, below we have three scatterplots: Called Balls, Called Strikes, and the two combined. It is easy to tell where the definite strikes and definite balls are, but when we overlap the two plots, the likelihood of a strike call becomes nearly uninterpretable at the edges.


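To see why the combined plot gets muddy, here is a toy version with simulated called pitches (the locations, colors, and zone cutoffs are all made up for illustration):

```r
## Simulated called pitches: mostly strikes inside a rough zone, mostly balls
## outside, with noisy calls near the edges.
set.seed(4)
n  <- 3000
px <- runif(n, -2, 2)
pz <- runif(n, 0.5, 4.5)
in_zone <- abs(px) < 0.83 & pz > 1.5 & pz < 3.5
strike  <- rbinom(n, 1, ifelse(in_zone, 0.9, 0.1)) == 1

## The separate plots are readable; the combined plot piles points on top of
## one another right where the interesting changes happen -- at the edges.
par(mfrow = c(1, 3))
plot(px[!strike], pz[!strike], pch = 16, col = "red",
     xlim = c(-2, 2), ylim = c(0.5, 4.5), main = "Called balls")
plot(px[strike], pz[strike], pch = 16, col = "blue",
     xlim = c(-2, 2), ylim = c(0.5, 4.5), main = "Called strikes")
plot(px, pz, pch = 16, col = ifelse(strike, "blue", "red"),
     xlim = c(-2, 2), ylim = c(0.5, 4.5), main = "Combined")
```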
Another consideration--noted by J-Doug--is color blindness. The green-to-red plots for BABIP are likely a poor choice (as are the scatterplots shown above!). Many people (about 8% of males) are unable to distinguish greens from reds, so using both within the same image is a bad idea. One way to evaluate your colors is to check whether they are still interpretable in black and white. Let's check out the strike zone plot in black and white:


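(Incidentally, if you want to automate this kind of check rather than eyeballing a grayscale export, one rough approach is to convert each palette color to its luminance; the green-to-red palette below is a hypothetical example, not the exact colors in my plots.)

```r
## A hypothetical green-to-red palette.
pal <- colorRampPalette(c("darkgreen", "yellow", "red"))(9)

## Approximate luminance of each color (its black-and-white value).
rgbm <- col2rgb(pal) / 255
lum  <- 0.299 * rgbm["red", ] + 0.587 * rgbm["green", ] + 0.114 * rgbm["blue", ]

## Colors with nearly identical luminance will be indistinguishable in black
## and white (and likely hard to separate for some color-blind readers).
round(lum, 2)
barplot(rep(1, length(pal)), col = gray(lum), border = NA)   # the palette as it reads in grayscale
```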
With the colors I use, it seems that for someone with complete color blindness, I have failed this test. However, someone with some knowledge of the strike zone could still work out that the dark area within the zone means a high strike probability, while the dark area outside the zone means a low strike probability. They could also find the spot where the strike probabilities are changing the most (though this likely isn't satisfactory). Which brings me to my next consideration...

Color is an important part of your visual, and the right choice depends on what you want to highlight for the reader. In the first heat maps, we may want to read the smoothed surface across the strike zone at a very granular level. For the strike zone map above, however, we may be more interested in where the likelihood of a pitch being called a ball becomes higher than the likelihood of it being called a strike (here, the yellowish-whitish band).

When the interest is in gradual changes across a heat map, I find it a good idea to use a single color. That way, there is no "breakpoint" from red to blue or from green to yellow; the same color simply gets lighter and lighter as you go. Below is an example of using a single color to show the density of called strike locations (i.e., where they are thrown most and least often). Here, I use a "red-to-white" palette and then switch it to "white-to-red".


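Building and flipping a single-hue palette like this in R is straightforward; here is a sketch on a simulated density of called strike locations (not my actual data):

```r
library(MASS)   # for the 2-D kernel density estimate

## Simulated called-strike locations.
set.seed(5)
px <- rnorm(2000, 0, 0.7)
pz <- rnorm(2000, 2.5, 0.5)
dens <- kde2d(px, pz, n = 80)

## One hue, varying lightness: no 'breakpoint' anywhere in the scale.
red_to_white <- colorRampPalette(c("red", "white"))(100)

par(mfrow = c(1, 2))
image(dens, col = red_to_white)        # densest areas fade to white
image(dens, col = rev(red_to_white))   # flipped: densest areas are deep red
```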
I would love to hear others' opinions about and experiences with using color. I have not gone too in-depth here, but I hope to follow up with a number of examples of color use for the same image and how this can highlight certain areas of a visual. With feedback, we can try to develop a consensus on what the optimal choices are for most situations.

Thursday, October 27, 2011

Article at JQAS: Baseball Hall of Fame Voting

The newest issue of the Journal of Quantitative Analysis in Sports was published today, and it features a number of interesting articles. In the spirit of shameless self-promotion, I would like to highlight the following article:

Using Tree Ensembles to Analyze National Baseball Hall of Fame Voting Patterns: An Application to Discrimination in BBWAA Voting
Brian M. Mills and Steven Salaga

The link above should be ungated. If it is not, please let me know and I can share the article. This is my first first-author academic publication, so go easy on me (and Steve). If you read my recent post about our joint poster at the 2011 Joint Statistical Meetings in Miami, this analysis should sound rather familiar. Please leave questions or feedback in the comments, or feel free to shoot me an email.

We actually began this work a while back as a class project and decided to turn it into an academic paper with some guidance and encouragement from our adviser. A paper using the same technique came out last year (Freiman, 2010), which gave us a chance to add to that work by including pitcher predictions and extending it into the economic literature on discrimination in Hall of Fame voting. Our work differs somewhat from Freiman's, and this is explained within the paper. In fact, a (very) preliminary version of the work was on this website a while back; however, after the Freiman paper was published, I was a bit worried about getting scooped even further (no foul play there--we just happened to be doing a very similar analysis at the same time).

Of course, R was used exclusively for the analysis. Also, you may note that some familiar names are cited. These include Cy Morong, Bill James, Jayson Stark, Peter Gammons, Tom Verducci, Chris Jaffe and, yes, Tom Tango (related to the Tim Raines site, of course).

If you have criticisms, please present them respectfully, and keep in mind that we don't think this analysis (or ANY analysis) is the last word on any issue. Also keep in mind that the future predictions are based only on statistics through 2009 (without career projections), so they predict future induction under the assumption of retirement after the 2009 season. But it was a lot of fun, and it shows some promising results for using the technique in sports prediction. There is a lot of Hall of Fame voting literature out there, and this is another addition to it. Hopefully we can have a comprehensive model of hockey players soon, too.

Tuesday, October 25, 2011

Sabermetrics Meets R Meetup

I just ran across this post at Big Computing. On November 14th, there will be an R User meet-up in Washington, DC (Tyson's Corner) led by Mike Driscoll about using R for sabermetric analysis (linked here). I will actually be home in Maryland for a couple of weeks, and likely in DC that Monday, so there's a good chance I will try to stop by. If anyone else is in the area and would like to come by, let me know--I always enjoy meeting fellow statistics/sports dorks. I imagine this will be a great extension to the tutorials I have had here, coming from someone with much more expertise in statistics and statistical programming than I have.

Hat Tip: Kirk Mettler

Thursday, October 13, 2011

Insane Musings on Realignment

I was back in Maryland this past weekend for a wedding and to visit my fiancee's family. Her father is a massive G-Town fan and graduate, and has served on the admissions board and academic advisory committee there. He drives two hours each way to go to all of the basketball games. I get blasted for not screaming and cheering when I go, but it's all in good fun.

He's disgusted by the Big East's inability to hang on to its big-name schools in recent years, and worries that Georgetown is going to have difficulty recruiting without the big-name FBS schools in the conference.

This got me thinking: football is definitely a big winner in realignment, but there are a lot of basketball fans out there, too. Smaller alumni bases make it difficult to estimate a television contract, but I would not be surprised to see basketball-only schools (and perhaps Notre Dame, minus football) realign to form their own national basketball mid-major powerhouse conference. There are endless possibilities, but I see the following schools fitting together nicely in a conference like this (or you could simply realign into a Catholic basketball conference with many of them):

Georgetown
Notre Dame
Villanova
St. John's
Providence
Gonzaga
DePaul
Xavier
Marquette
Butler
St. Mary's
Temple
Duquesne
Old Dominion
Creighton
Memphis

And possibly:
Davidson
Seton Hall
George Washington
George Mason
Richmond (at the suggestion of Brian in the comments)

Obviously, this depends on whether schools like UConn, Louisville, and West Virginia have enough clout to pull in significant conference revenue on the basketball side (perhaps basketball and football get some kind of package deal for the conference?). But I wouldn't be surprised to see something like this happen. Realigning so that there is still high-quality competition within the conference could help all of these schools recruit. Notre Dame would more likely end up joining a BCS-level conference. Georgetown would obviously be the big wild card in whether something like this happens; they may have a lot of pride and not want to stray from the BCS-type schools. I really don't know. I think it would be fun to watch, though.

Then again, maybe (probably) it's a silly idea. What say you?