Thursday, March 31, 2011

Data Quality in Pitch F/X

This post stems from a discussion with Mike Fast about the quality of the "sz_top" and "sz_bot" variables in Pitch F/X data. I had been using these to define the strike zone for the calculations in my past posts. I want to thank Mike for being generous with his time in answering my questions and keeping me from publicly writing something stupid.

I was aware that the lines drawn for the top and bottom of the zone were somewhat inaccurate. What I did not count on was that this variation could systematically bias findings across years. Normally we would expect these measurement errors to be random (for a given player, not across players). In theory, random measurement errors are fine: they make the data noisy, but they should not bias our measurements, and with a large enough sample they can mostly be ignored for certain calculations.
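To make the distinction between random and systematic error concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the "true" top of zone (3.5 feet), the size of the noise, and the size of the shift are made-up numbers, not values taken from Pitch F/X.

```python
import random
import statistics


def simulate_sz_top(true_top=3.5, noise_sd=0.15, shift=0.0, n=100_000, seed=1):
    """Simulate n recorded sz_top values (in feet) for one hypothetical batter.

    true_top, noise_sd and shift are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    return [true_top + shift + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Case 1: purely random, zero-mean error around the true value.
random_only = simulate_sz_top()

# Case 2: the same random error plus a constant 0.2 ft shift, such as a
# change in how the line is drawn from one season to the next.
with_shift = simulate_sz_top(shift=0.2)

mean_random = statistics.fmean(random_only)
mean_shift = statistics.fmean(with_shift)

print(f"random error only: mean sz_top = {mean_random:.3f} ft")
print(f"random + shift:    mean sz_top = {mean_shift:.3f} ft")
```

With a large sample the random-only average lands very close to the true value, while the shifted average stays off by roughly the size of the shift no matter how many pitches are added. Note that this only covers averages; it does not say how either kind of error feeds through to called-strike percentages.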

But over time, this just doesn't seem to be the case. This is the main reason I took down the data from my last post (I'll update it and repost it as soon as I can). The problem shows up as a correlation between where the top and bottom of the zone are designated and the percentage of pitches WITHIN the zone that are called strikes. That's no surprise on its own. Normally I wouldn't worry much about it, since we'd expect it to be simple noise, and we'd expect a roughly uniform change both inside and outside the zone when the size of the strike zone changes.

However, the interesting part is that it seems to have minimal effect, if any, on the pitches outside the zone that are correctly called balls. I'm still not sure why this is the case. We'd expect that fixing the zone would similarly affect the percentage of correctly called pitches both within and outside the zone. After all, any pitches that are no longer 'outside' the zone MUST be 'within' the zone, though the effect should be somewhat smaller on the 'outside zone' data because there are more pitches outside the zone than within it. The only thing I can think of is that it's a sample size issue: there are just under 3 times as many pitches outside the rulebook zone as inside it. But I can't imagine this accounts for such a huge change in one and almost no change in the other.
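The sample size argument can be worked through with some arithmetic. The sketch below uses entirely hypothetical counts (1,000 pitches inside the zone, 2,900 outside, and 100 borderline pitches that get reclassified when the zone is widened); none of these numbers come from the actual data. It only illustrates how moving the same pitches changes the two percentages by different amounts because the denominators differ.

```python
def rates_after_widening(in_total, in_strikes, out_total, out_balls,
                         moved, moved_balls):
    """Reclassify `moved` pitches from outside the zone to inside it.

    `moved_balls` of the moved pitches were called balls; the rest were
    called strikes. Returns (correct rate inside, correct rate outside).
    """
    new_in_total = in_total + moved
    new_in_strikes = in_strikes + (moved - moved_balls)  # strikes stay correct
    new_out_total = out_total - moved
    new_out_balls = out_balls - moved_balls              # lose these correct balls
    return new_in_strikes / new_in_total, new_out_balls / new_out_total


# Hypothetical counts: just under 3 times as many pitches outside as inside.
in_total, in_strikes = 1000, 800     # 80% of in-zone pitches called strikes
out_total, out_balls = 2900, 2610    # 90% of out-of-zone pitches called balls

before_in = in_strikes / in_total
before_out = out_balls / out_total

# Assume 100 borderline pitches move inside, 70 of which were called balls.
after_in, after_out = rates_after_widening(
    in_total, in_strikes, out_total, out_balls, moved=100, moved_balls=70
)

print(f"within zone:  {before_in:.3f} -> {after_in:.3f}")
print(f"outside zone: {before_out:.3f} -> {after_out:.3f}")
```

Under these made-up numbers the within-zone rate moves by several percentage points while the outside-zone rate moves by less than one. The borderline pitches are called differently from the rest of each group, and they are a bigger share of the smaller group. Whether an effect like this is large enough to explain the real data is exactly the open question.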

With that said, I thought I would provide some data for those looking to work with these variables in the Pitch F/X data. In the file linked here, I have calculated the average top and bottom of the zone for each player in each year, along with the standard deviation. The data are given in both feet and inches. Below, I also show the range of values for sz_top for Bobby Abreu in 2007, 2009 and 2010 (I skip 2008 for now). Finally, I give a distribution of the standard deviations of the measurement by player from 2007 through 2010. From the looks of things, something changed in mid-2007 in how the top of the zone was designated (notice the bimodal distribution).
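For anyone who wants to build a similar player-by-year summary from their own pitch records, here is a minimal sketch using only the Python standard library. The record layout (a list of dicts with "batter", "year", "sz_top" and "sz_bot" keys), the batter labels, and the values are all hypothetical; this is not the code that produced the file, and it assumes sz_top and sz_bot are recorded in feet.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical pitch records; values are made up for illustration.
pitches = [
    {"batter": "A", "year": 2007, "sz_top": 3.4, "sz_bot": 1.6},
    {"batter": "A", "year": 2007, "sz_top": 3.6, "sz_bot": 1.5},
    {"batter": "A", "year": 2007, "sz_top": 3.5, "sz_bot": 1.7},
    {"batter": "A", "year": 2009, "sz_top": 3.3, "sz_bot": 1.6},
    {"batter": "B", "year": 2007, "sz_top": 3.8, "sz_bot": 1.8},
    {"batter": "B", "year": 2007, "sz_top": 3.7, "sz_bot": 1.9},
]


def summarize_zone(records):
    """Mean and sample SD of sz_top and sz_bot per (batter, year)."""
    groups = defaultdict(lambda: {"sz_top": [], "sz_bot": []})
    for rec in records:
        key = (rec["batter"], rec["year"])
        groups[key]["sz_top"].append(rec["sz_top"])
        groups[key]["sz_bot"].append(rec["sz_bot"])

    summary = {}
    for key, columns in groups.items():
        row = {}
        for name, values in columns.items():
            avg = mean(values)
            # Sample SD needs at least two pitches; report 0.0 otherwise.
            sd = stdev(values) if len(values) > 1 else 0.0
            row[f"{name}_mean_ft"] = avg
            row[f"{name}_sd_ft"] = sd
            row[f"{name}_mean_in"] = avg * 12  # feet to inches
            row[f"{name}_sd_in"] = sd * 12
        summary[key] = row
    return summary


summary = summarize_zone(pitches)
for (batter, year), row in sorted(summary.items()):
    print(batter, year,
          f"top = {row['sz_top_mean_ft']:.2f} ft (sd {row['sz_top_sd_ft']:.2f})",
          f"bot = {row['sz_bot_mean_ft']:.2f} ft (sd {row['sz_bot_sd_ft']:.2f})")
```

A batter-year with an unusually large standard deviation, or one whose values cluster around two distinct levels, is a candidate for the kind of mid-season change described above.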

Anyway, just a heads up. Like I said, I'm still not clear on why this is systematically changing the Within Zone tabulations but NOT the Outside Zone tabulations. I'll post the file once I figure out what is going on.

