There's a discussion that piques my interest going on over at The Book Blog involving the economics of fantasy valuation. I haven't chimed in at this point, but it's certainly a problem I've also thought about when doing my fantasy valuations. I talked to my adviser about this yesterday at the bar and convinced him that this is an interesting and worthwhile topic, but fantasy sports--once the forecast is made--is simply a linear programming problem. The one who spends the most time solving that optimization problem usually wins. But, of course, the question is: what is the answer? And I think the answer in the context of fantasy is: it depends.

There's plenty of game theory that goes on. Depending on who you play with, there's a good chance that each of the other owners' behavior needs to be an input in the optimization problem. Because you don't know what others will do, in theory you should be bidding your valuation (assuming a sealed-bid, second-price auction). An English auction (the standard fantasy auction) is a little different, as you get information about other owners during the process. But passing up buying someone now is not independent of what happens later on. And that's where the difficulty comes in with solving the problem.
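To see why bidding your valuation makes sense in the sealed-bid, second-price case, here is a minimal sketch in R. The valuation of $30 and the range of rival bids are made-up numbers, and `payoff` is just an illustrative helper, not part of any package.

```r
# Payoff in a sealed-bid, second-price auction: I win only if my bid beats
# the highest rival bid, and then I pay the rival's bid (not my own).
payoff <- function(bid, value, rival) ifelse(bid > rival, value - rival, 0)

value  <- 30                  # hypothetical: what the player is worth to me
rivals <- seq(0, 60, by = 1)  # hypothetical: possible highest rival bids

truthful <- payoff(30, value, rivals)  # bid exactly my valuation
shaded   <- payoff(20, value, rivals)  # bid below my valuation
overbid  <- payoff(40, value, rivals)  # bid above my valuation

# Average payoff over the rival bids for each strategy.
c(shaded = mean(shaded), truthful = mean(truthful), overbid = mean(overbid))
```

For every rival bid, the truthful bid does at least as well as shading or overbidding: shading loses players you would have profited on, and overbidding wins players at a loss. None of this carries over cleanly to the English auction, which is the whole problem.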

Assuming everyone has exactly the same valuations, there is still room for zigging and zagging. Ultimately, if everyone pays exactly what the same price guide says each player is worth, then in the end the winner of the league simply results from luck. Essentially, if you really want to win fantasy and win some money, you should prey on fans who draft specific players just for the sake of having them, those who think they just have a 'knack' for knowing what is going to happen, etc.

The institution of a keeper league, minor leaguers, contracts, etc. simply makes this optimization problem more complex and more difficult to solve, and adds to the variability, given the need for long-term risk assessment (or forecasting). At this point, if everyone simply uses some variant of the z-score method with replacement at $1, then it doesn't matter much what the exact answer to the problem is unless you are the only one who knows it. Otherwise, the equilibrium just shifts to that method, and everyone does the same thing. What we do know (or at least what I know from my experience) is that at this point, the z-score method seems sufficient to win in roto leagues. But that's because the majority of people use it. I haven't seen any place give drastically different valuations for the same projections (in fact, they're all almost exactly the same...except for ESPN with their insane premium for stolen bases).
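For anyone who hasn't seen it, here is a rough sketch of what a z-score valuation with replacement at $1 can look like in R. The projections, the number of rostered players, and the budget are all made up for illustration, and real versions usually weight rate stats like AVG by at-bats, which I skip here.

```r
# Hypothetical projections for a tiny hitter pool (made-up numbers).
proj <- data.frame(
  player = c("A", "B", "C", "D", "E", "F"),
  HR     = c(40, 30, 25, 15, 10, 5),
  SB     = c(5, 10, 30, 20, 2, 15),
  AVG    = c(.270, .300, .280, .260, .250, .240)
)
cats       <- c("HR", "SB", "AVG")
n_rostered <- 4    # assumed: players bought league-wide
budget     <- 100  # assumed: total auction dollars in the league

# z-score each category (mean 0, sd 1) and add them up per player.
proj$z_total <- rowSums(scale(proj[, cats]))

# Replacement level is the best player who does NOT get rostered.
ranked      <- proj[order(-proj$z_total), ]
replacement <- ranked$z_total[n_rostered + 1]

# Everyone rostered costs at least $1; the remaining dollars are split in
# proportion to value above replacement.
rostered         <- ranked[seq_len(n_rostered), ]
rostered$var     <- rostered$z_total - replacement
rostered$dollars <- 1 + (budget - n_rostered) * rostered$var / sum(rostered$var)

rostered[, c("player", "z_total", "dollars")]
```

Since everyone starts from roughly this recipe, two owners with the same projections end up with nearly the same price sheet, which is the equilibrium I'm describing above.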

I suspect there is someone who knows something close to the answer to the optimization problem with the game theory and psychological aspect included: Eriq Gardner at Fantasy Ball Junkie and Bloomberg Sports. I only say this because, playing with him in our 20-team, H2H, 8x8 league for a few years, he won the league the first 3 years straight, and came in 2nd this last year. An incredible feat, considering this is a keeper league and others recharged their keepers each year, and H2H in the playoffs can be extremely variable. Perhaps my buddies and I in the league just aren't very good, but it's certainly the most competitive league I've played in. I'll likely join another league with Eriq this year, so we'll see if I really know my stuff this time around.

One thing I have learned from Eriq is that the best way around a complex linear optimization is to find gaps in insufficient rule structures. Always look for inefficiencies that are based on the rules. If you can do this, you'll probably increase your chances of winning anywhere from 50% to 100% of what they were to begin with (and possibly much more). This is the best way to get close to the optimal answer to the problem. Standard leagues are well-known, so this is more difficult. However, I think the discussion at Tango's blog gets to the extreme inefficiency I saw in valuations this past year: Starting Pitching. While not discussed directly, the sums spent on the likes of Tim Lincecum and other established starters put a lot of people in a hole to begin their auctions. I discussed this a lot this past fantasy season, putting together a staff of:

Josh Johnson

Ubaldo Jimenez

Yovani Gallardo

Clayton Kershaw

Francisco Liriano

Shaun Marcum

Colby Lewis

Phil Hughes

Neftali Feliz

ALL for the price that Lincecum went for in that same league. So, I guess my answer to the question is: look for loopholes. Solving the question of the perfect fantasy roster using z-scores may not give you a look into this. My own adjusted z-score method did some of this. Other places tend to use some sort of exponent for the top level players, then do the valuations. I think that in a lot of cases this is incorrect. This can be very true for relievers as well, as their ERA and WHIP help in daily lineups is invaluable for how amazingly cheap they go.

One last addition as to why the z-score method can falter: it is not dynamic. Remember that in a rank-based Roto system, you have diminishing returns to adding to a category. If you already have 320HR on your team, then Adam Dunn isn't worth as much as he would be if you have 100 HR. Same goes for every category. This is the theory behind my simulations for H2H roto league over at FBJ (and cross-posted here). My hope is to still come up with full valuations based on that method if I have the time to do so.
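A quick sketch of the diminishing-returns point, using made-up home run totals for the other nine teams in a hypothetical 10-team roto league (`roto_points` and `gain` are illustrative helpers, and ties are ignored):

```r
# Hypothetical HR totals for the other 9 teams in a 10-team roto league.
others <- c(110, 125, 140, 155, 170, 185, 200, 215, 230)

# Roto points in one category: 1 point plus 1 for every team you beat.
roto_points <- function(my_total, others) 1 + sum(my_total > others)

# Standings points gained by adding a 40-HR hitter to a given baseline.
gain <- function(base, add = 40) {
  roto_points(base + add, others) - roto_points(base, others)
}

gain(100)  # a weak-power team moves up the standings
gain(320)  # a team already lapping the field gains nothing
```

The same 40 home runs are worth standings points to one team and nothing to the other, which a static z-score never picks up.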

Here's the link. I'm also interested to hear what people say:

http://www.insidethebook.com/ee/index.php/site/article/economics_of_fantasy_valuation/

## Tuesday, December 21, 2010

## Friday, December 17, 2010

### Joe West vs. Bruce Froemming: A Crude Umpire LHB/RHB Bias Comparison

In my last two posts, I have tinkered with the 'gam' package to create heat maps for individual umpire strike zones. I went ahead and grabbed Joe West's data (which has a lot more pitches in it than Bruce Froemming's, since Froemming's data is only from 2007). Below, I have mapped them out with a new color scheme (those of you reading this, I'm curious as to your opinion on the better scheme). West's data is aggregated for 2007 through 2010.

Remember in my last post that Bruce Froemming tended to call a larger strike zone for Right Handed Batters (the opposite of J-Doug's finding at Beyond the Boxscore, and of my own regression analysis that found similar results to J-Doug). That led me to map out the new zones for West to see if we find any differences. I'll just start with the "All Batters" maps below to get a feel for the strike zone. Nothing too striking (though, I will mention here that the 'span' is not the same for each map, as the larger number of observations for West's map allowed me to reduce the span). These aren't all that useful, though, as the zones differ so much for different batting handedness. In my next post, I'll break them down into pitch types and counts, and possibly see if West changed over the 4 years. Finally, I'll try to break down by pitcher handedness as well.

Now, one thing to notice with the new color scheme is that we get a little more information in the outer area of the probabilities of a strike call. That's good to have, and we can see that there are strikes called a little further outside the zone than the other color scheme had indicated with the naked eye. This scheme is the inverse of an RColorBrewer palette. I'll show some code later on in the post to get it to work out this way.

In general, it seems as though West does not call strikes very far above or below the strike zone, while he might be extending it a bit further inside and outside. However, we can also see that it's not symmetrical on each side of the plate. Now let's take a look at RHB vs. LHB and where this is coming from. Beginning with LHB, it seems as though West's strike zone is a bit larger than Froemming's for lefties, and this is especially true for the inside portion of the plate.

What about righties? The RHB zone seems to stretch well outside the horizontal strike zone on both sides. So, West seems to be much more likely to call a strike on the inside part of the plate. That's good for pitchers. Again, though, West is less forgiving with high and low pitches than Froemming seems to be. But much of this is likely an artifact of a few outliers in Froemming's map, the different spans, and the fact that there are far fewer data points for Froemming than for West--for more info on running a 'gam' model that is a bit more robust to outliers, see this site run by previous commenter Matias. I'd also recommend some standardized way to choose the optimal span, which I'll be working with in the coming weeks to ensure that things are more easily comparable across umpires.

While these plots don't answer the question of which umpires are discriminating against lefties more, they do lead us to believe that it may be a good idea to have fixed effects (or, just dummy variables for each umpire) in a regression model, perhaps with some interaction terms regarding left and right handed batters. There's plenty of data, so I see little reason to worry about running out of degrees of freedom here. It seems pretty obvious that the strike zones for these two umpires are shaped quite differently, and interestingly, West does not call the outside pitch against right-handers like he does with LHBs. That seems like a disadvantage to the pitcher when facing an RHB. What do you think?

By using a fixed-effects approach, we can see where the lefty-righty bias is coming from in umpire calls, and whether or not it is something across the entire population of umpires, or skewed by a select few discriminating against left-handed batters, whether it be because of stance or unconscious bias. If we have data on umpire handedness (something discussed recently at The Book Blog), this might give us some insight into how they favor their squat behind the plate. Mike Fast has also suggested that batter stance biases the umpires, so it may be interesting to find some data on how close batters (left vs. right) crowd the plate. Any takers?
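To give a sense of what I mean by umpire fixed effects with handedness interactions, here is a minimal sketch using base R's `glm` on simulated data. Everything in `sim` is invented (three fake umpires, a quadratic fall-off in strike probability, and an extra-large zone for lefties built into `ump_b`); it is not the Pitch F/X data or the regression I actually ran.

```r
set.seed(1)
n <- 3000

# Simulated called pitches: umpire, batter handedness, and pitch location.
sim <- data.frame(
  umpire = factor(sample(c("ump_a", "ump_b", "ump_c"), n, replace = TRUE)),
  lhb    = rbinom(n, 1, 0.4),   # 1 = left-handed batter
  px     = runif(n, -2, 2),     # horizontal location (ft.)
  pz     = runif(n, 0.5, 4.5)   # vertical location (ft.)
)

# Assumed "truth": strike probability drops off away from the middle of the
# zone, and ump_b (only) is more generous with strikes against lefties.
eta <- 3 - 2.5 * sim$px^2 - 1.5 * (sim$pz - 2.5)^2 +
  0.8 * sim$lhb * (sim$umpire == "ump_b")
sim$strike <- rbinom(n, 1, plogis(eta))

# Logit with umpire dummies and umpire-by-handedness interactions.
fit <- glm(strike ~ I(px^2) + I((pz - 2.5)^2) + umpire * lhb,
           family = binomial(link = "logit"), data = sim)
round(summary(fit)$coefficients, 3)
```

The `umpireump_b:lhb` and `umpireump_c:lhb` rows are the ones to look at: they say whether a given umpire treats lefties differently from the baseline umpire, after controlling for location.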

(sorry they're not side-by-side...originally the post was like this, but Blogger decided to reformat things and I can't seem to get it to format correctly...though it depends on what computer screen you're looking at this with apparently).

CODE (using the Pretty R-Tool):

```r
# Load Joe West's called pitches (2007-2010).
data <- read.csv(file="joe_west_called_pitches.csv", header=TRUE)
head(data)

library(gam)
library(RColorBrewer)

# Browse the available palettes and print the 11-color "RdYlBu" palette.
display.brewer.all()
brewer.pal(11, "RdYlBu")

# "RdYlBu" in reverse order: blue = low strike probability, red = high.
buylrd <- c("#313695", "#4575B4", "#74ADD1", "#ABD9E9", "#E0F3F8", "#FFFFBF", "#FEE090", "#FDAE61", "#F46D43", "#D73027", "#A50026")

# NOTE: 'aspect.ratio' (scales the horizontal span) and the data frames
# 'right' and 'left' (the RHB and LHB subsets of 'data') are not created in
# this post; they need to exist before the code below will run.

#### all batters
# Binomial GAM with a loess smooth on each location coordinate.
fit.gam <- gam(call_type ~ lo(px, span=.3*aspect.ratio, degree=1) + lo(pz, span=.3, degree=1), family=binomial(link="logit"), data=data)
# 30 x 30 grid of plate locations to predict over.
myx.gam <- matrix(data=seq(from=-2, to=2, length.out=30), nrow=30, ncol=30)
myz.gam <- t(matrix(data=seq(from=0, to=5, length.out=30), nrow=30, ncol=30))
fitdata.gam <- data.frame(px=as.vector(myx.gam), pz=as.vector(myz.gam))
# Predicted strike probabilities, reshaped so rows are x and columns are z.
mypredict.gam <- predict(fit.gam, fitdata.gam, type="response")
mypredict.gam <- matrix(mypredict.gam, nrow=30, ncol=30)

png(file="WestAllGAMbrewercol.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length.out=30), y=seq(from=0, to=5, length.out=30), z=mypredict.gam, axes=TRUE, zlim=c(0,1), nlevels=50,
  color.palette=colorRampPalette(buylrd),
  main="Joe West Strike Zone Map (GAM Package)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
  plot.axes={
    axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
    axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
    # Dashed box: plate width and the average top/bottom of the zone.
    rect(-0.708335, mean(data$sz_bot), 0.708335, mean(data$sz_top), border="black", lty="dashed", lwd=2)
  },
  key.axes={
    ylim=c(0,1.0)
    axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")
  })
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()

#### righties
fit.gam.r <- gam(call_type ~ lo(px, span=.3*aspect.ratio, degree=1) + lo(pz, span=.3, degree=1), family=binomial(link="logit"), data=right)
myx.gam.r <- matrix(data=seq(from=-2, to=2, length.out=30), nrow=30, ncol=30)
myz.gam.r <- t(matrix(data=seq(from=0, to=5, length.out=30), nrow=30, ncol=30))
fitdata.gam.r <- data.frame(px=as.vector(myx.gam.r), pz=as.vector(myz.gam.r))
mypredict.gam.r <- predict(fit.gam.r, fitdata.gam.r, type="response")
mypredict.gam.r <- matrix(mypredict.gam.r, nrow=30, ncol=30)

png(file="WestRightGAMbrewercol.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length.out=30), y=seq(from=0, to=5, length.out=30), z=mypredict.gam.r, axes=TRUE, zlim=c(0,1), nlevels=50,
  color.palette=colorRampPalette(buylrd),
  main="Joe West Strike Zone Map (RHB, GAM Package)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
  plot.axes={
    axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
    axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
    rect(-0.708335, mean(data$sz_bot), 0.708335, mean(data$sz_top), border="black", lty="dashed", lwd=2)
  },
  key.axes={
    ylim=c(0,1.0)
    axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")
  })
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()

#### lefties
fit.gam.l <- gam(call_type ~ lo(px, span=.3*aspect.ratio, degree=1) + lo(pz, span=.3, degree=1), family=binomial(link="logit"), data=left)
myx.gam.l <- matrix(data=seq(from=-2, to=2, length.out=30), nrow=30, ncol=30)
myz.gam.l <- t(matrix(data=seq(from=0, to=5, length.out=30), nrow=30, ncol=30))
fitdata.gam.l <- data.frame(px=as.vector(myx.gam.l), pz=as.vector(myz.gam.l))
mypredict.gam.l <- predict(fit.gam.l, fitdata.gam.l, type="response")
mypredict.gam.l <- matrix(mypredict.gam.l, nrow=30, ncol=30)

png(file="WestLeftGAMbrewercol.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length.out=30), y=seq(from=0, to=5, length.out=30), z=mypredict.gam.l, axes=TRUE, zlim=c(0,1), nlevels=50,
  color.palette=colorRampPalette(buylrd),
  main="Joe West Strike Zone Map (LHB, GAM Package)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
  plot.axes={
    axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
    axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
    rect(-0.708335, mean(data$sz_bot), 0.708335, mean(data$sz_top), border="black", lty="dashed", lwd=2)
  },
  key.axes={
    ylim=c(0,1.0)
    axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")
  })
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()
```

Labels:
Baseball,
Heat Maps,
Pitch F/X,
R-project,
Sabermetrics

## Wednesday, December 8, 2010

### Interesting Posts at Rational Pastime Related to My Previous Strike Zone Map Post

J-Doug at Rational Pastime has some cool posts looking at umpire strike zones at his site (and cross-posted at Beyond the Boxscore). I was curious about this issue as well with some work I've been doing here in the office (which I'll refrain from talking about at this point).

Anyway, J-Doug looks at the strike zone size of RHB and LHB, concluding that lefties get the shaft (larger strike zone). Now, I only have Bruce Froemming's data in R right now, but I was curious if we would see anything different using A. Just Froemming for now and B. Using the GAM package, rather than a standard loess. Below are the 'gam' generated heat maps from my last post for LHB and RHB (LHB got deleted somehow when I posted, and I'm getting extremely frustrated with Blogger's posting options).

And here is a standard loess with different smoothing parameters than I had in my last post:

Now a few caveats: first, I have not normalized the strike zone height; the box is simply the average zone for everyone in the dataset. So the fact that we see calls spread a little further above the strike zone for righties than for lefties may just mean lefties are overall taller than average (or it could just be some random noise). Secondly, this is only one umpire, while J-Doug has more than that. Lastly, I'm still in experimental mode with the gam models, so I could be totally off here.

Now to the interesting parts. The GAM model heat maps (the ones using the binomial assumption for the response) seem to show that the zone for right-handed batters is a little bigger than that for left-handed batters. In fact, this seems to be the case for both the standard loess and the gam package.

The main difference seems to be that the zone stretches further outside for lefties than it does for right handers. Right handers have to deal with more calls up and in and down and in than lefties apparently do (for Froemming that is).

I dunno. Just some observations. I haven't calculated confidence intervals or systematically chosen the span, but I made sure that for each of the pairs, the parameters for smoothing were the same to make them comparable. For the 'gam' model maps, I have a span of 0.5 and a first degree polynomial, while for the 'loess' model maps, I have a 0.7 span and a second degree polynomial. But the main issue is comparing RHB and LHB of each type.

So what does this mean? Well not too much. It could mean that Froemming doesn't follow the standard. It could mean that maybe using the 'gam' package is helpful in visualizing the true zone. Or it could mean that I didn't use the right parameters for my model(s). It certainly does not mean that J-Doug's conclusions are incorrect, but I'm curious how the results may look otherwise.

My Own Evidence Against Me:

Here is some evidence that the above plots (both the 'loess' and the 'gam') are incorrect from a visualization standpoint: I've also run a regression that indicates umpires as a whole are more likely to call strikes against left-handers, even after controlling for pitcher handedness, pitch location, pitch type, and a number of other factors. Another regression with respect to whether the call is 'correct' or not also tells me that umpires are more likely to make an incorrect call for left-handed batters at a rate of about 1.8%.

So in general, it sounds like J-Doug is right: left-handed batters are getting the shaft.

Finally, in general, if the above plots are visualizing the data correctly, Bruce Froemming goes against the grain when it comes to giving an advantage to Right Handed batters (I didn't run a separate regression for him).

Labels:
Heat Maps,
Pitch F/X,
R-project,
Statistics

## Sunday, December 5, 2010

### Rethinking 'loess' for Binomial-Response Pitch F/X Strike Zone Maps

So after a long hiatus, I'm back for today. I've been crazy busy with a number of different things--including getting engaged and helping plan out wedding dates and things of that sort--and unfortunately have not kept up here on this blog (or on Fantasy Ball Junkie, though we're waiting for the Fantasy Baseball Preseason to start up before posting much more). Teaching doesn't help, and I've been working on a few papers just submitted for review. Add in taking a class, doing some consulting work, and planning trips to Greece, Napa Valley, Canada, home to Maryland for X-mas and Thanksgiving, and possibly one to San Diego, The Prince of Slides has taken a backseat. Oh yeah, and I guess I should get writing that dissertation thing, but I'll procrastinate here for a bit...

Now that I'm done complaining, here's a relatively non-game-related post (it's more on assumptions behind our baseball data)...

I work a lot in R. If you've checked out previous posts here and at FBJ, almost all the analysis is exclusively done in R. I've been contemplating shifting my focus to how-to's for Sabermetrics with R. Unfortunately for you guys, I'm not Dave Allen, nor do I have as much time to devote to bringing you saber-type pieces. I've enjoyed the R-Bloggers site that feeds a number of these open-source types of blogs, and I think it's a great resource (hint, hint: go check it out). That's still up in the air, but either way, I think for any R-related posts, I am going to try and have the RSS Feed go there if they think I'm worthy of it.

Anyway, I had been working on improving my heat maps using a loess, rather than a 2-dimensional kernel density smoother. The plot scaling for smoothScatter does not work right and those plots are limited to density only--see my previous post or my IIATMS contribution for those plots. With loess, I could add a third variable to the mix represented by the heat colors (thanks to Dave Allen for his presentation from the Pitch F/X summit a couple years ago, as his Powerpoint slides sent me in the right direction to actually get 'filled.contour' to work in R--I still can't fathom why there isn't a simple 'plot.loess.2d' function in R that does this without the extra code and 'predict' data matrix creation).

I've been working with umpire data recently for strike zone maps (probability of a strike call), but these are applicable to pretty much any variable you want to plot with respect to location in the zone. But there's a problem/thought I came across (and I'll tell you what the true problem is a little later).

If you're not familiar with 'loess', it's just a regression, but it is not constrained to be a straight line or plane through the data. In other words, it allows wiggles to fit the data more closely. Of course, when using non-parametric methods, some things can be left up to the person running the analysis, one of them being the 'span' used for the loess smoothing.

Span is the width of the smoothing window (the number of points, or the distance away, depending on the type of function you're using) included in the smoother at each prediction point. You can also use splines and nearest-neighbor type smoothing methods. If the span is 1 (in the context of how R defines it), then the result will probably be smoothed too much. If the span is closer to zero, it becomes more (too) granular and doesn't give a smoothed representation of the strike zone. It's purely up to the researcher what the span will be and really depends on what you want to get out of the graph (though, there are some suggested rules of thumb and other ways to determine it 'optimally'). Essentially, you want to balance the fit with simplicity in presentation here.

What do I mean by 'smoothed too much'? Well, if you use the entire data set to calculate the average probability for every point, irrespective of its location, the whole thing will be one color. On the other hand, if you use only each single data point to describe whether or not that pitch will be called a strike, you may as well just make a scatter plot. The idea is to fit the data well, but be able to summarize it more clearly and concisely.
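Here is a small one-dimensional sketch of that trade-off with base R's `loess`. The data are simulated (a noisy sine curve, nothing to do with pitches), and the two span values are just picked to sit at the extremes.

```r
set.seed(42)

# Simulated data: a wiggly signal plus noise.
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Small span: each local fit only uses the nearest 10% of the points.
fit_wiggly <- loess(y ~ x, span = 0.1, degree = 2)
# Span of 1: every local fit uses all of the points.
fit_smooth <- loess(y ~ x, span = 1.0, degree = 2)

# Residual sum of squares: how closely each curve hugs the data.
rss <- function(fit) sum(residuals(fit)^2)
c(wiggly = rss(fit_wiggly), smooth = rss(fit_smooth))

# Overlay both fits to see the difference.
plot(x, y, col = "grey60", pch = 16)
lines(x, fitted(fit_wiggly), col = "red")
lines(x, fitted(fit_smooth), col = "blue", lwd = 2)
```

The small span chases the noise and the large span flattens out real structure; a sensible strike zone map lives somewhere in between.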

I began using Dave Allen's code with the parameters he had chosen. If you look below, I have the strike zone maps for Bruce Froemming in 2007. This uses a standard loess smoother in R with a span of 0.6. But look closely at the borders of the Heat Map.

I have a hard time believing that the strike zone for an umpire looks like it does in these plots: higher probabilities of points well outside the strike zone than those closer to it. There seems to be a bit too much smoothing going on with my current span, so I fiddled around with it for a while and something hit me: for Strike Zone Maps, we don't have a normal distribution, but rather a binomial distribution. The pitch is either a strike (1) or a ball (0). While this isn't a huge problem in general--we can run linear probability models instead of probits/logits after all--it can be misleading when looking at strike zone maps (and you also have to go back and rescale your variable before you plot it, or you'll get probabilities above 1 and below 0).

(**SIDE NOTE: Dave's plots don't show this, from what I've seen, so the specifications he has in the presentation are almost certainly different than he uses for the strike zone maps I've seen. And I think his work is awesome. Finally, there are alternative solutions to fixing up the plots while still using a standard loess, but this post is simply to inform you of some other options).

Another option would be to reduce the span/increase the span (but then we'd either get a dot in the upper right OR make the strike zone in general look too large/smoothed out).

Finally, we could change the degree of the polynomial we use. The default is to use a second degree polynomial, which results in allowing more 'wiggle' in the loess. If we constrain it to 1, the plots look at little bit better. This is good to know, and it seems that generally, the loess smoothing works well if the parameters are chosen correctly.

Anyway, with a standard loess in R, here is what we see (span = 0.6, scaled for the ratio of the height vs. width, so the span for horizontal location is a little smaller):

Now, there are obviously a few outliers or problems in the data (see the one point in the top right below int he scatter plot), and that's likely why we get a little green at the bottom-left and top-right of the plots.

But remember we fit the original plots without bothering to think about the distribution of our data. Loess generally doesn't cover binomial data. Again, it may not be a huge deal, but it can affect some things we look at, and we'll have to bound/recode our data from 0 to 1 after we predict it (or rescaling it to 0, 1 which are actually different things--the latter would probably be preferred, but my code below does the former).

This means that we may not be getting a very good idea of the data points within the strike zone. If the ones right down the middle have a predicted probability of 1.40 using the loess, while the ones closer to the edges are at 1.10, we might want to know that these are not the same thing. But we artificially bound the data. Another option is to simply bound the color key on the right, but then points above 1 and below 0 are not included in our color coding! What to do?

Luckily, R has a package called 'gam' (Generalized Additive Models) that allows us to fit a loess regression using the binomial family and a logit link function similar to the glm package. What happens when we do this (using the same span of 0.6 for height and scaled accordingly for width)?

Here's what we get:

This seems like a much better-behaved strike zone map than the one using a standard loess regression. While in general, we see fairly similar results, there could be some misleading observations based on the family of distributions that we decide to model the data. Ultimately, the choice of your model and the parameters you stick in there will be important for your visualization, but I'm not sure it's make or break. I guess the thesis of this post is to make sure you think about your data first.

However, I'm curious what others think of using the 'gam' package for this. It looks like it makes the plot a perfect mirror image above and below the 2.5 foot line (draw a horizontal line there and flip it). I'm not certain why it does this, which makes me want to go back to the drawing board. I may try to simulate my own extremely non-symmetric data and see what happens along the horizontal axis. The last I heard, the 'gam' package was somewhat experimental, so perhaps using this is a bit out of date (anybody know). Feel free to take my code and fiddle with it! Change the degree, change the span, change the colors, then tell me mine are stupid and yours are way better!

I would love to hear suggestions about this. My first thought is that I'm missing something with the 'gam' package that results in the 'mirror image' look. So what do you think? Does the 'gam' package or the standard 'loess' give a better view of what's really going on in that scatter plot above? I'm happy if you want to tell me I'm totally wrong on this as well or of other packages that can do similar things for binomial data. For more continuous data, loess should work just fine (like run values, though that's technically not a continuous variable if we're using some category of run values like "Man on 1st, Two Outs, Single to Center Field" type of data...but that's getting nit-picky and I don't think it matters much). Like I said before, messing with the Span and changing the polynomial degree in the loess function generally makes things better. But we still want to model the data correctly, right?

Another possibility I'd like to play with is using smoothing splines to see if there are any differences. Both 'gam' and the 'mgcv' package, I think, allow you to do splines rather than loess. I imagine the results will be pretty much the same. And since we're not toooooo worried about testing hypotheses with this plot, I'm not extremely worried about the error distribution. The main worry is at the edges of our plot, where things may be overestimated without bounding the data with the logit function. Just something to keep in mind.

I have provided the code below for producing the maps using All of the data, keeping in mind that my data is structured from Mike Fast's code and I created a numeric representation of the 'type' variable called "call_type" where Strike=1 and Ball=1. You can always select the data specifically for RHB, or LHB, or pitch-type, or whatever you want! My data are pre-filtered in SPSS to only include Bruce Froemming's 2007 data for CALLED pitches.

Another request if anyone bothers to read this blog: I was wondering if there is an easy way to append catcher IDs to the pitch database structure developed by Mike Fast (Mike, if you're out there, please help!). I looked through the XML files from the Gameday database, and it looks like they have Starting Catcher in the player file, but figuring out who is catching pitch by pitch is not as straightforward as it is for the pitchers and batters, who are included in the inning files.

To do it from the XML, it looks like there would have to be a search in the play-by-play to see when a catcher was switched in or out, and I'm not sure it's even possible to set up some sort of macro to do this: "Defensive Replacement" has a generic code, so picking out only the catchers would be tedious. BUT, I know that there are people out there with catchers appended to their Pitch F/X data, so maybe it's not as difficult as I'm making it out to be.

Now that I'm done complaining, here's a relatively non-game-related post (it's more on assumptions behind our baseball data)...

I work a lot in R. If you've checked out previous posts here and at FBJ, almost all of the analysis is done in R. I've been contemplating shifting my focus to how-to's for Sabermetrics with R. Unfortunately for you guys, I'm not Dave Allen, nor do I have as much time to devote to bringing you saber-type pieces. I've enjoyed the R-Bloggers site that feeds a number of these open-source types of blogs, and I think it's a great resource (hint, hint: go check it out). The shift in focus is still up in the air, but either way, for any R-related posts I am going to try to have the RSS feed go there, if they think I'm worthy of it.

Anyway, I had been working on improving my heat maps using a loess rather than a 2-dimensional kernel density smoother. The plot scaling for smoothScatter does not work right, and those plots are limited to density only (see my previous post or my IIATMS contribution for them). With loess, I can add a third variable to the mix, represented by the heat colors. Thanks go to Dave Allen for his presentation from the Pitch F/X summit a couple of years ago: his PowerPoint slides sent me in the right direction to actually get 'filled.contour' to work in R. I still can't fathom why there isn't a simple 'plot.loess.2d' function in R that does this without the extra code and the creation of a 'predict' data matrix.
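For what it's worth, the grid-and-predict step can be wrapped in a small helper. This is only a sketch on simulated data: `loess_grid` is a made-up name (it is not a function in R or in Dave's slides), it assumes the two location columns are called `px` and `pz`, and the simulated `value` column stands in for whatever you want to map.

```r
# Hypothetical helper: fit a 2-D loess and return the pieces filled.contour() needs.
# Assumes the formula uses predictors named px and pz.
loess_grid <- function(formula, data, xs, zs, span = 0.5, degree = 1) {
  fit  <- loess(formula, data = data, span = span, degree = degree)
  grid <- expand.grid(px = xs, pz = zs)   # px varies fastest
  pred <- predict(fit, newdata = grid)
  # as.numeric() drops any array shape predict() attaches for expand.grid() input
  list(x = xs, y = zs, z = matrix(as.numeric(pred), nrow = length(xs)))
}

# Simulated stand-in data: a tilted plane plus a little noise
set.seed(42)
sim <- data.frame(px = runif(500, -2, 2), pz = runif(500, 0, 5))
sim$value <- sim$px + 2 * sim$pz + rnorm(500, sd = 0.01)

xs <- seq(-1.5, 1.5, length.out = 20)
zs <- seq(0.5, 4.5, length.out = 25)
surf <- loess_grid(value ~ px + pz, sim, xs, zs)

# Rows of z follow xs and columns follow zs, the layout filled.contour() expects:
# filled.contour(surf$x, surf$y, surf$z)
```

The grid is kept inside the range of the simulated data on purpose, since a default loess fit returns NA rather than extrapolating.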

I've been working with umpire data recently for strike zone maps (probability of a strike call), but the approach is applicable to pretty much any variable you want to plot with respect to location in the zone. Along the way I ran into a problem (and I'll tell you what the true problem is a little later).

If you're not familiar with 'loess', it's just a regression, but it is not constrained to be a straight line or plane through the data. In other words, it allows wiggles to fit the data more closely. Of course, when using non-parametric methods, some things can be left up to the person running the analysis, one of them being the 'span' used for the loess smoothing.

Span controls how much of the data is included in the smoother at each prediction point. Depending on the type of smoother you're using (you can also use splines and nearest-neighbor type methods), that is a width, a distance, or a number of points; in R's loess, it is the proportion of the data points used in each local fit. In other words, if the span is 1 (in the context of how R defines it), the map will probably be smoothed too much. If the span is closer to zero, the fit becomes more (too) granular and doesn't give a smoothed representation of the strike zone. It's up to the researcher what the span will be, and it really depends on what you want to get out of the graph (though there are some suggested rules of thumb and other ways to determine it 'optimally'). Essentially, you want to balance the fit with simplicity in presentation here.
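To get a feel for the span, here is a minimal one-dimensional sketch on simulated data (the variable names and the simulated strike probabilities are made up for illustration, not taken from the umpire file). It fits the same 0/1 outcome with a small and a large span and compares how much the two fitted curves move around.

```r
# Simulated called pitches along one dimension: strikes are likely near px = 0
set.seed(1)
sim <- data.frame(px = runif(400, -2, 2))
sim$strike <- rbinom(400, 1, plogis(4 - 6 * sim$px^2))

grid <- data.frame(px = seq(-1.8, 1.8, length.out = 50))

# Small span: each local fit uses about 10% of the pitches (wiggly)
fit_small <- loess(strike ~ px, data = sim, span = 0.1, degree = 1)
# Large span: each local fit uses all of the pitches (very smooth)
fit_large <- loess(strike ~ px, data = sim, span = 1, degree = 1)

pred_small <- predict(fit_small, grid)
pred_large <- predict(fit_large, grid)

# Roughness: total absolute change between neighbouring grid points
roughness <- function(p) sum(abs(diff(p)))
c(small = roughness(pred_small), large = roughness(pred_large))
```

The small-span curve should come out much rougher than the large-span one, which is the fit-versus-simplicity trade-off described above.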

What do I mean by 'smoothed too much'? Well, if you use the entire data set to calculate the average probability for every point, irrespective of its location, the whole thing will be one color. On the other hand, if you use only the single data point itself to describe whether or not that pitch will be called a strike, you may as well just make a scatter plot. The idea is to fit the data well, but be able to summarize it more clearly and concisely.

I began using Dave Allen's code with the parameters he had chosen. If you look below, I have the strike zone maps for Bruce Froemming in 2007. This uses a standard loess smoother in R with a span of 0.6. But look closely at the borders of the heat map.

I have a hard time believing that the strike zone for an umpire looks like it does in these plots: higher probabilities of points well outside the strike zone than those closer to it. There seems to be a bit too much smoothing going on with my current span, so I fiddled around with it for a while and something hit me: for Strike Zone Maps, we don't have a normal distribution, but rather a binomial distribution. The pitch is either a strike (1) or a ball (0). While this isn't a huge problem in general--we can run linear probability models instead of probits/logits after all--it can be misleading when looking at strike zone maps (and you also have to go back and rescale your variable before you plot it, or you'll get probabilities above 1 and below 0).
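The out-of-range problem is easy to reproduce. The sketch below uses simulated data and a plain linear probability model (`lm`) rather than a loess, purely to keep it short; the same thing can happen with any smoother that treats the 0/1 outcome as continuous. All names here are made up for the example.

```r
set.seed(2)
sim <- data.frame(px = runif(500, -3, 3))
sim$strike <- rbinom(500, 1, plogis(3 * sim$px))   # simulated 0/1 outcome

grid <- data.frame(px = seq(-3, 3, by = 0.5))

# Linear probability model: ordinary least squares on the 0/1 outcome
lpm <- lm(strike ~ px, data = sim)
# Logit: binomial family, so predictions come back through the inverse logit
logit <- glm(strike ~ px, family = binomial(link = "logit"), data = sim)

pred_lpm   <- predict(lpm, grid)
pred_logit <- predict(logit, grid, type = "response")

range(pred_lpm)    # runs below 0 and above 1 at the extremes
range(pred_logit)  # stays between 0 and 1
```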

(**SIDE NOTE: Dave's plots don't show this, from what I've seen, so the specifications he has in the presentation are almost certainly different from the ones he uses for the strike zone maps I've seen. And I think his work is awesome. Finally, there are alternative ways to fix up the plots while still using a standard loess, but this post is simply to inform you of some other options.)

Another option would be to reduce or increase the span (but then we'd either get a dot in the upper right OR make the strike zone in general look too large and smoothed out).

Finally, we could change the degree of the polynomial we use. The default in R's loess is a second-degree polynomial, which allows more 'wiggle' in the fit. If we constrain it to 1, the plots look a little bit better. This is good to know, and it seems that generally the loess smoothing works well if the parameters are chosen correctly.

Anyway, with a standard loess in R, here is what we see (span = 0.6, scaled for the ratio of the height vs. width, so the span for horizontal location is a little smaller):

Now, there are obviously a few outliers or problems in the data (see the one point in the top right of the scatter plot below), and that's likely why we get a little green at the bottom-left and top-right of the plots.

But remember, we fit the original plots without bothering to think about the distribution of our data. A standard loess isn't built for binomial data (R's loess only offers the 'gaussian' and 'symmetric' families). Again, it may not be a huge deal, but it can affect some things we look at, and we'll have to bound/recode our predictions to lie between 0 and 1 after we predict them (or rescale them to the 0-1 range; these are actually different things, and the latter would probably be preferred, but my code below does the former).
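To make the bounding-versus-rescaling distinction concrete, here is a tiny sketch with made-up predictions. I'm reading 'rescaling' as a min-max rescale, which is an assumption on my part; other rescalings are possible.

```r
# Pretend loess predictions, some of them outside [0, 1]
pred <- c(-0.2, 0.5, 1.1, 1.4)

# Bounding (what the code in this post does): cut everything off at 0 and 1
bounded <- pmin(pmax(pred, 0), 1)

# Rescaling (min-max): stretch the range so the minimum is 0 and the maximum is 1
rescaled <- (pred - min(pred)) / (max(pred) - min(pred))

bounded   # 0.0 0.5 1.0 1.0 -> 1.1 and 1.4 become indistinguishable
rescaled  # 0.0000 0.4375 0.8125 1.0000 -> order is kept, but 0.5 is no longer 0.5
```

Bounding throws away the ordering above 1 and below 0, while rescaling keeps the ordering but changes every value, so neither gives you an honest probability.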

This means that we may not be getting a very good idea of the data points within the strike zone. If the ones right down the middle have a predicted probability of 1.40 using the loess, while the ones closer to the edges are at 1.10, we might want to know that these are not the same thing. But once we artificially bound the data, both are plotted as 1. Another option is to simply bound the color key on the right, but then points above 1 and below 0 are not included in our color coding! What to do?

Luckily, R has a package called 'gam' (Generalized Additive Models) that allows us to fit loess terms using the binomial family and a logit link function, much like the glm function does for ordinary regression. What happens when we do this (using the same span of 0.6 for height and scaled accordingly for width)?

Here's what we get:

This seems like a much better-behaved strike zone map than the one using a standard loess regression. While in general we see fairly similar results, there could be some misleading observations depending on the family of distributions we use to model the data. Ultimately, the choice of your model and the parameters you stick in there will be important for your visualization, but I'm not sure it's make or break. I guess the thesis of this post is to make sure you think about your data first.

However, I'm curious what others think of using the 'gam' package for this. It looks like it makes the plot a perfect mirror image above and below the 2.5 foot line (draw a horizontal line there and flip it). I'm not certain why it does this, which makes me want to go back to the drawing board. I may try to simulate my own extremely non-symmetric data and see what happens along the horizontal axis. The last I heard, the 'gam' package was somewhat experimental, so perhaps using this is a bit out of date (anybody know?). Feel free to take my code and fiddle with it! Change the degree, change the span, change the colors, then tell me mine are stupid and yours are way better!

I would love to hear suggestions about this. My first thought is that I'm missing something with the 'gam' package that results in the 'mirror image' look. So what do you think? Does the 'gam' package or the standard 'loess' give a better view of what's really going on in that scatter plot above? I'm happy if you want to tell me I'm totally wrong on this as well, or to point me to other packages that can do similar things for binomial data. For more continuous data, loess should work just fine (like run values, though that's technically not a continuous variable if we're using some category of run values like "Man on 1st, Two Outs, Single to Center Field" type of data...but that's getting nit-picky and I don't think it matters much). Like I said before, messing with the span and changing the polynomial degree in the loess function generally makes things better. But we still want to model the data correctly, right?
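One possible contributor to the mirror-image look, offered as a guess rather than a diagnosis: a model written as `lo(px) + lo(pz)` is additive on the logit scale, so it has no px-by-pz interaction, whereas `loess(call_type ~ px + pz)` fits a joint surface. The sketch below uses simulated data and base R's `glm` with polynomial terms as a stand-in for the loess terms (chosen only so it runs without extra packages; it is not the 'gam' fit itself). The simulated zone is deliberately tilted, yet the additive fit cannot show the tilt.

```r
set.seed(3)
n   <- 2000
sim <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0, 5))
# A deliberately tilted zone: the centre height shifts with px
r2 <- (sim$px / 0.8)^2 + (sim$pz - 2.5 - 0.5 * sim$px)^2
sim$strike <- rbinom(n, 1, plogis(2 - 1.5 * r2))

# Additive fit: one term per coordinate, no interaction
fit_add <- glm(strike ~ poly(px, 2) + poly(pz, 2), family = binomial, data = sim)

# Log-odds gap between two heights, evaluated at several horizontal locations
xs   <- c(-1, -0.5, 0, 0.5, 1)
low  <- predict(fit_add, data.frame(px = xs, pz = 2.0), type = "link")
high <- predict(fit_add, data.frame(px = xs, pz = 3.0), type = "link")
gap  <- high - low
gap  # the same value at every px, even though the simulated zone is tilted
```

With an additive fit, every vertical slice of the map has the same shape on the log-odds scale, just shifted up or down. If the pz term happens to be roughly symmetric around 2.5 feet, the whole map will be too. As far as I know, the 'gam' package also accepts a joint term written as `lo(px, pz)`, which might be a fairer comparison with the two-dimensional loess.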

Another possibility I'd like to play with is using smoothing splines to see if there are any differences. Both 'gam' and the 'mgcv' package, I think, allow you to do splines rather than loess. I imagine the results will be pretty much the same. And since we're not toooooo worried about testing hypotheses with this plot, I'm not extremely worried about the error distribution. The main worry is at the edges of our plot, where things may be overestimated without bounding the data with the logit function. Just something to keep in mind.
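A minimal sketch of the spline route, assuming the 'mgcv' package is installed and using simulated data in place of the umpire file (all names below are made up for the example). Here `s(px, pz)` asks for one joint smooth surface in both coordinates, and the binomial family keeps the predictions between 0 and 1.

```r
library(mgcv)  # note: 'mgcv' and 'gam' both define gam(); load only one per session

set.seed(4)
n   <- 2000
sim <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0, 5))
r2  <- (sim$px / 0.8)^2 + (sim$pz - 2.5)^2
sim$strike <- rbinom(n, 1, plogis(2 - 1.5 * r2))

# Joint spline surface in px and pz with a binomial family and logit link
fit_spline <- gam(strike ~ s(px, pz), family = binomial(link = "logit"), data = sim)

xs   <- seq(-2, 2, length.out = 30)
zs   <- seq(0, 5, length.out = 30)
grid <- expand.grid(px = xs, pz = zs)   # px varies fastest
prob <- matrix(as.numeric(predict(fit_spline, newdata = grid, type = "response")),
               nrow = length(xs))

range(prob)  # within 0 and 1 by construction, so no bounding step is needed
# filled.contour(xs, zs, prob, zlim = c(0, 1))
```

Whether this looks any different from the loess-based maps on real umpire data is exactly the open question above; the sketch only shows the mechanics.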

I have provided the code below for producing the maps using all of the data. Keep in mind that my data are structured according to Mike Fast's code, and that I created a numeric representation of the 'type' variable called "call_type", where Strike=1 and Ball=0. You can always select the data specifically for RHB, or LHB, or pitch-type, or whatever you want! My data are pre-filtered in SPSS to only include Bruce Froemming's 2007 data for CALLED pitches.


Code (please excuse the crappy Blogger formatting, I'm not an HTML guy so I don't have much control over this and the standard text editor does not let me indent code (ARGH!); this is for the general plots, but the RHB/LHB work just the same, you just need to subset the data):

Correction: found a way to insert colored R code using "Pretty R" html. Just copy and paste, and you're good to go. Still not perfect below (the html tool left out a couple parenthesis, so I'm sorry about that), but it should be helpful when trying to read it!


# Called pitches for one umpire; call_type is the numeric version of 'type'
ump <- read.csv(file="umpire_10.csv", header=TRUE)
head(ump)

#use loess and filled.contour

# 0.708335 ft is half the width of the 17-inch plate
sz.width <- 0.708335 - (-0.708335)
sz.height <- mean(ump$sz_top) - mean(ump$sz_bot)
aspect.ratio <- (max(ump$px)-min(ump$px))/(max(ump$pz)-min(ump$pz))

# NOTE: the two-element span is kept from the original code, but the loess
# documentation describes span as a single number, so check how your version
# of R treats the second value.
fit <- loess(call_type ~ px + pz, data=ump, span=c(0.5*aspect.ratio, 0.5), degree=1)

# 30 x 30 prediction grid; px varies fastest, so rows of the matrix follow px
xs <- seq(from=-2, to=2, length.out=30)
zs <- seq(from=0, to=5, length.out=30)
fitdata <- expand.grid(px=xs, pz=zs)
mypredict <- as.vector(predict(fit, fitdata))

# bound the predictions to [0, 1] (this is bounding, not rescaling)
mypredict <- pmin(pmax(mypredict, 0), 1)
mypredict <- matrix(mypredict, nrow=30, ncol=30)

png(file="FroemmingAll.png", width=600, height=675)
filled.contour(x=xs, y=zs, z=mypredict, zlim=c(0,1), nlevels=50,
  color.palette=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")),
  main="Bruce Froemming Strike Zone Map", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
  plot.axes={
    axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
    axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
    # dashed box: rule-book width and the average top/bottom of the zone
    rect(-0.708335, mean(ump$sz_bot), 0.708335, mean(ump$sz_top), border="black",
      lty="dashed", lwd=2)
  },
  key.axes={
    axis(4, at=seq(0, 1, by=0.1), labels=seq(0, 1, by=0.1), pos=1, las=0, col="black")
  })
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()

############trying to use the 'gam' package instead of loess because data is binary
library(gam)

####all batters
# lo(px) + lo(pz) is an additive fit: one loess term per coordinate, with no
# px-by-pz interaction, unlike loess(call_type ~ px + pz) above
fit.gam <- gam(call_type ~ lo(px, span=.5*aspect.ratio, degree=1) + lo(pz, span=.5, degree=1),
  family=binomial(link="logit"), data=ump)

# 30 x 30 prediction grid; px varies fastest, so rows of the matrix follow px
xs.gam <- seq(from=-2, to=2, length.out=30)
zs.gam <- seq(from=0, to=5, length.out=30)
fitdata.gam <- expand.grid(px=xs.gam, pz=zs.gam)

# type="response" returns probabilities, so no bounding step is needed
mypredict.gam <- as.vector(predict(fit.gam, fitdata.gam, type="response"))
mypredict.gam <- matrix(mypredict.gam, nrow=30, ncol=30)

png(file="FroemmingAllGAM.png", width=600, height=675)
filled.contour(x=xs.gam, y=zs.gam, z=mypredict.gam, axes=TRUE, zlim=c(0,1), nlevels=50,
  color.palette=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")),
  main="Bruce Froemming Strike Zone Map (GAM Package)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
  plot.axes={
    axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
    axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
    rect(-0.708335, mean(ump$sz_bot), 0.708335, mean(ump$sz_top), border="black",
      lty="dashed", lwd=2)
  },
  key.axes={
    axis(4, at=seq(0, 1, by=0.1), labels=seq(0, 1, by=0.1), pos=1, las=0, col="black")
  })
text(1.4, 2.5, "Probability of Strike Call", cex=1.1, srt=90)
dev.off()
