Anyway, I had been doing some work in the office with Pitch F/X data, just trying to map out the strike zone for all umpires in the aggregate. This was a bit slow on the computer (it's about 400,000 called pitches for 2010) so I decided to run a program for generating multiple heat maps on my new computer back home. But there was a problem.

For data sets that are large enough (usually at least 5,000 observations with this type of data) I like to use a generalized additive model, or GAM (the 'gam' package), to estimate the probability of a strike call at a given 2-dimensional location. I have shown these before, and I use "filled.contour()" to generate a heat map from the model output (for those in the saber world, it's an adaptation of Dave Allen's presentation at the Pitch F/X summit in 2009). Choosing the right bandwidth for these models is difficult and the visualizations can get tricky, but I won't get into that right now, as it is beside the point for this post.

The problem seems to stem from something else, and the only thing different seems to be the computer the code is run on (and the R version). Let's start simple. Below I map out every called strike from 2010 just using a scatter plot. This plot turns out the same on both computers (keep in mind I used the EXACT same script for both), so no problem here.

When I used my code at the office in order to map the probability onto a 2-D space, I got the perfectly reasonable solution shown below:

However, when I did this on my home computer, things got weird. I've gone over it again and again and can't figure out what is going on. When I run the identical code on my personal laptop, I end up with the following:

It's almost as if the axes are flipped. But even with the same bandwidth and same data, the latter visual seems to have a slightly larger predicted strike zone.
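One way to test the flipped-axes hunch without the real data is to push a surface with a known shape through the same grid-building steps. The sketch below is only an illustration: `known_surface`, `gx`, `gz`, and `zmat` are made-up names for this check and are not part of the analysis. It builds the grid the way the script further down does, confirms that this matches what `expand.grid()` produces, and verifies that the reshaped matrix has px running down the rows and pz across the columns, which is the layout `filled.contour()` expects for its `z` argument.

```r
## Toy check of grid orientation (hypothetical names, base R only)
gx <- seq(from=-2, to=2, length.out=100)   # horizontal locations
gz <- seq(from=0, to=5, length.out=100)    # vertical locations

## Grid built the same way as in the script: px cycles fastest, pz slowest
myx <- matrix(data=gx, nrow=100, ncol=100)
myz <- t(matrix(data=gz, nrow=100, ncol=100))

## expand.grid varies its first argument fastest, so it should agree
grid <- expand.grid(px=gx, pz=gz)
same_px <- all(as.vector(myx) == grid$px)
same_pz <- all(as.vector(myz) == grid$pz)

## A made-up surface that depends only on px: 1 inside |px| < 0.7, else 0
known_surface <- function(px, pz) as.numeric(abs(px) < 0.7)
zmat <- matrix(known_surface(grid$px, grid$pz), nrow=length(gx), ncol=length(gz))

## If rows index px and columns index pz, every row is constant
## and every column is identical to the first
rows_constant <- all(apply(zmat, 1, function(r) length(unique(r)) == 1))
cols_identical <- all(zmat == zmat[, 1])

print(c(same_px=same_px, same_pz=same_pz,
        rows_constant=rows_constant, cols_identical=cols_identical))
```

If all four values come back TRUE on both machines, the grid and reshape steps are probably not where the flip is happening, and the model fit or the prediction step would be the next place to look.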

So I guess my question is: Has anyone else run into this sort of problem? And would this be a problem with R or the 'gam' package (I'm not sure if I have different 'gam' package versions on my computers, but both were installed within the last 6 months or so)?
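Since the package versions are an open question, a small check like the following, run on each machine, would at least pin them down. This is a sketch using base R calls only; the helper `where_is` is a made-up name. It also reports which package the functions `gam()` and `lo()` currently resolve to, because more than one package exports a function called `gam()` (mgcv is one example), and whichever is attached last masks the others.

```r
## Record the R version and, if installed, the 'gam' package version
cat(R.version.string, "\n")

if (requireNamespace("gam", quietly=TRUE)) {
  cat("gam version:", as.character(packageVersion("gam")), "\n")
} else {
  cat("gam is not installed on this machine\n")
}

## Hypothetical helper: name of the environment a function is defined in,
## or NA if no function of that name is visible on the search path
where_is <- function(fname) {
  if (!exists(fname, mode="function")) return(NA_character_)
  environmentName(environment(get(fname, mode="function")))
}

## After library(gam), both of these would be expected to say "gam"
print(c(gam=where_is("gam"), lo=where_is("lo")))
```

Running `sessionInfo()` after loading the packages gives a fuller picture, including every attached package and its version, and is easy to paste into a question like this one.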

I think it's pretty obvious that the latter model is not a good representation of the data. But my hope was to use my new computer to run big R projects that I have, so I want to be sure whatever I do on it is not a mess. If anyone has any suggestions, I'd be grateful for them. I hope this isn't wasting anyone's time, but I'm stumped.

I have the code below (you can find a smaller version of the data set at Joe Lefkowitz's site, linked in the sidebar). Keep in mind that I haven't yet attempted to choose any sort of optimal bandwidth for the given data; I'm just 'eyeballing it'. Either way, the bandwidth does not seem to affect the flipped-axis representation on my new version of R.

###########make color palette

library(RColorBrewer)

brewer.pal(11, "RdYlBu")

buylrd <- c("#313695", "#4575B4", "#74ADD1", "#ABD9E9", "#E0F3F8", "#FFFFBF", "#FEE090", "#FDAE61", "#F46D43", "#D73027", "#A50026")

####all umpires

umpcall <- subset(pitches, pitches$called_by_ump==1)

umpcall$call_type <- ifelse(umpcall$type=="Strike", 1, 0)

head(umpcall)

###standard plot for strike zone

##NOTE: tends to make axes wider than heat maps (png width/height issue)

##because there is not a key, but no big deal here

png(file="truestrikes.png", height=675, width=600)

plot(umpcall$pz[umpcall$call_type==1] ~ umpcall$px[umpcall$call_type==1], col="darkgreen",

xlab="Horizontal Location", ylab="Vertical Location", main="Strike Location", xlim=c(-2,2), ylim=c(0,5))

rect(-0.708335, mean(umpcall$sz_bot), 0.708335, mean(umpcall$sz_top), border="black", lty="dashed", lwd=2)

dev.off()

###############################GAM VERSION

library(gam)

fit.gam <- gam(call_type ~ lo(px, span=.1, degree=1) + lo(pz, span=.1, degree=1), family=binomial(link="logit"), data=umpcall)

myx.gam <- matrix(data=seq(from=-2, to=2, length=100), nrow=100, ncol=100)

myz.gam <- t(matrix(data=seq(from=0,to=5, length=100), nrow=100, ncol=100))

fitdata.gam <- data.frame(px=as.vector(myx.gam), pz=as.vector(myz.gam))

mypredict.gam <- predict(fit.gam, fitdata.gam, type="response")

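##reshape predictions to 100 x 100 (rows = px, columns = pz); nrow should be a single number, not c(100,100)

mypredict.gam <- matrix(mypredict.gam, nrow=100, ncol=100)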

png(file="2010AllStrikeZoneGAMb.png", width=600, height=675)

filled.contour(x=seq(from=-2, to=2, length=100), y=seq(from=0, to=5, length=100), z=mypredict.gam, axes=TRUE, zlim=c(0,1),

nlevels=50, color.palette=colorRampPalette(buylrd), main="2010 Strike Zone Map (GAM Package)", xlab="Horizontal Location (ft.)",

ylab="Vertical Location (ft.)",

plot.axes={

axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")

axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")

rect(-0.708335, mean(umpcall$sz_bot), 0.708335, mean(umpcall$sz_top), border="black", lty="dashed", lwd=2)

},

key.axes={

ylim=c(0,1.0)

axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")

})

text(1.4, 2.5, "GAM Model Probability of Strike Call", cex=1.1, srt=90)

dev.off()
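To compare the two machines numerically rather than by eye, the prediction matrix from each could be saved (for instance with `saveRDS(mypredict.gam, "pred_office.rds")`, where the file name is just a placeholder) and then loaded side by side with `readRDS()`. The sketch below shows the comparison step only. The helper `compare_fits` is a made-up name, and the two matrices are toy stand-ins generated from an arbitrary formula, not real model output.

```r
## Hypothetical helper: how far apart are two square prediction matrices,
## as they are and with one of them transposed?
compare_fits <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)), nrow(a) == ncol(a))
  c(max_abs_diff = max(abs(a - b)),
    max_abs_diff_if_transposed = max(abs(a - t(b))))
}

## Toy stand-ins for the two machines' output (NOT real results)
gx <- seq(from=-2, to=2, length.out=100)
gz <- seq(from=0, to=5, length.out=100)
office <- outer(gx, gz, function(x, z) plogis(4 - 8*x^2 - 3*(z - 2.5)^2))
home <- t(office)   # deliberately transposed to mimic a flipped surface

res <- compare_fits(office, home)
print(res)
```

A transposed-difference near zero would point to a pure axis flip somewhere in the pipeline. If both differences are clearly nonzero, the two fits themselves disagree, which would fit the impression that one strike zone looks slightly larger than the other.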

Millsy

Couldn't find the data link you were talking about. If you mail me a cut of your data I am happy to take a look.

jrbeamer at gmail dot com