Thursday, March 3, 2011

Follow Up to Comments to Mike Fast at B-Pro



Mike Fast has an interesting article up at Baseball Prospectus regarding the accuracy of BIS data. However, I had some suspicions that BIS was actually modeling something different from what their press release/blog said. After some comments from Mike, I grabbed data from Joe Lefkowitz's Pitch F/X site (an awesome place, btw) and produced a map in R.

It looks like, despite the fact that BIS explained the strikeout map as densities of pitch locations, it's actually a heat map of the probability of striking out in a 2-strike count, given the location of the pitch. That makes density maps tough to use as comparisons (as Mike did in the article). Below, I use a simple loess function (rather than gam, which is bad juju on data sets this small, in my experience), and it reproduces something very close to the BIS plot that Mike highlights. It is VERY IMPORTANT to keep in mind that I did not rescale the legend on the side to a 0-1 probability scale. That's okay, just don't pay attention to the numbers. The colors should still represent the correct view of the data under this type of model.
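To make the density-versus-probability distinction concrete, here is a rough sketch on made-up data (the simulated pitches, the coefficients, and the 1-ft. bins are all my own assumptions for illustration, not Pitch F/X or BIS numbers). It summarizes the same pitches two ways: the share of all strikeouts that land in each cell, and the strikeout rate within each cell.

```r
# Simulated 2-strike pitches (invented for illustration, NOT real data):
# most pitches land near the middle, but the chance of a strikeout
# depends on location.
set.seed(42)
n  <- 2000
px <- rnorm(n, mean = 0,   sd = 0.8)   # horizontal location (ft.)
pz <- rnorm(n, mean = 2.5, sd = 0.8)   # vertical location (ft.)
p_true <- plogis(-1.5 + 1.2 * px - 0.8 * (pz - 2.5))
kout   <- rbinom(n, size = 1, prob = p_true)

# Bin the plane into coarse 1-ft. cells
xbin <- cut(px, breaks = seq(-6, 6, by = 1))
zbin <- cut(pz, breaks = seq(-4, 9, by = 1))

# (a) Density-style summary: share of ALL strikeouts falling in each cell
k_counts  <- table(xbin[kout == 1], zbin[kout == 1])
k_density <- k_counts / sum(k_counts)

# (b) Probability-style summary: strikeouts / pitches WITHIN each cell
all_counts <- table(xbin, zbin)
k_prob     <- k_counts / all_counts    # NaN where a cell has no pitches

# Where does each summary peak? (only trust cells with a decent sample)
which(k_density == max(k_density), arr.ind = TRUE)
k_prob_ok <- ifelse(all_counts >= 30, k_prob, NA)
which(k_prob_ok == max(k_prob_ok, na.rm = TRUE), arr.ind = TRUE)
```

The density version is pulled toward wherever pitches are thrown most often, while the probability version divides that out, so the two maps need not be hot in the same places.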



One more point: modeling probabilities of striking out with such small data isn't great practice IMHO. You can get some wonky stuff (a local regression on a 0/1 outcome isn't even constrained to stay between 0 and 1). But it looks like that may be what BIS is doing in their new product, rather than plotting density (frequency) of pitch locations in certain situations.
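As a quick illustration of the small-sample problem, here is a minimal sketch on simulated data (150 invented pitches with a flat 25% strikeout chance everywhere; none of this is real data). Since the true probability is constant by construction, any hot or cold spots in the fitted surface are pure noise.

```r
# 150 simulated 2-strike pitches with a FLAT 25% strikeout chance
# (made-up numbers, for illustration only)
set.seed(1)
n   <- 150
sim <- data.frame(px = runif(n, -1.5, 1.5), pz = runif(n, 1, 4))
sim$kout <- rbinom(n, size = 1, prob = 0.25)

# Same kind of loess fit as used for the plots in this post
fit <- loess(kout ~ px + pz, data = sim, span = 0.3)

# Predict on a grid that extends past the observed pitch locations
gx   <- seq(-2, 2, length.out = 50)
gz   <- seq(0, 5, length.out = 50)
grid <- data.frame(px = rep(gx, times = 50), pz = rep(gz, each = 50))
pred <- predict(fit, grid)

sum(is.na(pred))            # grid points outside the data get no estimate
range(pred, na.rm = TRUE)   # nothing forces this to stay inside [0, 1]

# Crude patch for plotting only: clamp the fitted values to [0, 1]
pred_clamped <- pmin(pmax(pred, 0), 1)
```

Clamping only hides the symptom. It does not make the underlying estimates any more trustworthy.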


ADDENDUM: Here is the Carlos Pena map using similar parameters (a little less smoothing here than in the Crawford plot, but it doesn't affect too much...remember that the scale on the right is not very useful, so don't pay attention to the absolute numbers there).


From this, the hot zones again look similar to the BIS plots. I have no clue what their scale is, and if it really is the probability of a strikeout, they obviously use a different smoothing parameter from what I do here. Again, I'm on vacation in Napa Valley, so I haven't had a lot of time to get into the nitty gritty of the data. But I think these plots are fairly good evidence that BIS isn't plotting density. If that's the case, then their explanation (which Mike links to in the comments of his article) is not very good, and it makes it tough to decipher what's going on.
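To give a feel for how much the smoothing parameter matters, here is a small sketch that fits the same loess model with two different spans on simulated pitches (the data, the coefficients, and the span of 0.9 are assumptions for illustration; I don't know what BIS actually uses).

```r
# Simulated 2-strike pitches (invented for illustration, not real data)
set.seed(7)
n   <- 600
sim <- data.frame(px = runif(n, -1.5, 1.5), pz = runif(n, 1, 4))
sim$kout <- rbinom(n, size = 1,
                   prob = plogis(-1 + 1.0 * sim$px - 0.7 * (sim$pz - 2.5)))

# Same model, two different amounts of smoothing
fit_rough  <- loess(kout ~ px + pz, data = sim, span = 0.3)
fit_smooth <- loess(kout ~ px + pz, data = sim, span = 0.9)

# Evaluate both fits on a grid that stays inside the simulated data
gx   <- seq(-1, 1, length.out = 25)
gz   <- seq(1.5, 3.5, length.out = 25)
grid <- data.frame(px = rep(gx, times = 25), pz = rep(gz, each = 25))
p_rough  <- predict(fit_rough,  grid)
p_smooth <- predict(fit_smooth, grid)

range(p_rough)                  # spread of the lightly smoothed surface
range(p_smooth)                 # spread of the heavily smoothed surface
max(abs(p_rough - p_smooth))    # largest disagreement between the two
```

Two maps built from identical pitches can look noticeably different just from the span, which is one reason matching hot zones is more informative here than matching the numbers on the key.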


R-Code By Request:
setwd("c:/Users/Brian/Dropbox/Blog Stuff/Mike Fast Check")

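craw <- read.csv(file="crawford.csv", header=TRUE)
head(craw)

#keep only pitches thrown with two strikes
crawb <- subset(craw, strikes==2)

#kout = 1 when the pitch is a non-foul strike in an at-bat that ended in a strikeout (roughly: the strikeout pitch), 0 otherwise
crawb$kout <- ifelse(crawb$Result.Type=="S" & crawb$Atbat.Result=="Strikeout" & crawb$Pitch.Result!="Foul", 1, 0)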

#fitting loess regression and plotting it using a contour
#(data= is passed directly instead of attach(), so px and pz can't be picked up from the wrong data frame)
fit <- loess(kout ~ px + pz, data=crawb, span=.3)
myx <- matrix(data=seq(from=-2, to=2, length=100), nrow=100, ncol=100)
myz <- t(matrix(data=seq(from=0,to=5, length=100), nrow=100, ncol=100))
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))
#predict.loess has no type= argument; grid points outside the observed data come back as NA
mypredict <- predict(fit, fitdata)
mypredict <- matrix(mypredict, nrow=100, ncol=100)

png(file="crawfordKs.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=100), y=seq(from=0, to=5, length=100), z=mypredict, nlevels=50,
color=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")),
main="Carl Crawford Strikeout Rate", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.708335, mean(crawb$sz_bot), 0.708335, mean(crawb$sz_top), border="black", lty="dashed", lwd=2)
},
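key.axes={
#assigning ylim inside key.axes has no effect, so it is dropped; the key still spans the range of the fitted values, not 0-1
axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")
})
text(1.4, 2.5, "Probability of a Strikeout", cex=1.1, srt=90)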
dev.off()

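pena <- read.csv(file="pena.csv", header=TRUE)
head(pena)

#keep only pitches thrown with two strikes
penab <- subset(pena, strikes==2)

#kout = 1 when the pitch is a non-foul strike in an at-bat that ended in a strikeout (roughly: the strikeout pitch), 0 otherwise
penab$kout <- ifelse(penab$Result.Type=="S" & penab$Atbat.Result=="Strikeout" & penab$Pitch.Result!="Foul", 1, 0)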

#fitting loess regression and plotting it using a contour
#(data= is passed directly instead of attach(), so px and pz can't be picked up from the wrong data frame)
fit <- loess(kout ~ px + pz, data=penab, span=.3)
myx <- matrix(data=seq(from=-2, to=2, length=100), nrow=100, ncol=100)
myz <- t(matrix(data=seq(from=0,to=5, length=100), nrow=100, ncol=100))
fitdata <- data.frame(px=as.vector(myx), pz=as.vector(myz))
#predict.loess has no type= argument; grid points outside the observed data come back as NA
mypredict <- predict(fit, fitdata)
mypredict <- matrix(mypredict, nrow=100, ncol=100)

png(file="penaKs.png", width=600, height=675)
filled.contour(x=seq(from=-2, to=2, length=100), y=seq(from=0, to=5, length=100), z=mypredict, nlevels=50,
color=colorRampPalette(c("darkblue", "blue4", "darkgreen", "green4", "greenyellow", "yellow", "gold", "orange", "darkorange", "red", "darkred")),
main="Carlos Pena Strikeout Rate", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)",
plot.axes={
axis(1, at=c(-2,-1,0,1,2), pos=0, labels=c(-2,-1,0,1,2), las=0, col="black")
axis(2, at=c(0,1,2,3,4,5), pos=-2, labels=c(0,1,2,3,4,5), las=0, col="black")
rect(-0.708335, mean(penab$sz_bot), 0.708335, mean(penab$sz_top), border="black", lty="dashed", lwd=2)
},
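key.axes={
#assigning ylim inside key.axes has no effect, so it is dropped; the key still spans the range of the fitted values, not 0-1
axis(4, at=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), labels=c(0,.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0), pos=1, las=0, col="black")
})
text(1.4, 2.5, "Probability of a Strikeout", cex=1.1, srt=90)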
dev.off()

Created by Pretty R at inside-R.org


7 comments:

  1. Agreed, Millsy. I find that once you're under 1000 called pitches the curves get finicky, and under 500 is right out.

    If your sample size is small enough, just use a gaussian or similar smoothing function (which is essentially what kernel densities do). Ideally, the halfwidth of the kernel density is set so that the distribution of probabilities is equal to the uncertainty of the pitchf/x data (which is about 1 in.). Anything more is smoother than it needs to be; anything less is assuming a level of accuracy that isn't there.

  2. Yeah. I've found that even 3 or 4 outliers in the umpire data with about 2,500 pitches have a large-ish effect.

    A while ago, when I was contemplating using the GAM package in R for modeling probabilities, I got a comment from a statistician at UBC working on robust gam estimation. I think he's messing with the Bruce Froemming data set for some of his work, and I am interested to see what comes of that.

  3. Millsy - great stuff. Any chance you can post the R code for this?

  4. I'm happy to post it when I get home from vacation. However, it pretty much uses code directly from here:

    http://princeofslides.blogspot.com/2010/12/rethinking-loess-for-binomial-response.html

  5. Thanks for adding the code

  6. No problem. Sorry about the crazy extra spacing everywhere. The Pretty R tool seems to not work well with the "<-" assignment operator (I assume because that has some meaning in HTML).

    I'm not any sort of programmer outside of playing in R, so attempting to fix HTML is outside of my niche.

  7. Just realized an error in the code. The key should NOT say "Probability of Strike Call", but should be "Probability of a Strikeout".

    I had adapted this from my Umpire plots, and forgot to change that. Just want to give a heads up!
