Thursday, July 18, 2013

Advanced sab-R-metrics: Parallelization with the 'mgcv' Package

Carson Sievert (creator of the really neat pitchRx package) and Steamer Projections posed a question about reasonable run times of the mgcv package on large data in R yesterday, and promised my Pitch F/X friends I would post here with a quick tip on speeding things up.  Most of this came from Google searching, so I do not claim to be the original creator of the code used below.

A while back, I posted on using Generalized Additive Models instead of the basic loess package in R to fit Pitch F/X data, specifically with a binary dependent variable (for example, probability of some pitch being called a strike).  Something went weird with the gam package, so I switched over to the mgcv package, which has now provided the basis of analysis for my 2 most recent academic publications.  I like to fancy this the best way to work with binary F/X data--but I am biased.  Note that the advantages of the mgcv package can also be leveraged in fitting other data distributions besides the binomial.  This includes the negative binomial distribution, which can be more helpful for data that are zero-inflated (probably most of the other binomial data we want to model in baseball pitches).

One advantage of mgcv is that it uses generalized cross-validation in order to estimate the smoothing parameter.  Why is this important?  Well, because we have different sample sizes when making comparisons--for example, across umpire strike zones--and we also have different variances, we might not want to fit each one the same.  Additionally, smoothing by looking at the plot until it "looks good" can create biases.  Therefore, this allows a more objective way to fit the data.  I also like the ability to fit interactions of the vertical and horizontal location variables.  If you fit them separately and additively, you end up missing out on some of the asymmetries of the strike zone.  These ultimately tend to be pretty important, with the zone tipping down and away from the batter (See the Appendix for comparison code, see below for picture of tilt; figure from this paper).

One thing that I did note on Twitter is that for the binary models, a larger sample size tends to be necessary.  My rule of thumb--by no means a hard line--is that about 5,000 pitches are necessary to make a reasonable model of the strike zone.  The is close to the bare minimum without having things go haywire and look funky like the example below, but depending on the data you might be able to fit a model with fewer observations. 

Also, if you know a good way to integrate regression to the mean in some sort of Bayesian fashion in these models, that might help (simply some weighted average of all umpire calls and the pitches called by Umpire X that does not have enough experience yet).

Because R tends to work on a single thread, instead of using all the cores on your computer, the models can become rather cumbersome.  Believe me, I know.  For a while, I was fitting models with 1.3 million pitches, 125 dummy fixed effects, and some 30 other control variables at a time for this paper.  It took anywhere from 1-3 hours, depending on whether my computer felt happy that day--and I kept forgetting to include a variable here, change something there, etc.

OK, so parallelization.  It's actually incredibly easy in the mgcv package.  You first want to know if your computer has multiple cores, and if so, how many.  You can do this through the code below (note that I first load all the necessary packages for what I want to do):

###load libraries
###see if you have multiple cores
###indicate number of cores used for parallel processing
if (detectCores()>1) {
cl <- makeCluster(detectCores()-1)
} else cl <- NULL
Created by Pretty R at

That last 'cl' just tells you how many cores you will be using.  Note that this leaves one of your cores ready for processing other things.  You can use all of them, but it could end up keeping you from being able to do anything else on your computer while your model is running.  You can also use less.  Simply change the second line from '-1' to '-2', or whatever you want to do.  From here, mgcv has a single command for using multiple cores.  You'll want to use the 'cl' designation as the cores to use.

One should also note that, in R, large data sets and massive matrix inversions take up a significant amount of RAM.  When I came to Florida I had to convince our IT people that I needed at least 32 GB of RAM, specifically to run the models in the paper linked above.  Running the single model got me up to 8-10 GB, while doing multiple models in a single instance in R subsequently maxed me out at around 28 GB before I closed R and opened another instance.  This is a limitation that can be addressed to some extent with mgcv, but if you're not running every single pitch available in the F/X database, you probably won't have to worry about this. 

In case you do, mgcv also has a nice option that breaks the data up into chunks and has a much lower memory footprint.  It is called bam() and works just as the gam() function does, but allows analysis on larger data sets when you have more limited memory by breaking it into chunks.  The help file claims that it can work much faster on its own in addition to saving memory.  And--most relevant to this post--this is the function that includes the option to parallelize your analyses.  The code is exactly the same with the extra command using our 'cl' defined above.  Note that I use the combined smooth and limit the degrees of freedom of the smooth to 50.  Those are, of course, choices of the modeler and dependent on the type of data you are analyzing.

###fit your model
strikeFit1 <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp", data=called, 
   family=binomial(link="logit"), cluster=cl)
Created by Pretty R at

Boom.  That's it.  You can also consider fitting smooths based on handedness.  You can do one for each type of batter by breaking up the data and the modeling, or you can do the following below:

###fit model while controlling for batter handedness
strikeFit2 <- bam(strike_call ~ s(px, pz, by=factor(batter_stand), k=51) + factor(batter_stand), 
   method="GCV.Cp", data=called, family=binomial(link="logit"), cluster=cl)
Created by Pretty R at

And of course you can add covariates to your model that you want to estimate parametrically, such as the impact of count or pitch type:

###fit model controlling for batter handedness, count, and pitch type
strikeFit3 <- bam(strike_call ~ s(px, pz, by=factor(batter_stand), k=51) + factor(batter_stand) + 
   factor(pitch_type) + factor(BScount), method="GCV.Cp", data=called, family=binomial(link="logit"), cluster=cl)
Created by Pretty R at

With the model, creating figures is as easy as using the predict() function and using code as I have shown here before.  And, thanks to Carson, much of the figure production is now automated in the pitchRx package.

Note that much of my reading about this package comes from an excellent book by its creator, Simon Wood, called Generalized Additive Models: An Introduction with R.  If these models are interesting to you, this is a must have resource.

Appendix: The reason I use the interaction term is that the UBRE score is significantly better by doing so, as suggested in the previously cited text.  The code to compare the two models is also included below.  Note that your variable names and data name may differ, so change accordingly:
###Model with separate smooths
fit <- bam(strike_call ~ s(px, k=51) + s(pz, k=51), method="GCV.Cp", data=called, 
   family=binomial(link="logit"), cluster=cl)
###Model with combined smooth
fit.add <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp", 
   data=called, familiy=binomial(link="logit"), cluster=cl)
###combined smooth UBRE score is lower
###compare models with Wald test
anova(fit, fit.add)
Created by Pretty R at

Friday, July 5, 2013

What is Big Data, anyway?

My graduate school advisor, Rod Fort, posed the question in this post's title on Twitter today.  I gave answers and, as he usually does, he made me think more about my answers and their precision.  Technically, what I was trying to get across was that the use of Big Data, in most cases, is terribly imprecise.  I should have been able to explain the use of the term quickly, but it took a while and a number of "well, we've always done that" from Rod.  It is thrown around a lot, and in most cases not in any meaningful way.  I got a similar reaction to my mention of a prospective certificate in Complex Systems while at Michigan (which I did not pursue--mainly because my mathematical background wasn't strong enough and I had time constraints pursuing other things).

So, assuming we want to separate the use of "Big Data" with "Analytics", I think we can amply sum up the term with the following:

Big Data describes the relationship between the ability to collect data, and the ability to do something with it.  Data is BIG at the margin at which one more unit of data would leave us unable to analyze it all with the given technological capability.

This leaves Big Data flexible for the given tool.  The growth to collect and store large amounts of data has outpaced the ability to do anything meaningful with it.  This isn't anything new.  In the same way that dynamic pricing isn't really a new idea, just a new implementation.  In the same way that analytics aren't new, just a clear recognition of the integration between statisticians, programmers, and managers in the use of the term today.  In the same way that Moneyball, the idea, isn't new.  All tend to improve over time just as any field.

When it comes to analysis of Big Data--not the term big data itself--the holy grail is to have the ability to push a button, and have the answer directly to the decision maker, what I called "streamlining" on Twitter.  But this isn't Big Data itself (and it's really a fantasy at its extreme).  Certainly we can get closer to this, but data changes, behavior changes, the world changes.  These will always have to be updated, and in many ways I don't know that Big Data and Analytics as terms are completely separable.  In this case, though, let's be specific:

Analytics is the pursuit of simplistic, streamlined statistical information in a context understandable to the decision maker.

Again, unless we believe the movie Paycheck, this won't be 100% possible.  But the fantasy idea is that the computer and its data will tell us the answer to everything.  I enjoy this quote from the Big Data article linked above:  

"May 2012 danah boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communications, and Society...(3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.”

However, we can do things to ease the use of large amounts of information to make decisions.  This requires the cooperation of statisticians, programmers, and managers.  Managers need to pose the problem in a way that is tractable and understandable.  Statisticians need to know the best methods for the distribution and variability in some given set of information, and be able to communicate this back to the manager.  The programmer needs to be able to collect the data accurately--possibly with the help of many other experts--and deliver the methodology in a way that can happen in real time or as close to it as possible.  In many ways, Dynamic Pricing is an outcome of these things--but not a new idea.  Big Data commentaries and discussions are referring to closing the gap between availability and implementation.