Tuesday, May 31, 2011

sab-R-metrics: GIF Movies and Pitch Flights (Guest Post)

A couple weeks ago, I received an email from a fellow Pitch F/Xer and R-User, Josh Weinstock, asking if I was interested in a guest post here at Prince of Slides. I didn't think I was important enough to have talented guests posting at my blog; however, Josh pointed out that this site tends to be the place for those who are part of a niche within a niche (i.e. Sabermetrics with R), and that it would be a great place to showcase some of his own work.

Josh currently contributes to It's About the Money, a Yankee-themed ESPN SweetSpot blog. He hails from North Carolina as a die-hard baseball fan. He welcomes discussion of baseball, punk music, and funny TV shows. You can reach him at josh82093 at gmail dot com or on Twitter @J__Stock (two underscores). He has posted some pretty cool stuff like this Robinson Cano GIF image made in R. Given his interesting posts and talent in R, I figured he would make a fantastic first-ever guest poster here at the site. Here is what Josh has to say:

Pitching is complicated. In order to be successful, pitchers must have velocity, movement, location, and deception. Thanks to pitch f/x data, the first three are pretty easy to study. In fact, these three variables are more or less directly recorded for every pitch thrown in major league baseball. However, deception remains somewhat of a mystery. This is mainly because we don't really understand how deception works. However, one subset of deception is pretty easy to quantify: pitch flights. Pitch flights allow us a glimpse into how the batter actually sees the ball. This is just one more piece of data that we can use to help understand the mystery of pitching.

The following tool is intended to help add pitch flight visualizations to your analysis. Of particular importance is the recognizability of breaking balls (the size of the "hump") relative to the fastball. The point .075 seconds after the ball is released is also important, as this is the time at which Robert Adair (author of The Physics of Baseball) hypothesized that batters need to decide whether or not to swing. And the graphs are kind of cool.

Before you start, you need to install the XML and animation packages. You also need a basic knowledge of R, though if you read this website I'm sure you're prepared. If you have trouble with the tool, feel free to ask for help through email ( josh82093 at gmail dot com ) or twitter ( J__Stock ).


So from this, we have some R-code and a really cool (not just kind of cool) function that will plot pitch flights in an animated fashion. While you'll want to have R experience before using this, the function is extremely user friendly. All you need to know is how to set a working directory and the URL of the pitch data for your desired pitcher.

You can download Josh's code by clicking here. This file is the guts of the tool, and you'll use it in multiple steps. The advantage of working this way is that it gives a bit more flexibility to experienced R users. As Josh said, you'll need to install the packages "XML" and "animation" and load them with the "library()" function before you use his code. From there, open the file in R as a script, highlight the first two functions ("flightgrab()" and "plot.flight()"), and press "CTRL + R" to run them. You can then use them as you would any other function in R. With this code, you'll have to do this each time you open up R. Remember to run:

##load packages
library(XML)

library(animation)


before you start using the functions, and after you have installed the packages from the CRAN repository.
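If you haven't installed them yet, the following one-time step should do it (R will ask you to pick a CRAN mirror):

##install the required packages (only needs to be done once)

install.packages(c("XML", "animation"))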

(I have also broken this full script down into smaller files so that you can use the "source()" function in R on the individual portions of the code. This keeps you from having to highlight the code every time you open up R. I'll go over how to use "source()" in a later post, and I'll provide these files as well. If you're dying to have them, shoot me an email.)

For the "flightgrab()" function, you'll need the Brooks Baseball URL for the game and pitcher you want. It should be the page with the table format of the data. You can find these by using the drop down menus at the site. Here is an example of a the page you need to use.

That's it. Just type this within the parentheses (and be sure to put the full URL in quotes) and R will grab the data directly from the website, transform it, and turn it into a data frame for plotting the pitch flights. Here is some example code below using Mariano Rivera's pitch data on May 25, 2011 against Toronto (remember to create the function in your R workspace first, so that R will know what "flightgrab()" is):

##grab data

mariano <- flightgrab("http://www.brooksbaseball.net/pfxVB/tabdel_expanded.php?pitchSel=121250&game=gid_2011_05_25_tormlb_nyamlb_1/&s_type=&h_size=700&v_size=500")


To check whether the data were downloaded and transformed correctly by the function, just type "mariano" (or whatever name you gave the object) to look at what is under the hood. From here, the "plot.flight()" function uses this data frame format to plot the flight of the ball. We can do this in two ways. If you take a look at the data set, it includes two pitch types: Four-Seamers and Cutters. The data-grabbing function automatically puts the information into a flight sequence with 18 data points, so each pitch has its own flight track. When we use "plot.flight()" to plot these in a single color--say, Dark Red here--we don't know which is which. Below, I have a still version of the plot using a single color:
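For a quick sanity check, something like this works (assuming you kept the name "mariano"):

##peek at the scraped and transformed data

head(mariano)

str(mariano)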

##plot all pitches in a single color

plot.flight(mariano, color="darkred", strikezone=T)


However, we can condition color on the pitch type variable in the data set. We've gone through this in previous tutorials, and it can be done with the following simple code:

##make pitches different colors

plot.flight(mariano, color=mariano$type, strikezone=T)


You can then add the standard keys/legends for each pitch type. Remember that R assigns colors by the alphabetical order of the pitch-type levels: FC (the Cutter) comes first and gets color "1", which is black, so FF is the red pitch flight shown. If you're only interested in plotting a single pitch type, simply use the code:


##plot only Mariano Rivera's Cutter

marianoFC <- subset(mariano, mariano$type=="FC")

plot.flight(marianoFC, color="darkred", strikezone=T)


And you can leave out the strike zone box by not using the "strikezone=" option in the function (i.e. the default is no strike zone). But up to now, this isn't getting to the point. The point of all this is showing an animated version of the pitch flight. For this, Josh created a nice little "for loop". For loops are something I'll get to in some advanced plotting and simulation in the sab-R-metrics series, but essentially what this one does is create a plot for each of the 18 frames of the pitch. When we put these together in a GIF, it comes out as animation (just like a flip-book cartoon). In Josh's original script, this is handled by the "saveMovie()" function. For this, you'll need to download a program called ImageMagick. You can download it at the link. It allows us to write GIF files from R using this function. Go ahead and do that now.

Okay, so now we're ready to create a Mariano Rivera movie. For this, we'll use the code from the "saveMovie()" function with a for loop to indicate each frame for each time interval in our data frame. Here is an example following directly from the code above:

####now use the saveMovie stuff
# Create gif


saveMovie({
  for(i in unique(mariano$time)) {
    plot.flight(mariano[mariano$time==i,], col=c(1,2), strikezone=T)
    text(3.7, 8, "Cutter", col=1, cex=1.3)
    text(3.7, 7.6, "Four Seam", col=2, cex=1.3)
  }
}, movie.name='mariano.gif', interval=.5)


The GIF should open up in ImageMagick automatically. Once there, you can right-click on the GIF and save it wherever you want. (Note that you need to click on the GIF image above for it to be animated.)

For fun, Josh also provided some code for A.J. Burnett. I have the code and the animation below comparing his knuckle-curve with his four-seam fastball (remember, these are Gameday pitch types).
Note that for some reason the ball size is reversed in the animations. I'm not sure why this happens, but I'm trying to work it out. I've been toying around with Josh's original scaling of the perceived ball size, and I seem to have messed something up when it goes into the GIF-creation step. Optimally, I think the ball should get larger as it nears the plate, rather than smaller.

Thanks to Josh for providing this and posting it up here. This is some great work and hopefully others out there can put this function to good use!

Monday, May 30, 2011

Mythical Legends of a College Athlete

Chubby Johnson is a star outfielder for the Northwest Michigan State Tech Rhinoceroses, a not-so-well known Division III baseball team that I made up. Johnson has a chance to get drafted in the upcoming MLB draft, but has decided baseball just isn't for him. So back in March, Chubby went through his career center to find a job at a large financial firm.

Chubby majored in psychology with a 3.3 GPA. He's a solid but not great student. Chubby has always wanted to make lots of money on Wall Street and drive a BMW, but the partying and women just didn't allow him to do math homework every night. That's why he majored in Psychology.

His career center encouraged him to use the alumni network to find NMST grads who may work in the field out on Wall Street. He found a previous graduate named John Scomb. Scomb, it turns out, is one of the biggest NMST baseball fans that ever lived. Despite Chubby's mediocre GPA and psychology major, Scomb invites him out to New York to be wined, dined and interviewed alongside a number of Harvard Business School grads and MIT Finance majors. Chubby graciously accepts.

Chubby's plane ticket arrives in April, and the interview comes in the midst of the conference tournament. Luckily, Chubby is able to work around this and flies out to New York after NMST wins the conference tourney and a berth to regionals. As would be expected, the flight is on the financial company's dollar.

When he gets there, Chubby is greeted by a driver that will take him to a nice hotel in Manhattan. Around noon, Scomb shows up with his boss--Chuck McCourt--and they take Chubby out to lunch at a swanky restaurant. Scomb pays for the meal. During the lunch, they mostly talk baseball and ask about NMST's chances of winning it all this year.

After lunch, Chubby is taken to a Yankees game--his favorite team--in the company luxury suite. He gets a back massage and has a couple beers, then heads back to the hotel to take a shower before going out to the bar that night.

The next day, Chubby is hungover. He can't think straight, but he has to make it to the interview on time. The driver is waiting for him and takes him to the 9 am interview. He walks in ready to be grilled by Scomb and his boss. But only Scomb is there. He asks Chubby why he wants to work in the financial industry after majoring in Psychology. Chubby tells him he's always loved the idea of 'being on Wall Street'. They talk a little more baseball, Chubby does some math problems, and Scomb tells him he's hired. He's always wanted a fellow NMST-er there with him at the company.

Chubby goes back to NMST to play out the season, and tells his coach all about how he won't be entering the draft, but was offered a job on Wall Street by Scomb. The coach knew Scomb, who had tried out for the team but been cut his freshman year at NMST.

The Rhinoceroses win the NCAA Division III National Championship in Appleton, Wisconsin. Chubby graduates and moves to New York to begin his new life as a broker--much in thanks to being a standout baseball player at NMST. Ever since that year, NMST has proudly displayed the trophy and a picture of Chubby Johnson as the tournament MVP in their brand new gym--mostly paid for by the huge influx of applications after students heard about the great baseball atmosphere there.

Should the baseball coach at NMST resign?

After all, a job on Wall Street is worth a lot more than a free tattoo at some random tattoo parlor in Columbus, Ohio.

Wednesday, May 25, 2011

sab-R-metrics: Kernel Density Smoothing

Last time I left you, I had gone over some basics of doing loess regression in R. If you remember, loess is a sort of regression that allows wigglyness in the relationship between some dependent variable Y and some independent variable X (I will generalize this to more than one dimension later on). However, we're not always interested in how X affects Y. Sometimes we may simply be interested in the distribution and frequencies of some variable X. An example that comes to mind is Pitch F/X data, where we want to see the frequency of pitch locations in the strike zone.

For showing pitch location, we're going to need two dimensions. Today, I'm going to begin with a single dimension of kernel density smoothing. It's important to understand what this method is doing before jumping into displaying it in these multiple dimensions like the Pitch F/X heat maps you see everywhere nowadays.

For this tutorial, I want you to first go ahead and grab the Jeremy Guthrie pitch data from last time. You can download it directly here. We'll look to smooth the starting velocity of pitches for Guthrie from the Pitch F/X data.

##set working directory and load data and take a look at it

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")


guth <- read.csv(file="guthrie.csv", h=T)


head(guth)


Okay, now that the data is loaded in, let's start with a histogram. Why? Well, density smoothing and histograms are very much related. Kernel density smoothing is really just a smooth, pretty version of a histogram. We'll start simple with looking at the distribution of speeds for all pitches, then get a little more advanced and overlay the velocity of each pitch type on top of one another. For this exercise, we'll have to look back to a previous post and use the "hist()" function in R. This will produce a nice histogram with the defaults. Go ahead and try it yourself. Practice using the RGB scheme and making transparent colors if you want. The histogram function has a bunch of options, but I'll keep things basic for now. Don't forget to give your plot a title and label the axes so we know what we're plotting. Finally, be sure to tell R that you want to plot the density with the command "freq=FALSE" within the histogram function. We'll see why a little later. If you get stuck, you can reference the code below:

##create histogram of Guthrie pitch speeds

hist(guth$start_speed, xlab="Speed Out of Hand (MPH)", main="Jeremy Guthrie Pitch Speed Distribution", freq=FALSE, col="#99550070")


Okay, this looks pretty much like what we had in the previous tutorial. The range of pitch speeds included is strange on the left of the plot (we'll fix this later, it's likely just PFX data issues on 1 or 2 data points). But what does this have to do with kernel density smoothing? Well, density smoothing is another way of representing this distribution. If we don't really like putting things in "bins" in a histogram, we can use a smoother to look at the distribution. There are some real advantages to this, as we will see in a few paragraphs. I'll provide some very brief background on what the kernel smoother actually does.

In a histogram, we use bins with a given bandwidth to group together observations and get a rough estimate at the probability density function (PDF...not the Adobe kind) of our data. We can affect the shape by changing the bandwidth and number of bins if we'd like. But the discrete display isn't always optimal. The kernel smoother allows us to take a weighted average at each observation. The points that are closer to each observation will be weighted more heavily in this average, while those far away are weighted less. This allows us to get a smooth representation of the density (much like the loess does when we have a dependent variable of interest--though loess is a variable bandwidth smoother). There are a number of ways to weight your density, but there are some standard "kernels" that are used in practice. The most popular are the Gaussian and the Epanechnikov kernels. Others include the Triweight, Biweight, Uniform, and so on. Really you can weight it however you want, but the Gaussian and Epanechnikov should do the trick for purposes here.

So, for the Gaussian kernel, picture the normal distribution. In kernel smoothing, we use the height of the standard normal to weight all the observations within the vicinity of each observation in our data set. So, for a 90 mph pitch, those that are 91 mph and 89 mph are likely weighted heavily in deciding the height of our smoothed representation. As you get further from the middle of the normal, you notice that the tails get shorter and shorter. Since the height of the curve is the weight, those observations further and further from our point of interest will be weighted less and less based on these tails. Moving this along each value in our distribution of pitch speeds (i.e. this is done multiple times for each value in the range of the data), we get a weighted, smoothed height of our density curve.
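To make the weighted-average idea concrete, here is a minimal hand-rolled sketch of a Gaussian kernel estimate at a single point; the 90 mph evaluation point and the bandwidth of 1 are arbitrary values for illustration, not anything "density()" requires:

##hand-rolled Gaussian kernel density estimate at one point (illustration only)
gauss.dens.at <- function(x0, x, h){
  x <- x[!is.na(x)]
  mean(dnorm((x0 - x)/h))/h    ##average kernel weight, scaled by the bandwidth
}
gauss.dens.at(90, guth$start_speed, h=1)    ##estimated density of pitches around 90 mph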

The default in R is the Gaussian kernel, but you can specify what you want by using the "kernel=" option and just typing the name of your desired kernel (i.e. "gaussian" or "epanechnikov"). Let's apply this using the "density()" function in R and just using the defaults for the kernel.

##estimate pitch speed density and plot (be sure to tell R to ignore missing values!)

pdens <- density(guth$start_speed, na.rm=T)
pdens1 <- density(guth$start_speed, bw=.1, na.rm=T)
pdens5 <- density(guth$start_speed, bw=.5, na.rm=T)
pdens40 <- density(guth$start_speed, bw=4, na.rm=T)


par(mfrow=c(2,1))

plot(pdens, col="black", lwd=3, xlab="Speed Out of Hand (MPH)", main="Default KDE")
plot(pdens1, col="red", lwd=2, xlab="Speed Out of Hand (MPH)", main="Kernel Density Estimation")

lines(pdens5, col="green", lwd=2)
lines(pdens40, col="gold", lwd=2)
lines(pdens, col="black", lwd=2)

text(50, 0.15, "bw = 0.1", col="red", cex=1.3)
text(50, 0.14, "bw = 0.5", col="green", cex=1.3)
text(50, 0.13, "bw = 4", col="gold", cex=1.3)
text(50, 0.12, "bw = nrd0", col="black", cex=1.3)


In the first part of the code above, I simply let R calculate the bandwidth using a rule of thumb (called "nrd0" in R, used in the first plot). I won't get too much into optimizing bandwidth, but the logic is the same as with loess smoothing: you may want to play with different bandwidths and see how this affects the smoothing and the look of the distribution. You can adjust this using the option "bw=" within the "density()" function. I've plotted multiple lines with different bandwidths in the bottom plot above to illustrate.

You can see that the optimized bandwidth (the default) seems to smooth nicely. On the other hand, the red line (a small bandwidth) is much too wiggly, while the yellow line (very large bandwidth) tells us that Guthrie gets the ball up over 100 mph fairly often. Also, we see there seem to be some outliers (likely Pitch F/X errors). We can fix this up a bit by using some additional options in the "density()" function in R. By telling R to smooth "from=" and "to=", we ensure that we're only smoothing over the real range of the data and can ignore those weird few points on the left. In addition to this, we can ensure that there are fewer issues at the edges of the data (i.e. smoothing up past 100 mph--we had similar issues with loess smoothing if you remember the widening of the confidence intervals at the edges of the data range) using the option "cut=", but I won't cover this option today. I encourage you to fiddle with it on your own. Let's try limiting the range of the smoothing for now:

###limit smoothing range to get rid of crappy FX velocity data

pdensB <- density(guth$start_speed, na.rm=T, from=65, to=100)


plot(pdensB, col="black", lwd=3, xlab="Speed Out of Hand (MPH)", main="Kernel Density Estimation")
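If you do want to play with the "cut=" option mentioned above, a quick example looks like the following; the value 0 here is just an arbitrary illustration (it stops the estimate exactly at the minimum and maximum of the data):

##cut=0 keeps the estimate from extending past the observed range

pdensC <- density(guth$start_speed, na.rm=T, cut=0)

plot(pdensC, col="black", lwd=3, xlab="Speed Out of Hand (MPH)", main="KDE with cut=0")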



With the range restricted, the plot is a little bit easier to look at. Now, to see how this relates to our histogram, let's overlay the density curve on the histogram. Call the histogram, and then we'll add the above density curve using "lines()". Note that we'll have to set the y-axis limits to ensure that the entire smooth shows up on our plot (the range of the density is larger for the smooth than for the default histogram) and reset the x-limits to keep the histogram from including all the useless left-tail pitch speeds.

###add density to original histogram

hist(guth$start_speed, xlab="Speed Out of Hand (MPH)", main="Jeremy Guthrie Pitch Speed Distribution", freq=FALSE, col="#99550070", ylim=c(0,0.13), xlim=c(65, 100))


lines(pdensB, col="black", lwd=3)


Notice how the smoother allowed us to see the tri-modal distribution of pitch speeds, while the histogram may not have provided us with enough bins. This suggests that Guthrie is throwing 3 (or 4) different pitch types with different speeds. Let's see what these are by estimating the density by pitch type. I'll group the fastball types together, and treat curveballs, change-ups, and sliders each separately. We can put them on the same plot and see how they compare.

###now separate by pitch type
pdens.F <- density(guth$start_speed[guth$pitch_type=="FA" | guth$pitch_type=="FF" | guth$pitch_type=="FC" |
guth$pitch_type=="FT" | guth$pitch_type=="SI"], from=65, to=100)

pdens.CU <- density(guth$start_speed[guth$pitch_type=="CU"], from=65, to=100)

pdens.SL <- density(guth$start_speed[guth$pitch_type=="SL"], from=65, to=100)


pdens.CH <- density(guth$start_speed[guth$pitch_type=="CH"], from=65, to=100)


plot(pdens.F, col="black", lwd=2, xlab="Speed Out of Hand (MPH)", main="Pitch Speed Distribution by Pitch Type", ylim=c(0,.25))


lines(pdens.CU, col="darkred", lwd=2)

lines(pdens.SL, col="darkgreen", lwd=2)

lines(pdens.CH, col="darkblue", lwd=2)


text(67, 0.25, "Fastballs", col="black")

text(67, 0.24, "Curveballs", col="darkred")

text(67, 0.23, "Sliders", col="darkgreen")

text(67, 0.22, "Change-Ups", col="darkblue")



So, given what we see in the plot above, it looks like Gameday may have mis-classified some Sliders as Curveballs. Guthrie's Change-Up and Slider seem to sit near the same velocity, while--of course--the fastest pitches are the fastball variants classified here. It also looks like one or two Curves and Sliders may have been mis-classified as well, given the tiny bump on the right of the density for these pitches. Of course, this isn't cluster analysis (though this type of comparison is sort of what clustering does, and I'll go over formal cluster analysis later in the series...promise!). However, we may be able to discern some things by running this same sort of comparison for the movement variables included in the Pitch F/X data. I'll leave that up to the reader to fiddle around with as a learning exercise.

Now that we've covered kernel density estimation in a single dimension, we can move on to covering this in two dimensions. Estimating it is quite easy in R, but displaying the data is the difficult part. We'll need a third dimension to display data. This can be done using color or using a 3-dimensional looking plot. I prefer color, as it creates the really neat heat maps that we've seen around the net.
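As a preview, a minimal two-dimensional sketch might look something like the code below. Note that I'm assuming the horizontal and vertical location columns in the Guthrie data are named "px" and "pz" (the usual Pitch F/X names); adjust if yours differ:

##two-dimensional density preview (assumes px and pz location columns)
library(MASS)
loc <- na.omit(guth[, c("px", "pz")])
dens2d <- kde2d(loc$px, loc$pz, n=100)
image(dens2d, col=rev(heat.colors(50)), xlab="Horizontal Location (ft)", ylab="Vertical Location (ft)", main="Guthrie Pitch Location Density")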


And...Pretty R Code this time!

#############################
################Kernel Density Estimation and Plotting (single dimension)
#############################

##set working directory and load data
setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

guth <- read.csv(file="guthrie.csv", h=T)
head(guth)

###make histogram of Guthrie pitch speed
png(file="GuthHist1.png", height=500, width=500)
hist(guth$start_speed, xlab="Speed Out of Hand (MPH)", main="Jeremy Guthrie Pitch Speed Distribution",
freq=FALSE, col="#99550070")
dev.off()

##estimate pitch speed density (be sure to tell R to ignore missing values!)
pdens <- density(guth$start_speed, na.rm=T)
pdens1 <- density(guth$start_speed, bw=.1, na.rm=T)
pdens5 <- density(guth$start_speed, bw=.5, na.rm=T)
pdens40 <- density(guth$start_speed, bw=4, na.rm=T)

png(file="GuthDens1.png", height=1000, width=650)
par(mfrow=c(2,1))
plot(pdens, col="black", lwd=3, xlab="Speed Out of Hand (MPH)", main="Default KDE")
plot(pdens1, col="red", lwd=2, xlab="Speed Out of Hand (MPH)", main="Kernel Density Estimation")
lines(pdens5, col="green", lwd=2)
lines(pdens40, col="gold", lwd=2)
lines(pdens, col="black", lwd=2)
text(50, 0.15, "bw = 0.1", col="red", cex=1.3)
text(50, 0.14, "bw = 0.5", col="green", cex=1.3)
text(50, 0.13, "bw = 4", col="gold", cex=1.3)
text(50, 0.12, "bw = nrd0", col="black", cex=1.3)
dev.off()

###limit range of smoothing
pdensB <- density(guth$start_speed, na.rm=T, from=65, to=100)
png(file="GuthDens2.png", height=500, width=600)
plot(pdensB, col="black", lwd=3, xlab="Speed Out of Hand (MPH)", main="Kernel Density Estimation")
dev.off()

###add density to original histogram
png(file="GuthDensHist.png", height=500, width=600)
hist(guth$start_speed, xlab="Speed Out of Hand (MPH)", main="Jeremy Guthrie Pitch Speed Distribution",
freq=FALSE, col="#99550070", ylim=c(0,0.13), xlim=c(65, 100))
lines(pdensB, col="black", lwd=3)
dev.off()


###now separate by pitch type
pdens.F <- density(guth$start_speed[guth$pitch_type=="FA" | guth$pitch_type=="FF" | guth$pitch_type=="FC" |
guth$pitch_type=="FT" | guth$pitch_type=="SI"], from=65, to=100)
pdens.CU <- density(guth$start_speed[guth$pitch_type=="CU"], from=65, to=100)
pdens.SL <- density(guth$start_speed[guth$pitch_type=="SL"], from=65, to=100)
pdens.CH <- density(guth$start_speed[guth$pitch_type=="CH"], from=65, to=100)

png(file="GuthDensType.png", height=500, width=600)
plot(pdens.F, col="black", lwd=2, xlab="Speed Out of Hand (MPH)",
main="Pitch Speed Distribution by Pitch Type", ylim=c(0,.25))
lines(pdens.CU, col="darkred", lwd=2)
lines(pdens.SL, col="darkgreen", lwd=2)
lines(pdens.CH, col="darkblue", lwd=2)
text(67, 0.25, "Fastballs", col="black")
text(67, 0.24, "Curveballs", col="darkred")
text(67, 0.23, "Sliders", col="darkgreen")
text(67, 0.22, "Change-Ups", col="darkblue")

dev.off()

Wednesday, May 11, 2011

sab-R-metrics: Basics of LOESS Regression

Last week, I left you off at logistic regression. This week, I'll be pushing the limits of regression analysis a bit more with a smoothing technique called LOESS regression. There are a number of smoothing methods that can be used, such as Smoothing Splines or simple Local Linear Regression; however, I'm going to cover LOESS (loess) here because it is very flexible and easy to implement in R. Remember that here, I'm not going to cover too much of the quantitative portion of the methods. That means that if you plan on using loess in your own work, you should probably read up on what it is actually doing. I'll begin with a brief, non-mathematical description.

When we run regressions using OLS procedures, there is an assumption that the relationship between X and Y is linear--that as X changes, Y changes at the same rate across the entire range of X. Of course, in the real world this is not always the case. You can use polynomials in linear regression to address the issue, but sometimes other methods may be necessary. This is where smoothing comes in.
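As a quick aside, the polynomial fix looks something like this; the data here are simulated placeholders just to show the syntax, not anything from today's tutorial:

##hypothetical example: adding a quadratic term in plain lm()

set.seed(42)

x <- seq(0, 10, by=0.1)

y <- 3 + 2*x - 0.3*x^2 + rnorm(length(x), sd=1)

quad.fit <- lm(y ~ poly(x, 2))

summary(quad.fit)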

Using smoothers, there is no restriction on the functional form between X and Y with respect to intensity of the relationship, or direction (positive or negative). Of course, this means our fits are a bit more computationally intensive. And if not careful, it is very easy to overfit the data by trying to include every wiggle we see. But if done properly, one may be able to glean some extra information from the data by using a smoother instead of a restrictive linear model.

So what is loess? Well, as I said, there are a number of smoothers out there. The advantage of loess (and its predecessor 'LOWESS') is that it allows a bit more flexibility than some other smoothers. The name 'loess' comes from LOcal regrESSion: it fits locally weighted least squares regressions, using the data near each point to estimate our Y variable. It is also known as a variable-bandwidth smoother, in that it uses a 'nearest neighbors' window to smooth. If you are interested in the guts of loess, a Google search should do you just fine.

As usual, there is a nice easy function for loess in R. The first thing you'll need to do is download a new data set from my site called "guthrie.csv". This is a Pitch F/X data set from Joe Lefkowitz's site including all pitches by Jeremy Guthrie from 2008 through 2011 (as of May 11, 2011). If you are a Baseball Prospectus reader and ran across Mike Fast's most recent article, you'll understand why I think this is a nice data set for implementing loess...or at least you will by the end of this tutorial.

Once you have the data, go ahead and set your working directory and load it in. I'm naming my initial version of the data "guth":

####set working directory and load data

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

guth <- read.csv(file="guthrie.csv", h=T)

head(guth)

Now because I plan on working with pitch velocity data today, I want to make sure we're including pitches of a certain type. For this reason, I want to go ahead and subset the data into only fastball variants thrown by Guthrie over this time period. That way, the data aren't contaminated with change-ups and curveballs of lower velocity. We want to look specifically at arm strength. The following code should subset the data correctly. Remember that the "|" means "OR" in R.

##subset to just fastballs and fastball variants

guthfast <- subset(guth, guth$pitch_type=="FA" | guth$pitch_type=="FF" | guth$pitch_type=="FC" | guth$pitch_type=="FT" | guth$pitch_type=="SI")

Now, because we want this to be an ordered series of pitches across time, we'll have to create a new variable to represent the sequence. For this, we'll make use of a new function that comes in very handy when working with and visualizing smoothing analyses: "seq()", which creates a sequence of numbers. The first argument, "from=", indicates the starting point of your sequence, and "to=" gives the endpoint. Finally, "by=" tells R the spacing between your points. You can put any numbers in here that you want; the smaller the "by=", the more points you will have. I'm going to keep it simple and use "by=1", so that we'll have a count of the pitches.

##create a time sequence for the pitches

guthfast$pitch_num <- seq(from=1, to=length(guthfast[,1]), by=1)

head(guthfast)

You'll notice in the "seq()" function above, I tell R to count up to the length of the dataset. By typing "length(guthfast[,1])" I am indicating that I want the number of rows (i.e. the 'length' of the first column in our data set). This way, if we count by 1, we know that every pitch will have a sequential integer-valued number in the Pitch Number variable we appended onto the data set.
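(For what it's worth, the built-in "nrow()" function gives the same count a bit more directly:)

##equivalent way to get the number of rows

nrow(guthfast)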

Now that we have this set up, let's take a quick look at Guthrie's pitch velocity over time. The code below should plot each pitch's speed as a function of the pitch number:

##plot all fastball pitch velocity by the pitch number variable

plot(guthfast$start_speed ~ guthfast$pitch_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Fastball Count (2008-2011)", main="Pitch Speed by Sequential Fastball")



Looking at the plot above we can see some semblance of a pattern, but the data seem to be too noisy to really see what is going on. Perhaps if we use the average for each game we'll be able to see something more useful. For this, we'll make use of the "tapply()" function again. If you don't remember this function, head back to the earlier sab-R-metrics posts and check it out. This function allows us to quickly take the average fastball velocity by game using the following code:

##get mean fastball velocity by game

aggdat <- tapply(guthfast$start_speed, guthfast$gid, mean)

aggdat <- as.data.frame(aggdat)

head(aggdat)

This function spits out a vector with the game IDs as the row names. However, we convert it to a data frame using the function "as.data.frame()" so that we can use our standard object calls and variable names. Unfortunately, you'll see that the velocity column doesn't have the right name: it's just called "aggdat". We can use an easy function in R to fix this up. But first, let's append a count of the game numbers, just like we did with the pitch numbers, so that we have them sequentially over time:

##create game numbers

aggdat$game_num <- seq(from=1, to=length(aggdat[,1]), 1)


##change column names

colnames(aggdat) <- c("start_speed", "game_num")

head(aggdat)

As you can see, our column names are what they should be now. Using this new data set, let's again plot the fastball velocity over time. Note that we reduced the data to only 101 data points (games):

###plot average velocity by game across time

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Game Number (2008-2011)", main="Pitch Speed by Sequential Game")


Here, we see that there are many fewer data points than before. But it's still a bit tough to understand any pattern going on. We could start with an OLS model to fit the data linearly. Using the code below, you should get output that tells you Guthrie's fastball velocity is decreasing over time and that this is significant at the 1% level. But is that really the case? Once the model is fitted, go ahead and plot the regression line using the second bit of code. We'll save the standard errors from the model for later on.


##fit a linear model to the data

fit.ols <- lm(aggdat$start_speed ~ aggdat$game_num)

summary(fit.ols)

pred.ols <- predict(fit.ols, aggdat, se=TRUE)


##plot the regression line on the data

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Game Number (2008-2011)", main="Pitch Speed by Sequential Game")

lines(pred.ols$fit, lty="solid", col="darkgreen", lwd=3)


We see the negative slope in the plot above, but do you think this is the best representation of the data? Using a loess regression, we may be able to improve on this. The drawback of loess is that there isn't really a clean functional form like we get from OLS. That means no real 'coefficients' in the nice Y = mX + b form we learned in algebra class. For the most part, the best way to use loess is to look at it.

So, to fit a loess regression, we'll go ahead and stick with the game average data for now. Using the "loess()" function (doesn't get much easier than that!), we can apply the new analytical tool to our data. Let's try some basic code first. Below, I estimate the loess using a default smoothing parameter and then predict values and plot it just like with the OLS model. Only this time, it looks a bit different.

##loess default estimation

fitd <- loess(aggdat$start_speed ~ aggdat$game_num)

my.count <- seq(from=1, to=101, by=1)

predd <- predict(fitd, my.count, se=TRUE)

plot(aggdat$start_speed ~ aggdat$game_num, pch=16, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Pitch Count", main="Pitch Speed by Pitch Count")

lines(predd$fit, lty="solid", col="darkred", lwd=3)


In the code above, I kept things pretty basic. Usually, we want to use the argument "span=" in order to tell R how much smoothing we want. The larger the span, the more points that are included in the weighted estimation, and the smoother the plot will look. Something to be careful with here, though, is making the span too small. We don't want to over fit the data, but want it to give us some idea of the pattern that we see. If we fit every point on its own, we may as well just look at a scatterplot. On the other hand, if we smooth too much, we may as well just estimate an OLS regression. The idea is to find a balance between the two using the smoothing parameter.

You can also identify a polynomial, which allows for more 'wigglyness' in your loess. For our purposes, I'm not going to bother with this. However, if you are fitting something that you believe needs some serious wiggles, go ahead and fiddle around with the "degree=" argument in the "loess()" function. For the most part, I would not recommend going over 3 for the polynomial as you'll likely be bordering on over-fitting--but some data might well need further polynomials. The default in R for this function is "degree=2", and you can change it to "degree=1" if you like and you'll see your wigglyness--for the same span--decrease a bit. It all depends on your data. For the purposes of this post, we'll just 'eyeball it'. However, there are other ways to optimize smoothing parameters in loess and other smoothing (and density estimation--see next week) methods. These include "rules of thumb", cross-validation methods, and so on. The default span in this function in R is 0.75, but it should really depend on your data.
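If you want to see the degree= effect for yourself, here is a quick sketch you can run (I'm keeping the default span of 0.75 and only changing the degree):

##same data, locally linear (degree=1) instead of the default locally quadratic

fit.d1 <- loess(aggdat$start_speed ~ aggdat$game_num, degree=1)

pred.d1 <- predict(fit.d1, my.count, se=TRUE)

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Sequential Game (2008-2011)", main="Default Span, Degree=1")

lines(pred.d1$fit, lty="solid", col="darkred", lwd=3)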

Now go ahead and add a parameter in your loess code. For this, just include the additional argument "span=" within the "loess()" function. Play around with it and see what happens. I'd say just work with values between 0.1 and 1.0. The code might look something like this:

##fiddling with the span

fit3 <- loess(aggdat$start_speed ~ aggdat$game_num, span=0.3)

pred3 <- predict(fit3, my.count, se=TRUE)

plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Sequential Game (2008-2011)", main="Pitch Speed by Pitch Count (Span=0.3)")

lines(pred3$fit, lty="solid", col="darkred", lwd=3)


Below I've embedded a quick video that shows how the loess line changes when we increase the span by 0.1 each time. I won't provide the code for the movie (it's just a repeat of the same code over and over, and I made the movie in Windows Movie Maker with still images). In the future, I'll be sure to get into 'for loops' to generate multiple versions of the same plot while incrementally changing a given parameter, but that's advanced for this post. Watch through the video and try to pick out what you think is the best representation of the data that does not over fit (over-wiggle).



You might be saying to yourself, "Well, it looks like the span of 0.6 smooths the best." If you said that, then I'd agree. Of course, if you disagree, that doesn't make you wrong. The default looks pretty good as well (with a span of 0.75). Remember we want to get to a balance of fit and smooth and we're just eyeballing it. While it's somewhat subjective in this case, I imagine that we would all come somewhere near a consensus on the range of acceptable smoothing.

You may also be saying to yourself, "How do I get those cool intervals?" If that's the case, then you're in luck. When we used the 'predict()' function earlier, I made sure to tell R to keep the standard errors from the loess. This allows us to plot a standard 95% Interval on our plot. We'll have to make use of some new functions here. First, let's create two sequenced data sets for our predictions and interval construction:

##interval construction stuff

my.count <- seq(from=1, to=101, by=1)
my.count.rev <- order(my.count, decreasing=TRUE)

The second line simply reverses the order of the first. So each of these are vectors of the x-variable (game number) in increasing and decreasing order, respectively. From here, we can go ahead and re-plot our span=0.6 version of the loess and add dashed lines for the confidence intervals using some basic math. After this, we'll add some fill color and apply what we learned about the RGB color scheme in the Bubble Plots tutorial.

##fit the span=0.6 model

fit6 <- loess(aggdat$start_speed ~ aggdat$game_num, span=0.6)

pred6 <- predict(fit6, my.count, se=TRUE)


##now plot it


plot(aggdat$start_speed ~ aggdat$game_num, ylab="Speed out of Hand (Fastballs, MPH)", xlab="Sequential Game (2008-2011)", main="Pitch Speed by Pitch Count (Span=0.6)")

lines(pred6$fit, lty="solid", col="darkred", lwd=3)



##now add the confidence interval lines


lines(pred6$fit-1.96*pred6$se.fit, lty="dashed", col="blue", lwd=1)
lines(pred6$fit+1.96*pred6$se.fit, lty="dashed", col="blue", lwd=1)



You can see here that we use 1.96 (the standard normal quantile) to build an approximate 95% interval. In the plot above, the interval is represented by the blue dashed lines. However, I really like the filled interval look. For this, we'll need to use the "polygon()" function and the code below.

The first portion of the code below tells R that we want to create the outline of a polygon on the y-axis, using the confidence bounds at each point along the two vectors we created above. The second line just recreates the x-axis, but in increasing then decreasing form, to get full coverage of the shape. Finally, we create a shape using the confidence bounds and fill it with a transparent color (otherwise it will cover up everything). The #00009933 indicates that we want it completely blue (99 in digits 5 and 6), with some transparency (33 in digits 7 and 8). As long as your plot is still up, this code will simply add the shaded interval on top of it.

###create polygon bounds

y.polygon.6 <- c((pred6$fit+1.96*pred6$se.fit)[my.count], (pred6$fit-1.96*pred6$se.fit)[my.count.rev])

x.polygon <- c(my.count, my.count.rev)


##add this to the plot

polygon(x.polygon, y.polygon.6, col="#00009933", border=NA)

Notice how the interval bands flare out at the ends. This is because there is less data at the endpoints (i.e. nothing beyond the endpoints) of the data, so there is less certainty about the prediction there. This is a common problem in any regression (including linear regression), but it is exacerbated in smoothing because a single point at the edge of the distribution can have a large influence on the direction of the endpoints of the smoothed line. Just something to be aware of.

Lastly, if you're feeling lazy, you can always just use the "scatter.smooth()" function. This will automatically plot your loess, and takes the same arguments as "loess()". However, here you simultaneously provide the plotting parameters. See the code and plot below:

###show how scatter smooth just does the plot automatically

scatter.smooth(aggdat$start_speed ~ aggdat$game_num, degree=2, span=0.6, col="red", main="Scatter Smooth Version (Span=0.6)", xlab="Sequential Game Number", ylab="Starting Speed (Fastballs, MPH)")


The above plot isn't as pleasing to me as the ones I made manually. In general, doing things manually gives you more control over the look of things, but if you're looking for something quick, "scatter.smooth()" does just fine. One thing to remember, however, is that it has a different default for the polynomial degree than the "loess()" function, so if you want the same fit, you'll have to tell R that "degree=2".

There are other options in loess that I haven't covered today, including the ability to parametrically estimate some variables, while applying the loess function to others. This can come in handy if you think only some variables are non-linear, while others are linear in nature. If this interests you, definitely check into it. Other packages, like mgcv, allow for similar model types using slightly different smoothers and an extension of the smoothing function to Generalized Linear Models (binomial response, etc.). Hopefully this will give you a nice base to work with loess regression in your own work, but keep in mind that these tutorials are not a replacement for understanding the underlying mathematics that create the pretty pictures. Loess can be terribly misused in the wrong hands (especially with pitch-location smoothing), so it is important to understand WHY you are doing certain things, not just HOW to do it in R.
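For the curious, a minimal sketch of the mgcv route mentioned above, applied to the same game-level data, might look something like this (just a taste of the syntax; here the smoothing is done by mgcv's default thin-plate spline rather than loess):

##rough mgcv analogue of the loess fit (sketch only)

library(mgcv)

gam.fit <- gam(start_speed ~ s(game_num), data=aggdat)

summary(gam.fit)

plot(gam.fit, shade=TRUE, main="GAM Smooth of Fastball Speed by Game")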

I don't currently have Pretty-R code up and running, as the Blogger HTML really effs with embedding it in here. All of the necessary code is included above (and remember if you want to save plots and pictures, use a graphics device like "png()").

Thursday, May 5, 2011

sab-R-metrics: Logistic Regression

It's been a while since my last sab-R-metrics post, and I have not gotten to the real fun stuff yet. I apologize for the long layoff, and it's likely that these will be sparse for the next couple weeks. I have had some consulting opportunities come up, I've got 6 (possibly 7) presentations or co-authored presentations coming up this summer, I had to finish up a dissertation proposal and final exam, and apparently I have to start thinking about getting on the job market. Yeesh!

Today, I'll continue with logistic regression, but first be sure to check out the last few posts on Simple OLS, Multiple Regression, and Scatterplot Matrices. We'll be using some of the lessons from these posts for today's tutorial.

Logistic regression is a generalization of linear regression that allows us to model probabilities of binomial events. Normally in OLS we want to have a dependent variable that is continuous and--optimally--normally distributed. Without getting too in-depth with continuous vs. discrete measures (technically anything we measure is discrete once we measure it and round it even to the umptillionth decimal), it's pretty easy to tell that a dependent variable with 0 representing a "No" and a 1 representing a "Yes" response is not continuous or normally distributed. In this sort of application, we are often interested in probability and how certain variables affect that probability of the Yes or No answer. Since probability must be between 0 and 1, we need to bound our data at these limits. Otherwise you'll get probability predictions above 1 and below 0.

For the most part, using OLS as a quick and dirty way to estimate a binomial variable is reasonable. And for observations that are not near 0 or 1, it can work just fine. But with the computing power that comes free in R, there is little reason not to fit a Generalized Linear Model to the binomial response (unless, of course, you have a very large data set). So today, we'll again use the data "hallhitters2.csv", which you can download here. What is more fun than trying to predict Hall of Fame induction? We'll keep things simple today, but it is important to note that this isn't any sort of new idea. Cyril Morong, among others, has used logistic regression for Hall of Fame prediction in the past.

For this exercise, I am going to assume that BBWAA writers only pay attention to traditional statistics. That means we'll only include Home Runs, Runs, Runs Batted In, Hits, Batting Average and Stolen Bases in our regression. Obviously this will be limited, and doesn't account for fielding, 'integrity' or whatever else the BBWAA claims to take into account. But it should be plenty good enough to get the point across.

Let's jump right in. Go ahead and load up the data. I'm going to name mine 'hall':

##set working directory and load data

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

hall <- read.csv(file="hallhitters2.csv", h=T)

head(hall)

Alright, let's first run a standard OLS model on the data, followed by the logistic regression to compare. I have included a variable called "BBWAA" in the data so that we can use this as our response variable. If a player is voted into the Hall by the BBWAA, they are given a '1', otherwise the variable will be '0'. For logistic regression, we'll use a new R function called "glm", which of course stands for Generalized Linear Models. This function can be used for a number of linear model generalizations and link functions (such as Poisson regression or using the Probit link function for binomial data). We'll stick with the binomial version and logistic regression here.

##start with an OLS version and predict hall of fame probability

fit.ols <- lm(BBWAA ~ H + HR + RBI + SB + R + BA, data=hall)


summary(fit.ols)


##fit logistic regression


fit.logit <- glm(BBWAA ~ H + HR + RBI + SB + R + BA, data=hall, family=binomial(link="logit"))


summary(fit.logit)


Notice the slight difference in coding, as we have to tell R that the GLM we want to use is for binomial data with a 'logit link function'. We'll revisit the OLS model later on, but for now let's focus on the logistic regression coefficients. These are a bit different from the coefficients of our linear models from before. Just reading the coefficients off the table as the increase in probability of induction won't work. The coefficients must be transformed. The logistic regression uses a transformation of the predictors, 'x', in order to model the probability on a 0 to 1 scale. The transformation used here is:

p = InverseLogit(XB) = e^(XB) / (1 + e^(XB))

or, equivalently,

Logit(p) = log(p / (1 - p)) = XB

Here, B represents the coefficients and X our predictor variables. So when we look at the coefficients, we have to transform them in order to make sense of what we're seeing in the regression output. To get a feel for the effect of a single predictor, we can exponentiate its coefficient (keeping in mind that the full prediction also involves the intercept and the other variables). Let's check out the association of Home Runs and the probability of induction into the Hall of Fame in as simple a way as possible:

##exponentiate the HR coefficient to get the multiplicative change in induction odds per HR

exp(0.002153)


This comes out to about a 0.215% increase in the odds of induction per home run. So, for a coefficient this small, we can pretty much read the effect directly off the table for players around the average home run total. However, it is important to understand that this isn't a coefficient like in OLS. If we simply multiplied this change by Hank Aaron's 755 home runs, we would get an induction probability for him of about 162%. Remember that in a model like this, the influence of the HR total shrinks as the probability of the player reaching the Hall gets closer and closer to 1. Otherwise, we'd end up with probabilities above 1 and below 0. For a good primer on interpreting coefficients from a logistic regression model, check out this site through the UCLA stats department.
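In R terms, the inverse logit is easy to write yourself, and the built-in "plogis()" function does the same thing. This is just a quick sketch of the transformation, separate from the model fit:

##the inverse logit transformation in R

invlogit <- function(x){ exp(x)/(1 + exp(x)) }

invlogit(0)   ##a linear predictor of 0 maps to a probability of 0.5

plogis(0)     ##built-in equivalent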

Now, it's interesting to look at coefficients, but because of the collinearity issues, they may not be all that useful for interpretation. Multicollinearity is the likely culprit for the coefficient on Hits being negative (recall our scatterplot matrix). It would probably be optimal to create some sort of orthogonal set of predictors with Principal Components Analysis (PCA) or something of the sort (which I plan to get to in a later sab-R-metrics post). But for now, we'll just use the logistic regression for prediction.
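If you want a head start on that, the principal components of the traditional counting stats are one line in base R; this is only a preview, and the column names are the same ones used in the model formula above:

##preview: principal components of the traditional stats (scaled first)

hall.pca <- prcomp(na.omit(hall[, c("H", "HR", "RBI", "SB", "R", "BA")]), scale.=TRUE)

summary(hall.pca)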

However, another valuable part of logistic regression is its prediction of the probability of a 'success' (here, induction into the Hall of Fame by the BBWAA). Because the model works in log-odds form, we need to use type="response" when predicting in order to get actual probabilities (0 to 1). Go ahead and predict the probability of induction using both the OLS and the logistic model, and attach these predictions to the end of your data set with the following code:


##predict values and append to data

hall$ols.pred <- predict(fit.ols, hall, type="response")



##predict the probability of induction using the logit model

hall$log.pred <- predict(fit.logit, hall, type="response")


head(hall)

From here, it can be easier to look at the output if we create a new table in Excel using this data. Of course, you could copy and paste into Excel if you want, but there's a nice easy option in R to create a table. Keeping with the CSV format I like to work with, we can create a new table in our working directory from the newly updated data set with Hall of Fame probabilities with the following code:


##create table for looking at in Excel

write.table(hall, file="hallhittersPRED.csv", row.names=F, sep=",")


This rather convenient function comes in handy for me very often. In the function above, you tell R that you want to write a new file using the object "hall". Then you name the file. Here, be sure to put the .csv extension on the end. You could also do .txt, and there are of course other options. I use the 'row.names=F' to tell R that I don't need the row names in the new file, but there are cases when this makes sense to allow (i.e. when you are writing a table from the output of the 'tapply()' function). Finally, we tell R what to use as a separator with "sep=". Be sure to put this in quotes. You can use anything you want, but the safest bet is to use commas (assuming your vectors don't have commas in them, but they should have double quotes protecting the fields anyway). I usually use write.table so I have full control over the file; however, you can also use the simple code below and get the same result:


###using write.csv now

write.csv(hall, file="hallhittersPRED.csv", row.names=F)


Go to your working directory and open up the CSV file in Excel. Now, inspect the predicted probabilities of the logit model first. They'll be nice and easy to sort in Excel.

The logit probabilities seem perfectly reasonable, with the likes of Rickey Henderson, Cal Ripken and Tony Gwynn about as near sure candidates as you can get. On the other hand, we see that Kirby Puckett doesn't get as high of a probability. We know why though: Puckett's career was cut short. With our knowledge of baseball at hand, these probabilities mostly seem to make sense.

However, looking at the OLS model, there are some strange things going on. The model doesn't particularly like Ripken or Gwynn. If you go ahead and sort your data by the predicted OLS probabilities, you'll see that there are a number of negative induction probabilities for guys like Greg Pryor and Paul Casanova. We can just round these to zero, but optimally the logit model gives us a better look at what is going on in the data.

As before, it's sometimes useful to look at a plot of our data to see how our predictions fit the true outcomes. However, since we will be comparing 0-1 binomial data to a continuous probability prediction, we'll have to employ some new things in R for this visualization. Luckily, I recently found a very handy function in the "popbio" library for this. The first thing you'll need to do is install this package (I went over how to do this before). Next, use the following code to create a plot with the "logi.hist.plot" function:


##create plot of logit predictions

library(popbio)

logi.hist.plot(hall$log.pred, hall$BBWAA, boxp=F, type="hist", col="gray", rug=T, xlab="Hall of Fame Status", mainlabel="Plot of Logistic Regression for BBWAA Inductions")


##add the names of mis-classified HOFers


low.prob <- subset(hall, hall$log.pred < .20 & hall$BBWAA==1)

text(low.prob$BBWAA, low.prob$log.pred, low.prob$last, col="darkred", cex=.8)



As you can see, the second portion of the code adds the names of the Hall of Famers who were predicted to have a very low probability of induction. You can see that the probability of making the hall of fame increases as we get closer to the "1" classification. In the middle, the probability of induction changes at the highest rate, while it tails off at each end in order to bound it between 0 and 1.


We can inspect these and, for the most part, understand why our model misses these guys. Jackie Robinson and Roy Campanella had shorter careers due to segregation and catastrophic injury, respectively. We didn't include fielding data in our model, so it's no surprise that Ozzie Smith was predicted so low. Guys like Kiner, Boudreau, Carter and Aparicio are less clear. But perhaps these were just inconsistencies in the voting by BBWAA members. Many would argue that these players don't really belong in the HOF anyway.

Now that we have this nice looking picture, how do we really evaluate the usefulness of our model? The most common way to evaluate a model predicting binomial data is an ROC curve and its AUC ("Area Under the Curve"). This benchmarks the performance of your model in predicting the correct class (i.e. 0 or 1). If you predict perfectly, the area under the curve will be 1. If you randomly choose 0 or 1, then you're looking at 0.5 (i.e. a 45-degree line in the plot seen below). Usually, we'll see some sort of curve between 0.5 and 1, but the goal is to get as close to 1 as possible (without over-fitting, of course).

For this, go ahead and install the package 'ROCR' onto your computer. We'll make use of this. Using the code below, you can create the ROC curve and also compute the AUC:

##############ROC Curve

library(ROCR)

preds <- prediction(as.numeric(hall$log.pred), as.numeric(hall$BBWAA))


perf <- performance(preds, "tpr", "fpr")


perf2 <- performance(preds, "auc")

perf2



plot(perf, main="ROC Curve for Hall of Fame Predictions", lwd=2, col="darkgreen")

text(0.5, 0.5, "AUC = 0.9855472", cex=2.2, col="darkred")

The first portion of this code sets up the predictions and the true classes. The second line creates performance vectors using the "tpr" and "fpr" commands, which refer to "True Positive Rate" and "False Positive Rate", respectively. Then we plot them against one another and get the following:


Finally, the "perf2" object calculates the AUC for our model, which we find is a solid 0.986 or so. Of course, this is enhanced by the fact that there are so many easy decisions on non-Hall of Famers. If we included only "borderline candidates" for our AUC calculation, we would get a much lower number. Whether or not you believe it is appropriate to report such a high AUC produced by this fact is another question, but keep in mind that the 'accuracy' may be a bit misleading in this case. We don't need a predictive model to tell us that Joe Schmoe isn't going to be inducted into the Hall of Fame.

Finally, if we've decided that this model is accurate enough, we can apply this model to current players. For this, we'll use the file "currenthitters.csv" from my website here. For this, we want to be sure we have the same variable names for the same variables so that the new code knows what to call from the original logistic regression model. Here's the code below to load in the data, create the predictions, attach them to the data, and write a new table with the added predicted induction probabilities.

##get current hitter data and check variable names

current <- read.csv(file="currenthitters.csv", h=T)


head(current)


##predict probabilities


current$hall_prob <- predict(fit.logit, current, type="response")


head(current)


##write a new file with the induction probabilities


write.table(current, file="currentPRED.csv", row.names=F, sep=",")



Then go ahead and open up the file for easier inspection (or just snoop around in R if you want).
If you want to order them by induction probability in R directly, then use the following code:


###order by most likely to get into hall of fame

current <- current[order(-current$hall_prob),]
current[1:35,]


You can use the ordering code for any variable you want, and I use the "-" sign within the parentheses in order to tell R that I want things in descending order. The default is ascending order. The second line of code just calls the first 35 rows of the data.

You can see that Alex Rodriguez is the most likely in the sample of active players, with Ken Griffey, Jr. just behind him. Most of these make sense. Some are likely not right, like Luis Gonzalez and Bobby Abreu. Others, we know will have problems getting into the Hall based on steroid accusations, which we did not include in our model.

Some players are surprises here, but this is likely because they just don't have enough at-bats yet to hit the milestones for the Hall of Fame. Keep in mind that the data I've provided only run through 2009, so Ichiro has fewer than 2,000 hits. Perhaps if we modeled each HOF player's career at each given point in time, we could use this to predict guys who haven't been around very long. But that's quite a long and complicated process, and the current model is as far as I'll take today's lesson. Finally, keep in mind that we are predicting induction probabilities based on past BBWAA inductions, NOT whether or not the player deserves to be in the Hall of Fame. This is a discussion left for another day.

So there you have it: predicting Hall of Fame induction and logistic regression in R. Remember that the 'glm()' function can be extended to other models like Probit and Poisson. Below is the code from Pretty R as usual.

CODE: (HTML IS BEING FUNKY, I WILL TRY TO FIX THIS ISSUE SOON!)