Wednesday, June 22, 2011

sab-R-metrics: Merging Data Sets

I am finally back from Greece and recovered from jet lag. Fortunately, I did not get tear gassed while in Athens, though there were riot police everywhere the whole time we visited. Today, I'm going to start getting my feet wet again with a shorter sab-R-metrics post to assure everyone I'm not too MIA.

Often times we have lots of data in different files that we want to link together. If you have the information in an SQL database, there are ways to match things up using R. However, I am no database management wizard and prefer to be able to look at my data in a full table format. Unfortunately, this causes problems when I want to make sure to have player names linked to the player ids in my Pitch F/X data. The issue is that the F/X data may have multiple instances or rows with the same player, while the player information file only has player ids and player names once (one per row). Doing this manually can take forever (sometimes almost literally), and we need a quick way to import player names to the correct rows. Pitch F/X tools like Joe Lefkowitz's already do this for you; however, if you have your own F/X database--or any other data with player ids that you would like to merge some data into--this tutorial should come in handy.

Luckily, R has a nice function, 'merge()', which allows for easy merging of files. While I used to use SPSS to do this, once I found the R version I'll never go back. The SPSS version is pretty handy, but extremely slow for large files and the software is outrageously expensive.

First, I want you to download a file of 5,000 pitches here. Once you have it in the correct place, load it into R and take a look at it.

#set working directory
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

##load pitch file
pitches <- read.csv(file="PitchesMerging.csv", h=T)

As you can see, there are no player names in this file. While you could go through and add them in manually--say in Excel or something like that--this would take way too long. To get an idea of the number of names to be imported just for this small pitch file, use the following code:

##give an idea of hte amount of work that manually merging would take

The first line of code above tells us the number of rows in the data set--or the length of the first column in the data. This comes in handy to make sure R loaded the number of rows you expected to see. The second line of code again uses the 'length()' function, but adds a new function we have not seen yet: 'unique()'. What this does is tells us how many different/unique batter ids there are in the data set. The third line of code does the same for pitcher ids. You can also use the 'unique()' function on its own, and R will print each of the player ids within the data file (you could also assign this list or vector as an object using the assignment operator '<-'). Unique will come in handy when we get into more advanced "for loops" later on.

As you can see, there are 286 unique batter ids and 113 unique pitcher ids. In addition, there are many repeats, as there are 5,000 observations in the data file. Doing this manually would take forever. Luckily, I have a file with the player ids, the player names, player height and weight, player birth dates, and the first year played in pro ball, MLB, and the last year played in MLB. We'll use R to easily merge this into our pitch file so that we can have player names and account for height and age of the player in our analyses using the pitch data.

First, go ahead and download the file with player names and some other information here. Stick that into the same directory as the previous file and load it into R. As always, take a look at the file to make sure it loaded correctly:

##load player information file
players <- read.csv(file="detailedplayers.csv", h=T)

Before doing any merging, we'll have to adjust some things with this file. For the 'merge()' function to work, you have to choose a variable that is contained in BOTH data sets to merge on. For our purposes, we'll use the id of the player. Unfortunately, the name of the variable is different in each file. This is an easy fix. While we're at it, it is probably a good idea to discriminate between the batter and pitcher names and information in the file, since both will be displayed in each row. So first thing is first...let's rename the variables. For this, we'll use another new function, 'colnames()'. The following code should rename everything the way we want, and we'll start by merging the new data for batters. Be sure not to omit the names of any columns or you will get an error:

##rename columns for batters
colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first", "b_mlb_played_first", "b_mlb_played_last")

Always check to be sure things went correctly. There is actually an option to do this automatically in the 'merge()' function as the command "suffix=". On data sets with a large number of columns, this can save you time. But I found this to be a good time to introduce the "colnames()" function.

Now we have two files with a similar variable to match on. It's time to use the 'merge()' function. The merge function asks first for a 'x' data set (the first one), and then a 'y' data set (the second one). It is important to remember what order you place them in the function, as you will also need to tell R that you want to keep all of the original pitches in this new merged data. To save space in R--once I know things are working right--I simply reassign the merged data set as the original name 'pitches'.

To ensure that R makes a data set using all the pitches in the file, we want to use the option "all.x=T" or "all.y=T". This will tell R that the players data are just a table being used for the pitch data, while we keep all the pitch data in tact in the new merged table. Finally, we need to tell R which variable to match on using by="batter_id". Be sure to put the variable name in quotes. The following code should do this for us:

##do merge for batters
pitches <- merge(pitches, players, by="batter_id", all.x=T)

Notice that it puts the "batter_id" variable in the first row of this new data set. That's okay, and you can always restructure your data if this bothers you. Now let's do the same for the pitchers in the pitch data. Don't forget to rename the variables in your player information table so that they don't overwrite the batter information, and also so that it matches on pitcher id, rather than batter id:

##rename columns for pitchers
colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first", "p_mlb_played_first", "p_mlb_played_last")

##do merge for pitchers
pitches <- merge(pitches, players, by="pitcher_id", all.x=T)

Now, looking at the data, my first row has the 69 inch, 180 pound Dustin Pedroia against a lanky 72 inch, 160 pound Miguel Bautista. For this pitch, Pedroia gets a hit. You can even double check that the players are correct by looking at the "ab_des" column, which gives a full description of what happened in the at bat. Sure enough, it says, "Dustin Pedroia singles on a line drive to left fielder Ryan Langerhans. J. Drew to 2nd.". Things seemed to have gone well here. Now, you can save the new file so you don't have to worry about merging again with the following code:

##write new table
write.csv(pitches, file="mergedpitches.csv", row.names=F)

Hopefully this will help out some of those looking to merge data together. There is much of this needed with the different data sets (pitch f/x, Retrosheet, Baseball Reference, etc.) around the web. You'll need a full mapping of all player ids. I got mine from the Universal ID Project, here is a link at The Book Blog for last year's version (I can't find the most recent link).

In the end, R's functionality here is better than any other program that I have come across. You always need to double check the data to make sure there aren't any bugs. This is especially true with even larger data. Ultimately, this can make life in R and baseball analytics about a million times easier--just be careful. There are a few things I didn't go over here (like having it automatically sort when merging), so you can always check out how to use the function yourself with the R command "help(merge)". Hope this helps!

Pretty R Code:
############################# ################Sidetrack for Merging of Data Tables #############################   #set working directory setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")   ##load pitch file pitches <- read.csv(file="PitchesMerging.csv", h=T) head(pitches)   ##give an idea of hte amount of work that manually merging would take length(pitches[,1]) length(unique(pitches$batter_id)) length(unique(pitches$pitcher_id))   ##load player information file players <- read.csv(file="detailedplayers.csv", h=T) head(players)   ##rename columns for batters colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first",  "b_mlb_played_first", "b_mlb_played_last") head(players)   ##do merge for batters pitches <- merge(pitches, players, by="batter_id", all.x=T) head(pitches)   ##rename columns for pitchers colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first",  "p_mlb_played_first", "p_mlb_played_last") head(players)   ##do merge for pitchers pitches <- merge(pitches, players, by="pitcher_id", all.x=T) head(pitches)     ##write new table write.csv(file="mergedpitches.csv", row.names=F)

Tuesday, June 7, 2011

Off to Greece

After being back in the U.S. for two days, I'm headed off to Athens tomorrow for another conference. It is rather small, but I couldn't pass up the chance for some practice at presenting publicly and, well, going to Greece! Though, I was relatively impressed with the Richmond Street nightlife in London, Ontario.

If you're in the area (not likely, but I know there are some international readers here), stop by. It will be at the St. George Lycabettus Hotel in Athens (they really know how to do it in Greece!). The conference is put on by ATINER and the general topic is Tourism. I am again presenting with Dr. Mark Rosentraub. Below is the title of the presentation:

Measuring the Local Economic Benefits of Regional Assets: Opportunity Costs and the Best Use of Land for Regional Development

That also means there probably won't be any sab-R-metrics articles up until after I get back (I'll return on June 16th). Hopefully I can get on a roll after that, as I only have one more conference to go to in the summer (Joint Statistical Meetings in August in Miami Beach--the Sport Sections are highly recommended for you sports guys).

Monday, June 6, 2011

A Sabermetrics Prediction Competition at Kaggle?

I ran across this post today at Big Computing (now on the sidebar). I've toyed around with the Kaggle competitions in the past, but haven't really been able to come up with serious competition beyond the basic data mining tools available in R. They work great, but there are some serious programmers that develop their own classifiers and prediction tools that outclass anything I can do (especially in my free time).

Anyway, there is a mention about a sabermetric prediction competition. I know there are plenty of people around here that would have a lot of fun with something like this. If you haven't been to Kaggle before, I highly suggest checking it out. They give out money for the top predictive techniques. They provide the training data, and a hold out test sample for the leader board. Most recently, there is a Heritage Health Prize, with the winner getting a multi-million dollar prize!

They're asking for suggestions, and here is mine:

"I really think using FX data would be a good road for this. For the most fun, it may be interesting to predict whether a single pitch is made contact with or not, given the game state, type of pitch, count, the opposing batter abilities, pitcher ability, velocity, location, etc."

Any other thoughts?

Wednesday, June 1, 2011

Off to NASSM in Canada

Today I am off to the North American Society for Sport Management Conference in London, Ontario. While the group seems to have really missed the ball on the location (sorry guys and gals, but come on) I am excited to attend. This will be my first time at this conference and I am looking forward to the experience. If you are going to be there, I'd love to meet you. If you're in town, feel free to come on by my presentations. I have two, for which I have provided titles below (click the link for the abstract). The first presentation is Thursday at 8:30 am, while the second one is Friday at 8:45 am. Looks like I get to be the early bird all week...

Major League Baseball Franchise Attendance and the Uncertainty of Outcome Hypothesis
Brian Mills and Rodney Fort

Public Investment in Sports Facilities: Who Really Pays and the Implications for Progressive Taxation
Mark Rosentraub, Brian Mills, Michael Cantor and Jason Winfree

I will be presenting the first one on my own, while fellow graduate student Michael Cantor and I will be presenting the second one. Also look for these from our department (and one from a graduate of our department currently working at Illinois, Scott Tainsky):

Effects of Personal Involvement and Expert Information on Fantasy Sports Consumers’ Winning Expectancy and Anticipated Emotion
Dae Hee Kwak and Joon Sung Lee

Demand for Individual Sports: Estimating Pay-Per-View Buyrates for the Ultimate Fighting Championship
Scott Tainsky, Steve Salaga and Carla Santos

Gratitude toward Sponsors: Conceptual Framework and Empirical Examination
Yu Kyoum Kim, Robert Smith and Dae Hee Kwak

NCAA Football and the Invariance Proposition
Steve Salaga and Rodney Fort

Athlete Philanthropy: Motives, Drivers and Intentions
Seung Pil Lee, Kathryn Heinze, Kathy Babiak and Matt Juravich

Measuring the Contribution of Sport to Society: Social Capital, Collective Identities, Health Literacy, Well-being and Human Capital
Seung Pil Lee, T. Bettina Cornwell and Kathy Babiak

Quite a showing from our growing department!