For this series, I began with something pretty basic: loading in your data. While trivial for some, these sorts of things are not so easy for those not familiar with statistical programming. One thing I forgot to mention is that R now has a Data Editor view. Just go up to "Edit -> Data Editor". It will prompt you for the name of the data, and a spreadsheet will open up. To edit anything, just double-click the cell. I would advise against editing data this way; do your manual editing in Excel instead, so you always have a saved record of the file with the correct changes. But if you are more comfortable viewing the data in a spreadsheet format like this, it may be a nice tool to have. R is becoming more user-friendly, with GUIs and Revolution out there for those not comfortable with a programming language (note that Revolution charges unless it is for academic use). But using the command line will give you much more flexibility.

Hopefully, we can get further down the R learning curve in the next couple posts and really apply R to fun problems in baseball analysis. But my hope is that these posts will cater to everyone new to--and experienced with--R. Today I will focus on calling and creating variables and vectors.

Go ahead and open up R as well as the script you saved from the last post. Your script should look something like this:

Go ahead and highlight the entire script and press "CTRL + R" to load the data and get back to where we left off last time. One thing I forgot to mention in the last post is that if you are having trouble with any R function, you can always type "help()" with the function name in parentheses, for example "help(mean)". A help file will open up in your browser that tells you what each argument of the function does. At first, the help files are not straightforward to read, but the more you use R, the more helpful they will become. Eventually, you'll know exactly what to look at when you are having trouble with a certain function.

The first thing I'd like to go through is understanding how to locate certain cells (for example, Hank Aaron's career home run total). If you look at my spreadsheet from last time, we see that Hank Aaron is the first row of our actual data (R knows that the top row is headers, so in R, Hank Aaron is Row 1). There are a couple ways to get the information for Hank Aaron. First, I will assume that all the variable names are lower case and spelled out for easy readability here (for example, "HR" will be "homeruns" and "2B" will be "doubles" from now on). If we are interested in only looking at Hank Aaron's data, we can do the following:

#Hank Aaron's career statistics

hitters[1,]

This calls our first row in the data. If you just type this and press CTRL + R, then it should show only Hank Aaron's row of data in your command window. So, looking at the code above, we see that using the brackets "[ , ]" with a number before the comma calls that row in the matrix. If you remember matrix algebra from high school, you may remember some of this sort of notation. Remember "[rows, columns]". What if we're interested in home runs, irrespective of player? We would do something similar, but with columns:

#all players' home runs

hitters[,15]

We use "15" because the home run column is the 15th column in the data. If we just want Hank Aaron's home runs, we would simply type:

#Hank Aaron home run total

hitters[1,15]

This will just print out a number that we know is Aaron's. While the rows are numbered, counting across our data matrix is not always efficient, so we can call a column another way:

#calling column vector using variable name

hitters$homeruns

And, calling just Hank Aaron's home run total using the variable name:

#calling Hank Aaron's home runs using variable name

hitters[1,]$homeruns
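If you want to try these calls without the spreadsheet from the last post, here is a minimal sketch on a tiny stand-in table. The object "toy" and its three columns are assumptions for illustration only (they are not the real "hitters" file), so the home run column sits in position 3 here rather than 15.

```r
# Hypothetical stand-in for the "hitters" data: three rows, three columns
toy <- data.frame(
  first    = c("Hank", "Babe", "Willie"),
  last     = c("Aaron", "Ruth", "Mays"),
  homeruns = c(755, 714, 660)
)

toy[1, ]            # first row: every column for Hank Aaron
toy[, 3]            # third column: every player's home runs
toy[1, 3]           # single cell: row 1, column 3
toy$homeruns        # the same column, called by name
toy[1, ]$homeruns   # row 1 first, then the column by name
toy[1, "homeruns"]  # row by number, column by name inside the brackets
```

All of the single-cell calls return the same number; which one you use is mostly a matter of taste and of whether you know the column's position or its name.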

And I'll get into this a little later with data subsetting (next time), but you can also call Hank's home runs using his first and last name, providing it is in the dataset:

#calling Hank Aaron's home runs using his name

hitters$homeruns[hitters$first=="Hank" & hitters$last=="Aaron"]

In the calls above, "hitters$homeruns" tells R that we want to use the data set called "hitters" and the variable "homeruns". The "$" sign goes between the data set name and the column vector (variable) that you want to use. You can do this for any of the variables in your data set. Unfortunately, just listing all the home runs (as in the first of these calls) isn't very useful. We're not sure which home run total belongs to which player! Maybe we want to create a new data set with the player names and home runs only. We have to use an assignment operator (<-) for this:

#create data with HR and Player Names only and name it "homer.totals"

homer.totals <- hitters[,c(1, 2, 15)]

head(homer.totals)

By using the "head()" command, we can check to be sure our new data matrix was created correctly. Did it work? If not, double check your code and try again. We can also use the variable names:

#create data with HR and Player Names only and name it "homer.totals.b"

homer.totals.b <- hitters[,c("first", "last", "homeruns")]

head(homer.totals.b)
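As a quick check that the two approaches give the same result, here is a small sketch on a made-up table. The object "toy" and its column names are assumptions for illustration, not the real "hitters" file. It also shows a name-based lookup written so that it works without attaching the data.

```r
# Hypothetical stand-in for "hitters" with one extra column in the middle
toy <- data.frame(
  first    = c("Hank", "Babe", "Willie"),
  last     = c("Aaron", "Ruth", "Mays"),
  bats     = c("R", "L", "R"),
  homeruns = c(755, 714, 660)
)

# Keep the same three columns, once by position and once by name
by.position <- toy[, c(1, 2, 4)]
by.name     <- toy[, c("first", "last", "homeruns")]

# TRUE when both subsets hold exactly the same data
same.result <- identical(by.position, by.name)
same.result

# Pick out one player's home runs with a logical test on the name columns
aaron.hr <- toy$homeruns[toy$first == "Hank" & toy$last == "Aaron"]
aaron.hr
```

Calling columns by name is a little more typing, but it keeps working if the column order of your spreadsheet changes later.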

********SIDETRACK********

Now for a bit of a sidetrack. Usually, we like to work with data that comes from a nice and pretty data table made in Excel. But, what if we want to create our own data vector? In the future, we'll need to do this for plotting and simulations, but I'll keep it simple for now. To create a single vector, we can use the function "c()". Let's say we'd like to just make a list of numbers of errors made in 5 random Yankees games in the 2010 regular season (I'll just make them up for now):

#####SIDETRACK

#creating vector of errors

yanks.err <- c(1,2,0,3,0)

yanks.err

This will look just like a list of numbers as a row. Now, let's also create a game number vector:

#creating vector of game numbers

game.num <- c(1, 45, 12, 16, 78)

game.num

Again, we see another list of numbers. If we want these matched up as columns, we'll need to use the function "cbind()". Similarly, "rbind()" will append new rows to the end of an existing dataset (assuming the same number of columns in each). When you do this, you need to be sure that you enter both vectors in the correct order, or they will not match up when they are bound together as columns.

#binding game number and yankees errors

dat.a <- cbind(game.num, yanks.err)

dat.a

Now the data is presented as two columns with 5 rows and the variable names at the top. R stores this as a matrix and, just as with your data table, you can use the brackets to call columns and rows. However, a matrix made with this simple binding procedure does not let you call the variable names at the top with the "$" sign. You can check this by running this code:

#checking if variable calls work for new data

dat.a$game.num

You will get an error that says you cannot do this for 'atomic vectors'. What to do? This is an easy fix. We just tell R that our new matrix is a data frame using the "as.data.frame()" function.

#making bound columns into a data frame

dat.a <- as.data.frame(dat.a)

dat.a

dat.a$game.num

dat.a$yanks.err

Hopefully, this worked out just fine for you and when you use this code, there is again a list for each of the variables. There are a number of other ways to do this, but I find this to be the easiest. Later on, I'll get to the "seq()" function that creates a sequence of numbers for applying a model to. This is used heavily when creating heat maps for Pitch F/X data.
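One of those other ways, sketched here as an alternative rather than a replacement for the steps above, is to skip the matrix and build the data frame in one step with "data.frame()". The object names "dat.m" and "dat.d" and the extra game appended with "rbind()" are made up for illustration.

```r
game.num  <- c(1, 45, 12, 16, 78)
yanks.err <- c(1, 2, 0, 3, 0)

# cbind() gives a matrix: brackets work, the "$" sign does not
dat.m <- cbind(game.num, yanks.err)
is.matrix(dat.m)
dat.m[, "game.num"]

# data.frame() gives a data frame directly, so "$" works right away
dat.d <- data.frame(game.num, yanks.err)
dat.d$yanks.err

# rbind() appends a row; the new row needs the same column names
new.game <- data.frame(game.num = 100, yanks.err = 1)  # made-up game
dat.d <- rbind(dat.d, new.game)
nrow(dat.d)
```

The result is the same kind of object you get from "as.data.frame()", just with one fewer step.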

*******END SIDE TRACK*******

Now that we know how to call variables from our data, I'm going to introduce you to a function that can do the following:

1) Makes it much easier to call variables

2) Causes many, many problems when using multiple data sets

I sometimes use the function "attach()". This allows you to call variable names without typing the data set name and the "$" sign. That way, you can get things done quicker with fewer keystrokes. So what's the problem? Well, if you have more than one data set in R, especially ones with the same variable names (say, a Pitch F/X data set for Adam Lind ("lind") and one for Vladimir Guerrero ("vlad")), a bare variable name can only point to one of them at a time: the most recently attached data set masks the other. If you are going back and forth with "attach()", it is easy to lose track of which data you are actually using, and things can go wrong. I recommend NOT using the "attach()" function. But if you wish, here is all you'll need to do:

#using the attach function for the data (BAD IDEA!)

attach(hitters)

homeruns

Using the above, all you have to do is type the variable name to get the full list of home run totals. Sometimes in my code I can't get a function to work correctly unless I use attach. That's probably my own insufficient R knowledge creeping up in some situations. Just remember to use extreme caution whenever using this function.
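To see the kind of mix-up that can happen, here is a small sketch with two made-up data sets that share a variable name. The objects "lind" and "vlad" below hold invented pitch speeds, not real Pitch F/X data. The sketch also shows "with()", a base R function that gives you the short variable names for a single line without attaching anything; treat it as one possible alternative, not the only one.

```r
# Two hypothetical data sets with the same variable name
lind <- data.frame(speed = c(90, 92))
vlad <- data.frame(speed = c(85, 88))

attach(lind)
attach(vlad)    # R reports that "speed" from lind is now masked
speed           # these are vlad's values, not lind's

# Clean up so nothing stays attached
detach(vlad)
detach(lind)

# with() evaluates one expression inside the named data set
lind.mean <- with(lind, mean(speed))
vlad.mean <- with(vlad, mean(speed))
lind.mean
vlad.mean
```

Because the data set is named on every line, there is no doubt about which "speed" you are averaging.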

The last thing I'll show you today is how to create a new variable in your data. R works just like a super-duper calculator. So, if you want to use simple math functions, just type it in:

#adding

7 + 8

#subtracting

7 - 8

#multiplying

7*8

#dividing

7/8

#exponent

7^8

#square root

sqrt(78)

There are also many basic statistical functions that you can use for the variables in your data set. I'll list some below:

#average

mean(hitters$homeruns)

#standard deviation

sd(hitters$homeruns)

#variance

var(hitters$homeruns)

As long as you can remember your order of operations from elementary school and type them in correctly, R works great as a calculator. Since we often want to create new variables for our data (for example, maybe we want to calculate OPS), these can come in handy. We just need to add in an extra step and an assignment operator. Below, I calculate OPS and attach it as a new variable to the data set "hitters" in one swoop:

#creating OPS and attaching it

hitters$ops <- hitters$slug + hitters$onbase

head(hitters)

You can see that "ops" is now appended at the end of your data set. The "$" sign used to the left of the assignment operator indicates that you want to create this variable inside "hitters". Be careful not to name it something that already exists, or it will replace that column with your new values and you won't be able to recover the old ones unless you re-open your original .csv file.

Easy as that. R knows that when you type something in this way, you are using the math functions across the rows. We can calculate anything we want in this way individually for each player using a single line of code. Isn't that convenient! It's just like writing a function in Excel, except quicker. Play around with this yourself to get comfortable with it.
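Here is a self-contained sketch of the same idea on a made-up table, in case you want to experiment without touching your real data. The object "toy", the placeholder player names, and the on-base and slugging numbers are all invented for illustration. The check with "names()" is just one cautious way to avoid overwriting a column that already exists.

```r
# Hypothetical data: three placeholder players with invented rates
toy <- data.frame(
  last   = c("PlayerA", "PlayerB", "PlayerC"),
  onbase = c(0.350, 0.400, 0.300),
  slug   = c(0.500, 0.450, 0.550)
)

# Check that the name is free before creating the new column
"ops" %in% names(toy)

# The addition is done row by row, one value per player
toy$ops <- toy$slug + toy$onbase

# Other math functions work the same way across rows
toy$ops.rounded <- round(toy$ops, 2)

head(toy)
```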

At the end of your session, remember to save your functional R code again in your Script editor. Everything together should look something like this (I have REMOVED THE ATTACH COMMAND so that it doesn't attach the data next time you run the code):

(HAVING HTML PROBLEMS, WILL ADD THIS LATER)

Hey Millsy!

I just found your blogs and I absolutely love them so far.

Just a quick question.. in your sidetrack after you have used the as.data.frame() command did you mean to say dat.b$game.num? Because I followed what you wrote and still came up with an error, but when I tried using dat.b it worked.

Thanks for all of the work you put into these; it's really appreciated.

-Lauren

Ah. Good catch. Sorry about that. I fixed it up.

Really, so that you don't have a bunch of objects all at once, it's reasonable just to use the same name for the data, "dat.a". So, rather than naming the data frame "dat.b", name it "dat.a"; it will replace the original matrix under the same name, but be treated as a data frame. Then, the rest of the code should work.

I too have trouble with some functions not working without attaching the data. The plot command comes to mind. I would be interested in hearing others' experiences with this. Maybe in later blog posts there will be further comments.

Great blog. Thanks.

When you plot the data, be sure to use the argument "data=" followed by the name of your data. For example:

plot(y~x, xlab="", ylab="", main="", data=dataname)