Thursday, January 20, 2011

sab-R-metrics: Intermediate Boxplots and Histograms

Last week, I began talking about using the base graphics in R. Those graphics were pretty bland, and my hope for the next two posts is to introduce some interesting additions to the basic graphics that come from R: color, legends, lines, shapes, multiple graphs side-by-side, text, point types, and custom axes. If you have missed any of the previous sab-R-metrics posts, go ahead and start at the beginning, working your way through the posts below:

1. sab-R-metrics: Introduction
2. sab-R-metrics: Basics of Vectors and Data Calling
3. sab-R-metrics: Subsetting, Conditional Statements, 'tapply()' and 'for loops'
4. sab-R-metrics: Beginning with Boxplots, Scatterplots and Histograms


Last time, I did more with box plots than with histograms and scatterplots. I plan to work with box plots and histograms today in a little more detail, showing how the different graphing options work for each. Next time, I'll work with scatterplots, as there are a number of other things we can do with those.

I'll begin again with box plots. However, we're going to use new data this time. Hopefully, this will get you in practice with getting data, saving it correctly and in the right place, and calling it back up in R. For this post, I'll be using Pitch F/X data again (because it's so easily available at Joe Lefkowitz's site). I'm a big Shaun Marcum fan, so let's use his 2010 data. Click here for the direct link to the data.

Don't forget to change all the variable names at the top to lowercase letters and underscores (_) instead of spaces. Then, remember to save it in a directory that is convenient for you with an informative, yet simple file name. I named mine 'marcum10.csv'. Go ahead and open it up in R. My code calling the data and checking to make sure it imported correctly is below. I also subset the data so that we keep only the pitches that have the Pitch F/X information we want to use, leaving out observations with missing values:

#setting directory and opening Shaun Marcum 2010 Pitch F/X data

setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
marcum <- read.csv(file="marcum10.csv", header=TRUE)
head(marcum)
marcfx <- subset(marcum, marcum$start_speed > 0)
head(marcfx)
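If you would rather not rename the columns by hand, the renaming and subsetting steps can also be done inside R. Here is a minimal sketch on a tiny made-up data set (the three rows and the names 'raw_text', 'toy' and 'toy_fx' are my own stand-ins, not Marcum's actual data), assuming the raw file has headers with spaces in them such as "Start Speed":

```r
# Made-up stand-in for the downloaded CSV; headers contain spaces,
# and the second pitch has no recorded speed.
raw_text <- "Start Speed,Pitch Type,Inning
88.1,FF,1
,FF,1
80.2,CH,2"

# read.csv() turns the spaces in the headers into periods by default
toy <- read.csv(text = raw_text, header = TRUE)

# lowercase the names and swap the periods for underscores
names(toy) <- tolower(gsub(".", "_", names(toy), fixed = TRUE))
names(toy)

# keep only pitches with a recorded speed; subset() drops rows where
# the condition is NA as well as rows where it is FALSE
toy_fx <- subset(toy, start_speed > 0)
toy_fx
```

The same two lines (the 'names()' assignment and the 'subset()' call) should carry over to the real file, provided its headers only differ from what we want by capitalization and spaces.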


Okay, now we're ready to have some fun. First, go ahead and make a box plot of Marcum's four-seam fastball velocity by inning (using just the pitches coded 'FF'). Use the version we left off with when working with the Pujols data last time, but change the title of the plot so that it names the correct player. Here is what it should look like to start:

##boxplots & histogram of four-seam fastballs ('FF') only with width by num. of obs.

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"], xlab="Inning", ylab="Speed Out of Hand", main="Shaun Marcum Fastball Speed by Inning", varwidth=T)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed")
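Since the box widths under 'varwidth=T' depend on how many pitches fall in each inning, it can be handy to look at those counts directly. The sketch below uses a small made-up data frame (the 'toy' data and its values are my own invention, not Marcum's pitches) and the 'data=' and 'subset=' arguments of the formula version of "boxplot()", which save retyping the data frame name. Setting 'plot=FALSE' asks for the summary numbers without drawing anything:

```r
# Made-up pitches: five fastballs ("FF") and two changeups ("CH")
toy <- data.frame(
  pitch_type  = c("FF", "FF", "FF", "CH", "FF", "FF", "CH"),
  inning      = c(1, 1, 1, 1, 2, 2, 2),
  start_speed = c(88.5, 87.9, 88.2, 80.1, 87.0, 86.6, 79.8)
)

# Same kind of formula as in the post, but letting boxplot() do the subsetting
box_stats <- boxplot(start_speed ~ inning, data = toy,
                     subset = pitch_type == "FF", plot = FALSE)

box_stats$names  # the group labels (innings)
box_stats$n      # number of fastballs behind each box
```

Dropping 'plot=FALSE' and adding the labels and 'varwidth=T' from the post should draw the same kind of picture as the longer call that uses the dollar signs and brackets.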


Again, the 'varwidth=' option is pretty neat as we can see how many pitches Marcum throws each inning. But the plots are still pretty boring. Perhaps we should begin here by adding some color. For this, we'll use the "col=" option. Marcum is a newly minted Brewer, so let's use the old school Brewer blue (or something close to it). I've also added the histogram with color below:

##adding color to the boxplot

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"], xlab="Inning", ylab="Speed Out of Hand", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col="blue3")

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col="blue3")


Using this one color didn't do too much, so maybe we want to use both of the Brewers' old colors. Let's add some gold:

##add another color

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col=c("blue3", "gold"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))


Notice that for multiple colors, we use the "c()" function, which tells R that we want to use a vector of colors. When there are fewer colors than boxes (or bars), R recycles the vector from the start, so our two colors alternate, and voila, pretty! Of course, we can also go overboard. For example, this code:
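To see why two colors end up alternating, here is a small sketch of the recycling idea using "rep_len()". This only mimics what I understand the plotting functions to do with a short color vector; the names 'fill_colors', 'n_boxes' and 'box_fill' are just mine for illustration, and nine boxes is an assumption (one per inning):

```r
fill_colors <- c("blue3", "gold")  # the two colors passed to col=
n_boxes <- 9                       # assumed number of boxes, one per inning

# repeat the short vector until there is one color per box
box_fill <- rep_len(fill_colors, n_boxes)
box_fill
```

With an odd number of boxes, the first and last box get the same color, which is worth remembering if you ever want the colors to mean something rather than just look nice.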

##ugly rainbow color

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"], xlab="Inning", ylab="Speed Out of Hand", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col=c("red", "blue", "green", "yellow", "purple", "orange"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("red", "blue", "green", "yellow", "purple", "orange"))


Honestly, this is hideous. The lesson here is that you CAN use too much color. Remember to use color to help portray the information you are trying to communicate, rather than just to make things bright. There are a LOT of color options in R, which I will go over more specifically in a later post, taking advantage of RColorBrewer and transparency options for the colors. When creating a graphic, the idea is to consolidate the information that you want to communicate into an easy-to-see picture that really makes your point stand out. In our box plot example, the color does not give us a whole lot other than sprucing things up a bit. That is fine: in my opinion it makes the plot a little easier to look at while still showing our main point here, which is that Shaun Marcum's fastball velocity goes down a bit as the game progresses (leaving aside, for now, the sampling bias that comes from the fact that he is likely allowed to stay in longer on days when he has a bit more stamina).
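If you are curious which color names R will accept in the meantime, the built-in "colors()" function lists them. A quick sketch (the object names 'all_colors' and 'blues' are just mine):

```r
all_colors <- colors()   # every built-in color name R knows
length(all_colors)

# pull out just the names containing "blue"
blues <- grep("blue", all_colors, value = TRUE)
head(blues)

# check that the two Brewers colors used in this post are valid names
c("blue3", "gold") %in% all_colors
```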

Okay, so we've got a little color in the plots, but there are other color options as well. For example, perhaps we don't want the borders of the boxes to be the black default. We can go ahead and make the Brewers colors alternate with the borders as well (though, the gold is a bit ugly here on both plots). I also change the axis labels using the "names=c()" option in the "boxplot()" function to say "1st, 2nd..." instead of just numbering them.

##adjusting box border colors

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"], xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col=c("blue3", "gold"), border=c("gold", "blue3"), names=c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"), border=c("gold", "blue3"))


There are a lot of options for adjusting your box plots, and I again would recommend keeping things simple. I am not a fan of the extra border color, but it is up to you to convey your information as you please.
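One caution on the "names=c()" option: the labels are typed in by hand, so they need to line up with the innings that actually show up in the data. A hedged sketch of building the labels from the data instead is below. The 'ordinal()' helper is hypothetical (I wrote it for this example; it is not part of R), and the 'toy' data are made up, covering only innings 1 through 7:

```r
# Hypothetical helper (not part of base R): 1, 2, 3 -> "1st", "2nd", "3rd"
ordinal <- function(n) {
  suffix <- ifelse((n %% 100) %in% 11:13, "th",
            ifelse(n %% 10 == 1, "st",
            ifelse(n %% 10 == 2, "nd",
            ifelse(n %% 10 == 3, "rd", "th"))))
  paste0(n, suffix)
}

# Made-up fastball speeds for innings 1 through 7 only
set.seed(1)
toy <- data.frame(inning = rep(1:7, each = 5),
                  start_speed = rnorm(35, mean = 87, sd = 1))

# one label per inning that actually appears in the data
inning_labels <- ordinal(sort(unique(toy$inning)))
inning_labels

pdf(NULL)  # throwaway device so this runs anywhere; remove to draw on screen
boxplot(start_speed ~ inning, data = toy, names = inning_labels,
        xlab = "Inning", ylab = "Speed Out of Hand (mph)",
        col = c("blue3", "gold"))
invisible(dev.off())
```

This way a pitcher who never makes it to the 9th does not end up with nine labels for seven boxes.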

As you have seen from the plots I display here, I have both the box plot and histogram in the same window side-by-side. This is a very useful feature in R, allowing us to compare certain graphs to one another (for example, pitch locations of fastballs and curveballs side-by-side...but I'll get to these plots later). To make plots within the same window, we use the "par(mfrow=c(,))" function. Here, the number of rows of graphs you want in your plot is indicated by the first number in the "c(,)" command, while the number of columns is after the comma (just like when calling vectors and cells from our data). So, if we want two graphs side by side, we'd type:

#side-by-side graphs
par(mfrow=c(1,2))

boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"], xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col=c("blue3", "gold"), border=c("gold", "blue3"), names=c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"), border=c("gold", "blue3"))


Now, remember that the order of your graphs will go left to right, top to bottom, so be sure to order them the way you want after the first command. The above plot should look just like the previous one shown for the new border colors.
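One thing to keep in mind is that the "par(mfrow=c(,))" setting stays in place for that graphics window until you change it again, so later plots keep landing in the same grid. A common way to handle this, sketched below with made-up speeds (the names 'toy_speed', 'old_par', 'layout_during' and 'layout_after' are mine), is to save the old settings that "par()" hands back and restore them afterward:

```r
set.seed(1)
toy_speed <- rnorm(100, mean = 87, sd = 1.5)  # made-up fastball speeds

pdf(NULL)  # throwaway device so this runs anywhere; remove to draw on screen

old_par <- par(mfrow = c(1, 2))   # par() returns the settings it replaced
layout_during <- par("mfrow")     # now 1 row, 2 columns

boxplot(toy_speed, main = "Box Plot", ylab = "Speed Out of Hand (mph)")
hist(toy_speed, main = "Histogram", xlab = "Speed Out of Hand (mph)")

par(old_par)                      # back to one plot per window
layout_after <- par("mfrow")

invisible(dev.off())
```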

We can also make the graphs stacked on top of one another (less desirable in my opinion), or in a 2-by-2 matrix of plots.

#stacked graphs
par(mfrow=c(2,1))

#2-by-2 matrix of graphs
par(mfrow=c(2,2))


And you can extrapolate from here, adding each graph just after the command. Of course, you don't want to put too many graphics within the same window, as they would be too small or would try to convey too much information at once. I usually limit myself to 1-by-3 or 2-by-2 at most in a single window. But again, the optimal way of displaying your points really depends on the type of data you are working with. As I've said throughout, keep things easy to read so that they convey your message as simply and concisely as possible.

The last thing I want to go over today is saving your graphics. R has its own window where the graphics pop up; however, copying and pasting these into Word and other places can make them blurry. In addition, if you adjust the window size (sometimes you'll want to do that to make them look right), you may never be able to get that exact window size back again by the click-and-drag procedure. Therefore, it is usually a good idea to save your graphics using an R graphical device. I'll talk about two of them here.

The first one I learned to use was for creating a PDF of your graphic. However, I later learned that using .png files is the best for keeping your graphics crisp. Both have advantages I'll talk about later on, but for single graphics, PNG seems to be the way to go. The most important part about saving using these graphical devices is that you can ALWAYS make the graphics the same size, rather than rescaling them in the R window or later on in Word or another program. When we want to save a graphic, we'll start with the following:

##USING PNG and PDF to save graphics

#saving graphic as PDF (height and width are INCHES!)
pdf(file="marcumhist.pdf", height=6, width=7)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))

dev.off()


#OR saving graphic as PNG (height and width are PIXELS by default)
png(file="marcumhist.png", height=750, width=750)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))

dev.off()

Always be sure to turn off your device after you are finished with it. While you will not see your graph within R, you can go to your working directory and find the file that you just created. You'll want to fiddle around with the height and width depending on your graphic to get it looking the way you'd like. There are a number of other options for each device, and I would suggest using "help(pdf)" or "help(png)" for more information on how to adjust the background and foreground colors, among other options. For the PDF device, it will continue to add pages to the PDF file, which is convenient in certain circumstances I'll talk about later. However, if you leave the PNG device open, each time you run a new graph, it will overwrite the previous one. Remember that you can also use the "par(mfrow=c())" function within the device to make side-by-side graphs in the same window or on the same page in a PDF.
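To make the multi-page PDF behavior concrete, here is a small sketch that writes two plots to one PDF, one per page. The speeds are made up, and the file goes to R's temporary directory rather than your working directory (the names 'out_file' and 'toy_speed' are mine, not part of the Marcum code):

```r
set.seed(1)
toy_speed <- rnorm(200, mean = 87, sd = 1.5)  # made-up fastball speeds

# write to a temporary folder so nothing in your working directory is touched
out_file <- file.path(tempdir(), "toy_plots.pdf")

pdf(file = out_file, height = 6, width = 7)   # inches

# first plot goes on page 1
hist(toy_speed, freq = FALSE, col = "blue3",
     xlab = "Speed Out of Hand (mph)", main = "Page 1: Histogram")

# second plot starts page 2 of the same file
boxplot(toy_speed, col = "gold",
        ylab = "Speed Out of Hand (mph)", main = "Page 2: Box Plot")

invisible(dev.off())   # close the device so the file is finished

file.exists(out_file)
```

Swapping in a file name of your own, without the "tempdir()" part, saves the PDF to your working directory just like the examples above.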

Next time, we'll work with scatterplots and really get into adding text, legends, lines and shapes, and different types of points to your plots. After that, we'll get to some more advanced graphing and adjusting colors and color palettes. Today's code should look something like this:


setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
marcum <- read.csv(file="marcum10.csv", header=TRUE)
head(marcum)

marcfx <- subset(marcum, marcum$start_speed > 0)
head(marcfx)

##boxplots of four-seam fastballs ('FF') only with width by num. of obs.

png(file="nocolor.png", height=600, width=1000)
par(mfrow=c(1,2))
boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning", varwidth=T)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed")


dev.off()

##adding color to the boxplot

png(file="blue.png", height=600, width=1000)
par(mfrow=c(1,2))
boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning", varwidth=T, col="blue3")

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col="blue3")

dev.off()

##add another color

png(file="bluegold.png", height=600, width=1000)
par(mfrow=c(1,2))
boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning",
varwidth=T, col=c("blue3", "gold"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))

dev.off()

##ugly rainbow color

png(file="uglyrainbow.png", height=600, width=1000)
par(mfrow=c(1,2))
boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning",
varwidth=T, col=c("red", "blue", "green", "yellow", "purple", "orange"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("red", "blue", "green", "yellow", "purple", "orange"))

dev.off()


##adjusting box border colors

png(file="newborders.png", height=600, width=1000)
par(mfrow=c(1,2))
boxplot(marcfx$start_speed[marcfx$pitch_type=="FF"] ~ marcfx$inning[marcfx$pitch_type=="FF"],
xlab="Inning", ylab="Speed Out of Hand (mph)", main="Shaun Marcum Fastball Speed by Inning",
varwidth=T, col=c("blue3", "gold"), border=c("gold", "blue3"),
names=c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th", "9th"))

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)",
main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"), border=c("gold", "blue3"))

dev.off()

##USING PNG and PDF to save graphics

#saving graphic as PDF (height and width are INCHES!)
pdf(file="marcumhist.pdf", height=6, width=7)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))

dev.off()


#OR saving graphic as PNG (height and width are PIXELS by default)
png(file="marcumhist.png", height=750, width=750)

hist(marcfx$start_speed[marcfx$pitch_type=="FF"], freq=FALSE, xlab="Speed Out of Hand (mph)", main="Histogram of Marcum Fastball Speed", col=c("blue3", "gold"))

dev.off()


4 comments:

  1. Great post, your tutorials are really helping me out. Just one quick question: is there a quick way to add labels to outliers beyond the whiskers in box plots? i've been poking around, but haven't been able to come up with anything that works.

  2. in case anyone else is wondering about this, i found a quick and dirty way to do this -- just run identify(dataset$xvar, plots$yvar, rownames(dataset)) after generating the plot.
    this allows you to label the outliers with a left click directly in the plot.

  3. Thanks for the comments and additions! Glad you were able to figure things out. Not sure I would have been much help there.

  4. The Marcum link appears to be down
    The requested URL /pitcher_card.php was not found on this server.

    I did a script sometime ago on outliers. this was the start point
    http://stackoverflow.com/questions/7929542/boxplot-outlier-labeling-in-r
