Monday, January 31, 2011

sab-R-metrics: Some Extra Visualization Customization

Last post, I described a number of ways to show your data on a scatter plot. Ricky Zanker at THT has a similar post today for those looking to get some extra exposure and another take on R programming. Today, I plan to extend that with a little more customization. First, if you've missed the previous sab-R-metrics posts, CLICK HERE to see them all. The code provided below should also get your data file back to where it needs to be so you can follow along here.


#setting directory and opening Shaun Marcum 2010 Pitch F/X data

setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

marcum <- read.csv(file="marcum10.csv", h=T)
head(marcum)

#subset for main pitches
marcfx <- subset(marcum, marcum$start_speed > 0)
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]



When we left off last time, you most likely had something in your R code that produced the following graphs:


We worked with different text, points, and legends as you can see above. Depending on your purposes, text can be extremely useful. But you know what, I'm a little tired of these "Start Speed vs. End Speed" plots, so let's get into plotting Pitch Location instead. That way, we'll have a need for things like lines and shapes (to show the strike zone).

For the purposes of these posts, I won't normalize the strike zone height. I'll simply use the average top and bottom of the zone that Marcum saw for the entire season. All this means is that one needs to be a little more careful when determining the top and bottom of Marcum's strike zone. In addition, I often use the book strike zone on my plots. Others have shown that the actual 'strike zone' for umpires is about two feet across, which is a little wider than the plate. That's fine as well, but I prefer to have the book zone on my plots. Just keep that in mind when saying something is outside the strike zone. Let's start with some knowledge of the variables in our data set.

px: Horizontal location (in feet). Negative numbers mean the pitch is to the left of the center of the plate, while positive numbers indicate the pitch is to the right. Keep in mind my left/right comparison is from the catcher view.

pz: Vertical location (in feet). This is the height of the pitch from the ground when it crosses the front of the plate. There are some negative numbers here, which to my knowledge (those out there correct me if I am wrong) indicate a ball bouncing in front of the plate. Because the system extrapolates the pitch from a little ways before the front of the plate, we see these negative numbers.

sz_top: This indicates the top of the strike zone as relatively subjectively drawn by a person. If you fiddle with a full F/X database, you'll see that the top varies sometimes for the same player. Here, we'll use the overall average of this in Marcum's data for the top of the zone.

sz_bot: This is just like the top of the zone, but for the bottom. Should be drawn around the knees of the player.

pitcher_handedness: I think this is pretty straightforward. It is a string/text variable entered as "L" or "R". Remember that we'll be viewing from that catcher/umpire view in our plots.

batter_handedness: Again, pretty straightforward. It is a string/text variable entered as "L" or "R". Remember that we'll be viewing from that catcher/umpire view in our plots. To make things more complicated, from a catcher or umpire's view facing out toward the pitcher, a right-handed batter stands to the catcher's left, while a left-handed batter stands to the catcher's right.

balls: This indicates the number of balls in the count when the current pitch is thrown.

strikes: This indicates the number of strikes in the count when the current pitch is thrown (i.e. the pitch in a row with balls=0 and strikes=0 indicates that the pitch was thrown in an 0-0 count. If the pitch was called a strike, then the count after this pitch was thrown will be 0-1).

outs: This indicates the number of outs when the current pitch is thrown.

pitch_result: This gives a text interpretation of what happened for this pitch. For this post, I'll be conditioning on this variable simply for "Called Strike", "Ball", and "Swinging Strike".

result_type: This is a one-letter indication of the result of the pitch. S indicates a strike, B indicates a ball, and X indicates that the pitch was put in play.
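Before plotting, it can help to glance at these columns directly. The snippet below is only a sketch on a few made-up rows (the 'toy' data frame is hypothetical, not Marcum's real data); with the real file loaded you would point the same calls at 'shaun'.

```r
# Made-up rows that mimic the columns described above (not Marcum's real data)
toy <- data.frame(
  px = c(-0.42, 0.15, 1.30, -1.10),
  pz = c(2.10, 3.05, -0.20, 1.75),
  sz_top = c(3.40, 3.50, 3.40, 3.45),
  sz_bot = c(1.60, 1.55, 1.60, 1.65),
  pitch_result = c("Called Strike", "Swinging Strike", "Ball", "Ball"),
  stringsAsFactors = FALSE
)

range(toy$px)                           # horizontal spread, in feet
range(toy$pz)                           # a negative minimum is a pitch in the dirt
colMeans(toy[, c("sz_bot", "sz_top")])  # average bottom and top of the zone
table(toy$pitch_result)                 # number of pitches under each result
```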

Okay, now we're set. Let's use the skills we've been working on in the last bunch of posts and draw a standard scatter plot using the px and pz variables. Remember to make "pz" a function of "px" (or similarly, px=x and pz=y) when writing your code. For now, let's keep it basic and just plot all the pitches from the 'shaun' subset of the data using "plot()".

##create a basic pitch location plot
png(file="location1.png", height=675, width=540)

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", cex=2, pch=1)


dev.off()



As usual, I started pretty bland. But this is always good to do simply to make sure your data is plotting correctly before you go on with lots and lots of details. Again, we have some trouble distinguishing pitch type with the above plot, so we can either add text instead or vary the color/point type for each pitch as we did last time. I prefer the latter, but I show both below along with the code to make them.


##create location plots using a legend and using text

png(file="location2.png", height=675, width=1080)

par(mfrow=c(1,2))


plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", type="n")
text(shaun$px, shaun$pz, shaun$pitch_type, col=as.numeric(shaun$pitch_type))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))

legend(-2.95, 0.1, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()
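A note on the "col=as.numeric(shaun$pitch_type)" trick: it relies on pitch_type being stored as a factor. Since R 4.0, "read.csv()" leaves text columns as plain character strings by default, and "as.numeric()" on strings like "FF" returns NA. One way around this, sketched below on made-up data, is to look colors and symbols up by name. The lookup vectors 'pitch_cols' and 'pitch_pch' and the 'toy' data frame are hypothetical names chosen for illustration, and the colors are arbitrary.

```r
# Hypothetical lookup tables: one color and one point symbol per pitch code
pitch_cols <- c(CH = "red", CU = "darkgreen", FC = "blue", FF = "purple", FT = "orange")
pitch_pch  <- c(CH = 5, CU = 6, FC = 8, FF = 9, FT = 10)

# Made-up pitches standing in for the 'shaun' data frame
toy <- data.frame(
  px = c(-0.5, 0.3, 1.1, -1.2, 0.0, 0.7),
  pz = c(2.5, 1.8, 0.6, 3.9, 2.9, 2.2),
  pitch_type = c("FF", "CH", "CU", "FC", "FT", "FF"),
  stringsAsFactors = FALSE
)

# Indexing a named vector by the pitch codes returns one value per row,
# whether pitch_type is a factor or a character column
pt_col <- unname(pitch_cols[as.character(toy$pitch_type)])
pt_pch <- unname(pitch_pch[as.character(toy$pitch_type)])

pdf(NULL)  # device that writes no file; swap in png(...) as in the post
plot(toy$pz ~ toy$px, col = pt_col, pch = pt_pch,
     xlab = "Horizontal Location (ft.)", ylab = "Vertical Location (ft.)")
legend("bottomleft", c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"),
       col = pitch_cols, pch = pitch_pch)
dev.off()
```

Because the legend is built from the same two vectors as the points, the two cannot drift out of sync when the data changes.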



This much data isn't particularly enlightening, and it is why density-based heat maps are a better representation when dealing with larger data sets (I'll get to those a bit later, but I do provide the code for mine). To reduce our data a bit, why don't we go ahead and condition the data on being a called strike. See below for the code and the plot it should produce:

##condition the above on being a called strike
png(file="location3.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))

legend(-1.4, 3.9, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()
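A caution about conditioning inside the formula, as in the call above: the x and y values are cut down to the called strikes, but "col=" and "pch=" are still built from every row of 'shaun'. It appears that R then simply pairs the first colors in that longer vector with the plotted points in order, so a point's color need not match its pitch type. A sketch of one way to keep everything lined up, assuming a made-up 'toy' data frame and a hypothetical 'called' subset, is to subset the data frame first and build everything from it:

```r
# Made-up pitches standing in for 'shaun'
toy <- data.frame(
  px = c(-0.3, 1.6, 0.2, -1.8, 0.5),
  pz = c(2.4, 0.4, 3.0, 4.2, 1.9),
  pitch_type = c("FF", "CH", "CU", "FF", "FC"),
  pitch_result = c("Called Strike", "Ball", "Called Strike", "Ball", "Called Strike"),
  stringsAsFactors = FALSE
)

# Subset whole rows once, so positions, colors, and symbols share one source
called <- toy[toy$pitch_result == "Called Strike", ]

# Fixed levels keep the numeric codes stable no matter which pitches survive
type_code <- as.numeric(factor(called$pitch_type,
                               levels = c("CH", "CU", "FC", "FF", "FT")))

pdf(NULL)  # device that writes no file; swap in png(...) as in the post
plot(called$pz ~ called$px, col = type_code, pch = type_code + 3,
     xlab = "Horizontal Location (ft.)", ylab = "Vertical Location (ft.)")
dev.off()
```

Note that the codes here run 1 through 5, so the legend's "pch=" and "col=" values would need to be adjusted to match.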



Quick visualization quiz: do you notice any problems (outside of the fact that Yellow just isn't a good choice on a white background)? Before I get to drawing some shapes or lines to indicate a strike zone, I want to talk about customizing the axes. Notice in the above plot that the axes only go from -1.5 to 1.5 on the horizontal and 1.5 to 4 on the vertical. This is because the ranges of our px and pz values are much smaller for only pitches that are called strikes. This makes intuitive sense: called strikes should tend to be closer to the plate on average.

Sometimes, we may not be worried about this, as this range helps to distinguish the points in the plot a little better. However, when doing visualizations we often compare things to one another. Therefore, we want things to be exactly the same, except for the difference we are trying to highlight. Otherwise, the way people look at the plots could be biased. This is not only true when comparing across plots, but also when looking on the same plot. Therefore, we also need to consider our PNG file 'width' and 'height' to ensure that the tick marks on the X and Y axes represent the same distance between them. If we stretch the image too tall, this could make it look like vertical feet are actually longer than horizontal feet (in fact, this is the case in the plot above!). Of course, we know this isn't how it should look in real life! What to do?

Luckily, R is very flexible and also allows us to control the axes in our plots. For the simple "plot()" function, this is straightforward. For other functions downloaded from the CRAN site (I'll get into this later), this issue is not quite as straightforward. But let's start simple. While I like the above dimensions for a Called Strike only plot, let's put this next to a Called Ball only plot with the same dimensions.

Below I do this using the "xlim=c(,)" and "ylim=c(,)" options for our plotting function. In the "c(,)" portion, set the minimum and maximum values for each axis separated by a comma. Generally, I like to use -2 to 2 for the horizontal axis and 0 to 5 for the vertical axis in my heat maps. In this data and for a simple scatter plot, these dimensions won't cover every single pitch, but they will get pretty close. If you are unsatisfied with the dimensions below, fiddle around with your own...just keep in mind that I have the PNG device set up to make the axes have the same width between tick marks.


##create two plots with the same dimensions
png(file="location4.png", height=700, width=1080)

par(mfrow=c(1,2))


plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


plot(shaun$pz[shaun$pitch_result=="Ball"] ~ shaun$px[shaun$pitch_result=="Ball"], main="Shaun Marcum Called Ball Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)
dev.off()


In the code above, I already manually calibrated the PNG dimensions so that a foot covers roughly the same distance horizontally and vertically in the visual, given the data range we used (also, for a single plot with these dimensions, you'll want to use "height=675" and "width=540"). This is one of the reasons to be sure to use a graphical device: you can save the size and be sure that everything is comparable later on. Also, remember to set your legend somewhere that it won't overlay on your data. In the end, the visuals above look pretty much like we would expect, don't they? Called strikes cluster around the middle, with called balls making a big ring around them.
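The 675-by-540 choice lines up with the axis ranges: the vertical axis covers 5 feet and the horizontal covers 4, and 675/540 is the same 5-to-4 ratio. A rough sketch of that arithmetic follows. The helper 'png_height' is a hypothetical name, and it ignores the plot margins, so treat the result as a starting point rather than an exact answer.

```r
# Hypothetical helper: choose a PNG height so the axis ranges and the image
# share the same shape (plot margins are ignored here)
png_height <- function(width, xlim, ylim) {
  width * diff(ylim) / diff(xlim)
}

png_height(540, xlim = c(-2, 2), ylim = c(0, 5))  # 675

# Base graphics can also lock the ratio itself with asp=1, which makes one
# unit on the x-axis the same physical length as one unit on the y-axis
pdf(NULL)  # device that writes no file; swap in png(...) as in the post
plot(c(-2, 2), c(0, 5), type = "n", asp = 1,
     xlab = "Horizontal Location (ft.)", ylab = "Vertical Location (ft.)")
dev.off()
```

With "asp=1", R may widen one of the axis ranges beyond what "xlim" or "ylim" asked for in order to keep the ratio, so the two approaches trade off differently.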

Returning to just our "Called Strikes" plot, it is always instructive to put some sort of indication of a strike zone on the plot. For just drawing a simple zone, I like to use the book zone. For drawing onto a plot in the window, we can use "lines()" or "abline()" as a beginner step. However, the "rect()" function is a bit more convenient.

For the strike zone, we want to draw 4 different lines: one for the top of the zone, one for the bottom, and one for each side of the plate. Googling a bit will tell you that the width of the plate is 17 inches, or 1.4167 feet. Halfway across we have 0.708333 feet. Therefore we'll want to draw a line at -0.708333 and 0.708333 (the distance to the left and right of the center of the plate where the strike zone ends). In addition, we'll want to draw a line at the top and bottom of the zone, which here I'll indicate simply using the average of each for the entire data set.
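The plate arithmetic can be left to R rather than typed in by hand. The sketch below uses made-up "sz_bot" and "sz_top" values in a hypothetical 'toy' data frame; 'half_plate' and 'zone' are illustrative names as well.

```r
plate_width_ft <- 17 / 12         # 17 inches expressed in feet
half_plate <- plate_width_ft / 2  # distance from plate center to either edge

round(plate_width_ft, 4)  # 1.4167
round(half_plate, 6)      # 0.708333

# Made-up zone heights standing in for the real sz_bot and sz_top columns
toy <- data.frame(sz_bot = c(1.55, 1.60, 1.65), sz_top = c(3.40, 3.45, 3.50))

# The four edges of the book zone, collected in one named vector
zone <- c(left = -half_plate, right = half_plate,
          bottom = mean(toy$sz_bot), top = mean(toy$sz_top))
zone
```

If the real columns contain missing values, "mean()" returns NA unless you add "na.rm=TRUE".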


Beginning with the "abline()" function: with "h=" or "v=", this tells R to draw a straight horizontal or vertical line across your entire plot (it can also draw sloped lines from an intercept and a slope, but we won't need that here). After plotting your pitches, you can draw these lines right onto the plot using this function (as well as the others I will go over). Below is some code that will do this, which also divides the entire plot up into 9 boxes. I also include the parameters for the lines "lty=" and "lwd=". These specify the line type and line width, respectively. The default for lty is "solid", while the default for lwd is 1. We can do a number of line types, including dotted and dashed. Of course, we can again use the color parameter for lines.

##add in a strike zone using "abline()" first
png(file="location5.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


abline(h=mean(shaun$sz_bot), col="black", lty="dashed", lwd=2)

abline(h=mean(shaun$sz_top), col="black", lty="dashed", lwd=2)
abline(v=-0.708335, col="black", lty="dashed", lwd=2)

abline(v=0.708335, col="black", lty="dashed", lwd=2)


dev.off()



Notice above that for the lines drawn horizontally across the plot, I use the "abline(h=)" version, while vertical lines are drawn with the "abline(v=)". You can already see that the strike zone umpires in MLB call tends to stretch outside the 'book' zone by a few inches. Sometimes, we may not want the strike zone lines to extend throughout the entire plot. For this, we can make use of the "lines()" function just as we do with "abline()". However we'll need a few more specifications included here.

For this function, you'll need to specify not just where the line should be drawn vertically or horizontally, but also the endpoints for the line to stop at. We'll make use of the "c()" command here. For example, the first "lines()" code line says, "Draw a vertical line at 0.708335 on the x-axis from the average bottom strike zone to the average top strike zone on the y-axis and make it black, dashed, and a width of 2".

##use the "lines()" specification
png(file="location6.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))


legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


lines(c(0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, -0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_bot)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_top), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)

dev.off()
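Since "lines()" connects whatever points it is given in order, the four calls can also be collapsed into one by handing it the corners of the box and repeating the first corner at the end to close the path. This is only a sketch: the zone heights are made-up numbers, and 'box_x' and 'box_y' are hypothetical names.

```r
half_plate <- 17 / 12 / 2  # half the plate width, in feet
zone_bot <- 1.60           # made-up average bottom of the zone
zone_top <- 3.45           # made-up average top of the zone

# Corners in drawing order: bottom-left, bottom-right, top-right, top-left,
# then back to bottom-left to close the box
box_x <- c(-half_plate, half_plate, half_plate, -half_plate, -half_plate)
box_y <- c(zone_bot, zone_bot, zone_top, zone_top, zone_bot)

pdf(NULL)  # device that writes no file; swap in png(...) as in the post
plot(0, 0, type = "n", xlim = c(-2, 2), ylim = c(0, 5),
     xlab = "Horizontal Location (ft.)", ylab = "Vertical Location (ft.)")
lines(box_x, box_y, col = "black", lty = "dashed", lwd = 2)
dev.off()
```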



While the above code works great, it's not all that efficient. It takes a while to make sure we're drawing everything correctly. Luckily, R also allows us to simply draw a rectangle. Ultimately, the "rect()" function also allows us to fill it with color (and transparent color). However, I'll leave dealing with color and color palettes for another day. Here, we can use it to draw the strike zone in a single line of code. The parameters for the rectangle are similar to how we specified the lines above, but here the four numbers are, in order, the left edge, the bottom, the right edge, and the top of the box. The other arguments are as I described above, applied to the border of the rectangle, with "border="black"" saying that we want a black border. I don't post another plot, as it looks exactly the same as the previous one.

##now add in a strike zone using "rect()"
png(file="location7.png", height=675, width=540)


plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


rect(-0.708335, mean(shaun$sz_bot), 0.708335, mean(shaun$sz_top), border="black", lty="dashed", lwd=2)


dev.off()
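If you find yourself drawing the zone on many plots, the "rect()" call can be wrapped in a small function. The sketch below is one possible shape for that; 'draw_zone' is a hypothetical helper, not something from the post's original code, and the zone heights passed to it are made up.

```r
# Hypothetical helper around rect(); rect() takes xleft, ybottom, xright, ytop
draw_zone <- function(bottom, top, half_width = 17 / 12 / 2, ...) {
  rect(-half_width, bottom, half_width, top, ...)
  # Hand back the coordinates invisibly so they can be inspected or reused
  invisible(c(xleft = -half_width, ybottom = bottom,
              xright = half_width, ytop = top))
}

pdf(NULL)  # device that writes no file; swap in png(...) as in the post
plot(0, 0, type = "n", xlim = c(-2, 2), ylim = c(0, 5),
     xlab = "Horizontal Location (ft.)", ylab = "Vertical Location (ft.)")
zone <- draw_zone(1.60, 3.45, border = "black", lty = "dashed", lwd = 2)
dev.off()
```

With the real data you would pass "mean(shaun$sz_bot)" and "mean(shaun$sz_top)" in place of the two made-up heights.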



At this point, I have discussed most of what I had hoped with base graphics and scatter plots. There are a plethora of options I haven't discussed, and I have not gotten to the "lattice" package or the "ggplot2" package which can significantly improve plotting in R. Lattice is particularly useful for looking at multiple plots at once. Those are a bit down the road for now, and I am still a beginner with ggplot2.

Next time, I'll get to making lines using time series type data. From there, my hope is to get into some actual statistical analysis in R and some non-parametric methods (like loess or density estimation). Below, I've posted a pretty version of the code:

#########################################

####Line and ABlines, Points, Shapes, and Custom Axes
#########################################

#setting directory and opening Shaun Marcum 2010 Pitch F/X data
setwd("c:/Users/bmmillsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
marcum <- read.csv(file="marcum10.csv", h=T)
head(marcum)


#subset for main pitches
marcfx <- subset(marcum, marcum$start_speed > 0)
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]


##create graphs to review from last time
png(file="ReviewMarcum.png", height=850, width=1500)
par(mfrow=c(1,2))

plot(shaun$end_speed ~ shaun$start_speed, type="n", main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col=as.numeric(shaun$pitch_type), cex=2)

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), cex=2)
legend(68, 84, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=2)

dev.off()

##create a basic pitch location plot
png(file="location1.png", height=675, width=540)
plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", cex=2, pch=1)
dev.off()

##create a legend and use text
png(file="location2.png", height=675, width=1080)
par(mfrow=c(1,2))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", type="n")
text(shaun$px, shaun$pz, shaun$pitch_type, col=as.numeric(shaun$pitch_type))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(-2.95, 0.1, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##condition the above on being a called strike
png(file="location3.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(-1.4, 3.9, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##create two plots with the same dimensions
png(file="location4.png", height=700, width=1080)
par(mfrow=c(1,2))

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

plot(shaun$pz[shaun$pitch_result=="Ball"] ~ shaun$px[shaun$pitch_result=="Ball"], main="Shaun Marcum Called Ball Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##add in a strike zone using "abline()" first
png(file="location5.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

abline(h=mean(shaun$sz_bot), col="black", lty="dashed", lwd=2)
abline(h=mean(shaun$sz_top), col="black", lty="dashed", lwd=2)
abline(v=-0.708335, col="black", lty="dashed", lwd=2)
abline(v=0.708335, col="black", lty="dashed", lwd=2)

dev.off()

##use the "lines()" specification
png(file="location6.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

lines(c(0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, -0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_bot)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_top), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)

dev.off()


##now add in a strike zone using "rect()"
png(file="location7.png", height=675, width=540)

plot(shaun$pz[shaun$pitch_result=="Called Strike"] ~ shaun$px[shaun$pitch_result=="Called Strike"], main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

rect(-0.708335, mean(shaun$sz_bot), 0.708335, mean(shaun$sz_top), border="black", lty="dashed", lwd=2)

dev.off()



Fangraphs Heat Maps Using R?????

Fangraphs has a new capability with Pitch F/X data that I was hoping they would provide at some point in the near future. It looks like they're using the R function smoothScatter which I worked with here at my blog a while back. I also presented these at IIATMS. Some others have highlighted the use of the function for Pitch F/X analysis (including Dave Allen in a Fangraphs post and Harry Pavlidis in a THT post). Some notes on this function and the Fangraphs page:

1. I do like the color scheme on the plots.

2. As noted in the comments, there need to be some axis labels.

3. One of the things I disliked about the function was the difficulty with setting my axes correctly. Others have noted this problem when trying to use my R code through emails to me.

4. The smoothing parameter doesn't seem to work well for outlying points in the function. Therefore, you end up with dots on the outer edges, which kind of disrupts the point of the heat map display.

I was wondering if Fangraphs would implement this with Dave Allen and Albert Lyu on the staff there. I'm not sure if they used some of the basic code from my site, but it would be kind of cool. One of the things I wondered about sharing my ideas, releasing R code and building tutorials here for free was if it would reduce any sort of competitive edge I had knowing R and being able to reproduce things that Dave and Albert (and Jeremy Greenhouse) do at Baseball Analysts and Fangraphs. Interestingly, it has done the opposite and I've had some fantastic inquiries about the things I post here. I look forward to any improvements in the heat maps.

I think the Fangraphs feature is a great tool, but I am monitoring a conversation on Twitter that indicates some hardcore analysts are worried about the repercussions of non-experts using the maps. There is a lot of work to be done with respect to bias in data based on a number of factors. Mike Fast has highlighted this in the past. They're great to look at, but I agree that making too many inferences becomes dangerous. And these are simply location maps, which leaves the possible inferences to a minimum. Hopefully those looking at them understand this.

Addendum: Dave Allen tells me that R may be a bit slow for feeding these things through at Fangraphs. So I'm actually not sure now if R is being used.

Tuesday, January 25, 2011

Exploiting H2H Rules at Fantasy Ball Junkie

Another note for today: my next post in the "Exploiting Rules and Structures" series is up over at Fantasy Ball Junkie. This time, I take on intricacies of Head-to-Head Category leagues. In fact, there are 3 new posts today by all contributors there!

Here is the link again:

H2H Categories Leagues

sab-R-metrics: Intermediate Scatter Plots

First off, I'll say it's been a whirlwind of a past few days. Thanks to David Smith at the Revolutions Blog for his kind words about the sab-R-metrics series and link back this way. Add in Ed Kupfer's posts at the APBRmetrics board, Harry Pavlidis at THT, Dave Allen at Fangraphs and about 30 Twitterers, and I've seen a serious increase in site traffic. I've gotten a lot of great feedback on the blog and through email and I appreciate all of those who read this.

Last time, I left you with some code for creating box plots and histograms using the Shaun Marcum Pitch F/X data. For this post, I'll be using the same bunch of data sub-setted to include only those data with pitch speed/location information available. For those that have missed the last 5 posts or need to go get the data, the link below will take you to all of them:

sab-R-metrics Series

I'll try to use only functions and commands that were used in previous posts, so if you're not sure about something here, check the previous posts out. If you can't find it, feel free to comment or shoot me an email.

Go ahead and open up your data file and subset the data like the following (of course, calling from your OWN directory):

#set working directory, load data, and subset data
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")


marcum <- read.csv(file="marcum10.csv", h=T)

head(marcum)


marcfx <- subset(marcum, marcum$start_speed > 0)

head(marcfx)


The above should be exactly what we worked with last time. However, for today's purposes, I am going to clean the data up just a little bit more to only include Marcum's 5 main classified pitches: Four-Seam, Cutter, Change, Curveball and Two-Seam. This will make it a little easier to work with color in the plots later on. I'm going to just name this "shaun" to ensure there isn't any re-assigning of objects in R. To do this, you can use the following rough code (an extension from the 'sub-setting' tutorial):

#cleaning up the data
fourf <- subset(marcfx, marcfx$pitch_type=="FF") #use 'fourf' because ff is the name of a function in the 'ff' package
fc <- subset(marcfx, marcfx$pitch_type=="FC")

ft <- subset(marcfx, marcfx$pitch_type=="FT")

cu <- subset(marcfx, marcfx$pitch_type=="CU")

ch <- subset(marcfx, marcfx$pitch_type=="CH")


shaun_b <- as.data.frame(rbind(fourf, fc, ft, cu, ch))

head(shaun_b)


However, there is a slightly quicker way to do this (which I just discovered, believe it or not!). I've always had trouble with "or" statements in R. For "and" statements, you can just use "&" in many cases. However, the solution for "or" isn't as straightforward. I finally found a solution for myself here (one of the reasons I wanted to do these tutorials in the first place: so I can focus myself on learning some basic functions I may have missed in my intro stats with R class). I use "%in%" to indicate to R that I want to grab only those rows for which the pitch type is one of those listed above:

#one-line way to subset for these conditions
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]


ADDENDUM:
#another way to do 'or'
shaun_c<- subset(marcfx, marcfx$pitch_type=="CH" | marcfx$pitch_type=="CU" | marcfx$pitch_type=="FF" | marcfx$pitch_type=="FC" | marcfx$pitch_type=="FT")
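To check that the two approaches pick out the same rows, you can try them side by side on a small example. The sketch below uses a made-up 'toy' data frame with two extra pitch codes ("SL" and "IN") thrown in; 'main_types', 'by_in', 'by_or', and 'same_rows' are hypothetical names. It also shows "droplevels()", which clears out factor levels that no longer appear after subsetting.

```r
# Made-up pitches; pitch_type is stored as a factor, as older read.csv() would do
toy <- data.frame(
  pitch_type = c("FF", "SL", "CH", "IN", "CU", "FC", "FT", "FF"),
  start_speed = c(87.1, 80.2, 79.5, 65.0, 74.8, 84.3, 86.0, 87.9),
  stringsAsFactors = TRUE
)

main_types <- c("FF", "FC", "FT", "CH", "CU")

# %in% keeps rows whose pitch_type matches any entry of main_types
by_in <- toy[toy$pitch_type %in% main_types, ]

# The same filter written as a chain of "or" conditions
by_or <- subset(toy, pitch_type == "CH" | pitch_type == "CU" |
                  pitch_type == "FF" | pitch_type == "FC" | pitch_type == "FT")

# Both approaches keep the same original rows
same_rows <- identical(rownames(by_in), rownames(by_or))
same_rows

levels(by_in$pitch_type)   # still lists the dropped codes SL and IN
by_in <- droplevels(by_in)
levels(by_in$pitch_type)   # only the five codes that remain
```

Leftover levels matter for the plots because "as.numeric()" on a factor returns each level's position, unused levels included, and those positions are what set the colors and point symbols.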

Okay, now we're all set. Let's begin by reproducing the scatter plot we made for the Albert Pujols data, showing the ending speed on the y-axis and the starting speed on the x-axis. Don't forget to include axis labels and a title.

#plotting end speed as a function of starting speed
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")



As usual, the basic function is a little boring. Of course, the plot isn't particularly useful either. It simply tells us that the faster the ball comes out of the pitcher's hand, the faster it will be traveling when it crosses the plate. For now, that is okay with me. We see a nice linear relationship in the data. However, we can add a little more information to this plot by using color.

Sometimes you can use color to make things a little more exciting, but we need to be careful in this situation (thank you to David Smith for linking back using this horrid example of what NOT to do). Last time, I used the Brewers' colors for boxplots and histograms to brighten things up, but it was also useful in another way that I failed to mention. How? Well, let's say we have two box plots side-by-side. Plot titles should always be there, but if we're comparing Josh Beckett (Red Sox) and Shaun Marcum (Brewers) we can make the discrepancy more apparent and signal who is who with color. By filling the boxes with red and blue, respectively, it's easier for the reader to know which one is Beckett and which is Marcum, assuming they have some knowledge about team colors.

For now, let's continue with just making the points in our plot blue, and slowly get more advanced with the colors:

#blue scatter plot
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue")


Unfortunately, this does not convey as much information as it might if we had it next to a red Josh Beckett plot. If you don't like the open circle points in the graph, you can also change those with the "pch=" option. Fiddle around with different numbers if you'd like and see what you come up with. In addition, the points are a little small for my liking in the dimensions I have set up for my double-window png file. Therefore, I made them bigger using the "cex=" option. The default is 1, so setting this to a larger number will increase the size of your points. Below, I show larger versions of both filled circles and filled triangles:

#make a two-window scatter plot with different point types and larger sizes

par(mfrow=c(1,2))


plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
     xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=19, cex=2)

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
     xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=17, cex=2)

#go back to a single plot window for the plots that follow
par(mfrow=c(1,1))
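If you'd rather not guess at "pch=" numbers one at a time, here is a minimal sketch that draws symbols 0 through 25 on one chart with the number printed under each. The variable names (pch_values, x, y) and the 9-column layout are just my choices for illustration, and nothing here depends on the Marcum data.

```r
# Show plotting symbols 0 through 25 with their pch numbers underneath
pch_values <- 0:25
x <- (pch_values %% 9) + 1      # column position, 1 to 9
y <- 3 - (pch_values %/% 9)     # row position, 3 (top) down to 1 (bottom)

# bg= only affects the fillable symbols (pch 21 through 25)
plot(x, y, pch=pch_values, cex=2, col="blue", bg="lightblue",
     xlim=c(0.5, 9.5), ylim=c(0.5, 3.5), axes=FALSE, xlab="", ylab="",
     main="Point types available through pch=")

# Print each pch number just below its symbol
text(x, y - 0.3, labels=pch_values)
```

Symbols 19 and 17 are the filled circle and filled triangle used above.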




Now let's try to get a little more information out of our plots. One way to do this is to use the options in R to indicate which type of pitch lies where on our plot. We can do this in 3 simple ways: using color, using shapes, or using text. The first and second options go well together, while text on a plot is usually better suited for graphs that have fewer points (so that you can actually read it). Here, I'll just show what you can do with the "text()" function in visualizations. It is often useful for labeling points of interest as well. I will utilize an option in the "plot()" function that suppresses plotting the points on the graph: type="n". We can also include BOTH points and text, but with the current data that gets way too messy. All you would have to do is remove the type="n" option from the code below:

#draw text instead of points

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", type="n", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")

text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col="blue")



As you can see, the text is a bit much here, but it might come in handy later on. We can at least see that fastballs (FF) are faster and change-ups and curves (CH/CU) are slower. It's good to see that what we've done thus far at least makes sense! Now, let's try indicating pitch types using color and/or point types (pch). Hopefully, this will help to maximize the information we can get out of this data visualization. Let's begin with color.

For this, we will again use the "col=" option we learned last time; however, it will get a little more complicated. Here, we'll need to tell R to color code by pitch type. Luckily, if we tell R that "col=shaun$pitch_type" it will know what to do. Let's try it below. Doing things this way results in a problem that I want you to try and think about before heading to the next paragraph...

#adding color by pitch type

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)


We can see that the plot shows a different color for each of the 5 pitch types in our data set. Unfortunately, we don't know which is which! For this, we have to understand what R is doing when we use this option for the colors.

When we assigned colors last time, we could have also used numbers rather than the names of the colors. For example, if we want everything to be red, we can use the command "col=2". When we hand R a variable of category labels (a factor) for the colors, it numbers the categories in alphabetical order and uses those numbers. Here our data set had originally included 10 pitch types: "NA", "CH", "CU", "FA", "FC", "FF", "FT", "IN", "PO", and "SL" (to see this, simply use the code: summary(shaun$pitch_type)). Therefore, it assigned these 1 through 10. We removed all but "CH", "CU", "FC", "FF", and "FT", which kept their original numbers: 2, 3, 5, 6 and 7.

"CH" is the first in alphabetical order among the pitches we kept, and is assigned to "2", which is red. The 3 goes to "CU" (second in the alphabetical order of pitch types for Marcum). Therefore, these are green. This goes throughout the pitches that Marcum throws (FC=5 (turquoise), FF=6 (magenta, a pinkish-purple), FT=7 (yellow)). If we want to check this, we can also create a new numeric version of our pitch type variable with the following code:

#create numeric version of pitch type
shaun$p_type_b <- as.numeric(shaun$pitch_type)
head(shaun)
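To see why the numbers skip around (2, 3, 5, 6, 7 rather than 1 through 5), here is a small sketch on a made-up factor. The names toy_type and kept are mine and the values are not Marcum's data. One assumption worth flagging: all of this relies on pitch_type being a factor. In R 4.0.0 and later, "read.csv()" leaves text columns as plain character strings by default, so you may need to wrap the column in "factor()" (or pass stringsAsFactors=TRUE) before "as.numeric()" or "col=" will behave as described here.

```r
# Made-up stand-in for a pitch type column (not Marcum's data)
toy_type <- factor(c("CH", "CU", "FA", "FC", "FF", "FT"))

# Keep only some of the pitch types, as we did when building 'shaun'
kept <- toy_type[toy_type %in% c("CH", "FC", "FF")]

levels(kept)                    # all six levels are still attached
as.numeric(kept)                # 1 4 5: codes keep their original positions

# Dropping the unused levels renumbers the codes from 1
as.numeric(droplevels(kept))    # 1 2 3
```

Subsetting keeps the original level numbering, which is why the colors and shapes in our plots are tied to the positions in the full list of pitch types.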

Now that we know this, we need to figure out which number goes with which color in the R environment. I told you above because I had already snooped around. While I've found color keys for R online (showing the number and color name), they don't seem to match up with what I've said above. I imagine there are some R-Bloggers out there with more knowledge about the best way to decipher your colors than I. The best way to figure out which is which from what we have here is to use color AND text in your plot. When you have lots and lots of pitch types, this is tougher. But here with 5, we can do it pretty easily below.
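Another possible way to check, sketched here as a suggestion rather than the method used in this post: small whole numbers given to "col=" are looked up in R's current palette, which the "palette()" function returns. The names pal and n are mine, and the exact colors in the palette can differ between R versions, so look at your own output rather than trusting a list from somewhere else.

```r
# The colors that col=1, col=2, col=3, ... currently stand for
pal <- palette()
pal

# Draw one big point per palette entry with its number underneath
n <- length(pal)
plot(seq_len(n), rep(1, n), pch=19, cex=4, col=seq_len(n),
     axes=FALSE, xlab="", ylab="", ylim=c(0.5, 1.5),
     main="Colors assigned to col=1, col=2, ...")
text(seq_len(n), rep(0.8, n), labels=seq_len(n))
```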

#use text and color so we know what is what

plot(shaun$end_speed ~ shaun$start_speed, type="n", main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)

text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col=as.numeric(shaun$pitch_type))



Okay, now that we know what is what and now that we know we don't want to have all text on the plot, how about we use both color AND point shapes to indicate pitch types? Just like the colors, we can use the "shaun$pitch_type" vector to make different shapes on our plot. The code below will do this, with an important part of the graph missing:

#points and colors for pitch types

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))





The graph above is missing one thing: A Legend! Too often I see plots that don't tell me what each point or color actually means. Since we've done our own snooping around into the color and shape assignments, let's let the readers know what the hell we're talking about. There are a few ways to include a legend, but I prefer using the "legend()" function. I know Dave Allen generally uses colored and stacked text which is also cool. You can try doing that on your own with the "text()" function.

With the "legend()" function, we first want to indicate where we'll put the legend on our graph. Choose a placement by giving a point on the x axis and a point on the y axis as the first two arguments of the function. This can be done like this:

"legend(x,y)"

where x and y are points on your axes. Of course, we'll need to specify some other options for something to show up. After our x-y coordinates for the legend, we want to specify what the colors and point types are telling us about the pitch types.

Below, I have the code for creating a legend based on our data and the plot using pch and col as in the other plotting functions that we have talked about. First, you need to plot your data, then use "legend()" to add it on just like we did with the "text()" function above.

#make a plot and add a legend for color and point types

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(68, 84, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7))
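Typing the legend numbers by hand works, but it is easy for them to drift away from what the plot actually uses. Here is a hedged sketch of one alternative: build the legend from the same factor codes that drive "col=" and "pch=". The toy data frame and the names codes and used are made up for illustration (these are not Marcum's pitches), and I use the "topleft" keyword instead of x-y coordinates just to keep the example short.

```r
# Made-up stand-in data so the sketch runs on its own (not Marcum's pitches)
toy <- data.frame(
  start_speed = c(88, 87, 80, 79, 74, 73, 85, 86, 87, 88),
  end_speed   = c(81, 80, 74, 73, 68, 67, 78, 79, 80, 81),
  pitch_type  = factor(c("FF", "FF", "CH", "CH", "CU", "CU", "FC", "FC", "FT", "FT"))
)

codes <- as.numeric(toy$pitch_type)   # numeric code for every pitch
used  <- sort(unique(codes))          # codes that actually appear in the data

plot(toy$end_speed ~ toy$start_speed, col=codes, pch=codes + 3,
     xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")

# Legend labels, colors, and shapes all come from the same codes as the plot
legend("topleft", legend=levels(toy$pitch_type)[used], col=used, pch=used + 3)
```

The labels here are the raw pitch codes ("CH", "CU", and so on). If you want the full names, you would still have to supply them yourself in the same order.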



Unfortunately, this post has gotten rather long, and I'd prefer not to put too much into a single post. I'd recommend playing with the different colors and point types in R to find what you like best. Since this is a long post, I'll save points, lines (and line/time plots), shapes, and custom axes for next time. At that point, maybe we can start getting into some basic statistics and smoothing for our visualizations. The code from today is posted below:


################################

########Marcum Scatterplots and Shapes and Lines and Stuff
################################

#setting directory and opening Shaun Marcum 2010 Pitch F/X data
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
marcum <- read.csv(file="marcum10.csv", h=T)
head(marcum)

#keep pitches with a recorded speed, then drop rows with missing values
marcfx <- subset(marcum, marcum$start_speed > 0)
marcfx <- na.omit(marcfx)

#cleaning up the data
fourf <- subset(marcfx, marcfx$pitch_type=="FF")
fc <- subset(marcfx, marcfx$pitch_type=="FC")
ft <- subset(marcfx, marcfx$pitch_type=="FT")
cu <- subset(marcfx, marcfx$pitch_type=="CU")
ch <- subset(marcfx, marcfx$pitch_type=="CH")
shaun <- as.data.frame(rbind(fourf, fc, ft, cu, ch))
head(shaun)

#one-line way to subset for these conditions
shaun.b <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]


#plotting end speed as a function of starting speed
png(file="MarcScatter1.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")
dev.off()


#make the dots blue
png(file="MarcScatter2.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue")
dev.off()


#make the dots blue and filled in, then blue and triangles
png(file="MarcScatter3.png", height=1200, width=2000)
par(mfrow=c(1,2))
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=19, cex=2)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col="blue", pch=17, cex=2)
dev.off()


#draw text instead of points
png(file="MarcScatter4.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed", type="n",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)")
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col="blue")
dev.off()


#color points by pitch type
png(file="MarcScatter5.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
dev.off()


##UH OH, WHAT IS WHAT PITCH?
#create numeric version of pitch type
shaun$p_type_b <- as.numeric(shaun$pitch_type)
head(shaun)


##use text and color
png(file="MarcScatter6.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, type="n", main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col=as.numeric(shaun$pitch_type))
dev.off()


##points and colors for pitch types (no legend yet)
png(file="MarcScatter7.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
dev.off()


##now that we know what is what, we can make a legend and just use the points and color
png(file="MarcScatter8.png", height=850, width=1000)
plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(68, 84, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7))
dev.off()
