Monday, January 31, 2011

sab-R-metrics: Some Extra Visualization Customization

In the last post, I described a number of ways to show your data on a scatter plot. Ricky Zanker at THT has a similar post today for those looking to get some extra exposure and another take on R programming. Today, I plan to expand on this with a little more customization. First, if you've missed any of the previous sab-R-metrics posts, go back and work through them before continuing. The code provided below should also get your data file back to where it needs to be to follow along here.


#setting directory and opening Shaun Marcum 2010 Pitch F/X data

setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

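#stringsAsFactors=TRUE keeps pitch_type a factor (R 4.0 and later read text columns as character by default)
marcum <- read.csv(file="marcum10.csv", header=TRUE, stringsAsFactors=TRUE)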
head(marcum)

#subset for main pitches
marcfx <- subset(marcum, marcum$start_speed > 0)
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]



When we left off last time, you most likely had something in your R code that produced the following graphs:


We worked with different text, points, and legends as you can see above. Depending on your purposes, text can be extremely useful. But you know what, I'm a little tired of these "Start Speed vs. End Speed" plots, so let's get into plotting Pitch Location instead. That way, we'll have a need for things like lines and shapes (to show the strike zone).

For the purposes of these posts, I won't normalize the strike zone height. I'll simply use the average top and bottom of the zone that Marcum saw for the entire season. All this means is that one needs to be a little more careful when determining the top and bottom of Marcum's strike zone. In addition, I often use the book strike zone on my plots. Others have shown that the actual 'strike zone' for umpires is about two feet across, which is a little wider than the plate. That's fine as well, but I prefer to have the book zone on my plots. Just keep that in mind when saying something is outside the strike zone. Let's start with some knowledge of the variables in our data set.

px: Horizontal location (in feet). Negative numbers mean the pitch is to the left of the center of the plate, while positive numbers indicate the pitch is to the right. Keep in mind my left/right comparison is from the catcher view.

pz: Vertical location (in feet). This is the height of the pitch from the ground when it crosses the front of the plate. There are some negative numbers here, which to my knowledge (those out there correct me if I am wrong) indicate a ball bouncing in front of the plate. Because the system extrapolates the pitch from a little ways before the front of the plate, we see these negative numbers.

sz_top: This indicates the top of the strike zone as relatively subjectively drawn by a person. If you fiddle with a full F/X database, you'll see that the top varies sometimes for the same player. Here, we'll use the overall average of this in Marcum's data for the top of the zone.

sz_bot: This is just like the top of the zone, but for the bottom. Should be drawn around the knees of the player.

pitcher_handedness: I think this is pretty straightforward. It is a string/text variable entered as "L" or "R". Remember that we'll be viewing from that catcher/umpire view in our plots.

batter_handedness: Again, pretty straightforward. It is a string/text variable entered as "L" or "R". Remember that we'll be viewing from that catcher/umpire view in our plots. To make things more complicated, from a catcher or umpire's view facing out toward the pitcher, a right-handed batter stands to the catcher's left, while a left-handed batter stands to the catcher's right.

balls: This indicates the number of balls in the count when the current pitch is thrown.

strikes: This indicates the number of strikes in the count when the current pitch is thrown (i.e. the pitch in a row with balls=0 and strikes=0 indicates that the pitch was thrown in an 0-0 count. If the pitch was called a strike, then the count after this pitch was thrown will be 0-1).

outs: This indicates the number of outs when the current pitch is thrown.

pitch_result: This gives a text interpretation of what happened for this pitch. For this post, I'll be conditioning on this variable simply for "Called Strike", "Ball", and "Swinging Strike".

result_type: This is a one-letter indication of the result of the pitch. S indicates a strike, B indicates a ball, and X indicates that the pitch was put in play.
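
Before plotting, it can help to confirm that these columns look the way the descriptions above say. The sketch below uses a tiny made-up data frame (the name "toy" and every value in it are hypothetical, not Marcum's data) to show the kind of quick checks that apply equally well to the real file.

```r
# Hypothetical stand-in for a few rows of Pitch F/X data (values are made up)
toy <- data.frame(
  px = c(-0.9, 0.1, 0.6, 1.8, -0.2),
  pz = c(2.4, 3.1, -0.3, 1.2, 2.0),
  pitch_result = c("Called Strike", "Ball", "Ball", "Ball", "Swinging Strike"),
  stringsAsFactors = TRUE
)

# How many pitches of each result type are there?
result_counts <- table(toy$pitch_result)
print(result_counts)

# Negative heights: pitches projected below ground level at the plate
n_below_ground <- sum(toy$pz < 0)

# Range of horizontal locations, handy when choosing axis limits later
px_range <- range(toy$px)
```

On the real data, the same three lines with "shaun" in place of "toy" give a quick sanity check before any plotting.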

Okay, now we're set. Let's use the skills we've been working on in the last bunch of posts and draw a standard scatter plot using the px and pz variables. Remember to make "pz" a function of "px" (or similarly, px=x and pz=y) when writing your code. For now, let's keep it basic and just plot all the pitches from the 'shaun' subset of the data using "plot()".

##create a basic pitch location plot
png(file="location1.png", height=675, width=540)

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", cex=2, pch=1)


dev.off()



As usual, I started pretty bland. But this is always good to do simply to make sure your data is plotting correctly before you go on with lots and lots of details. Again, we have some trouble distinguishing pitch type with the above plot, so we can either add text instead or vary the color/point type for each pitch as we did last time. I prefer the latter, but I show both below along with the code to make them.


##create location plots using a legend and using text

png(file="location2.png", height=675, width=1080)

par(mfrow=c(1,2))


plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", type="n")
text(shaun$px, shaun$pz, shaun$pitch_type, col=as.numeric(shaun$pitch_type))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))

legend(-2.95, 0.1, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()
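
The legend above hard-codes "pch=c(5,6,8,9,10)" and "col=c(2, 3, 5, 6, 7)" because those are the integer codes R assigns to the five pitch types, plus 3 for the point symbol. A minimal sketch of where such numbers come from, assuming a pitch_type factor with two extra levels that the subset removes (the empty label and "FA" are made up for illustration; the real file's levels may differ):

```r
# Hypothetical raw pitch_type column; "" and "FA" are invented extra levels
raw <- c("", "CH", "CU", "FA", "FC", "FF", "FT", "FF", "CH")
pitch_type <- factor(raw)
levels(pitch_type)

# Subsetting with %in% keeps all of the original levels, used or not
main <- pitch_type[pitch_type %in% c("FF", "FC", "FT", "CH", "CU")]

# Integer codes of the pitch types that remain: these are the col= values
codes <- sort(unique(as.numeric(main)))

# Adding 3 gives the point symbols used in the plot and the legend
pch_codes <- codes + 3

# Dropping unused levels would renumber the codes to 1 through 5
relabeled <- sort(unique(as.numeric(droplevels(main))))
```

If the codes in your own data differ, run "levels(shaun$pitch_type)" and adjust the legend to match.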



This much data isn't particularly enlightening, and it is why density-based heat maps are a better representation when dealing with larger data sets (I'll get to those a bit later, but I do provide the code for mine). To reduce our data a bit, let's go ahead and condition on the pitch being a called strike. See below for the code and the plot it should produce:

##condition the above on being a called strike
png(file="location3.png", height=675, width=540)

#subset once so that col and pch line up with the pitches actually plotted
called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3))

legend(-2.95, 0.1, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()



Quick visualization quiz: do you notice any problems (outside of the fact that Yellow just isn't a good choice on a white background)? Before I get to drawing some shapes or lines to indicate a strike zone, I want to talk about customizing the axes. Notice in the above plot that the axes only go from -1.5 to 1.5 on the horizontal and 1.5 to 4 on the vertical. This is because the ranges of our px and pz values are much smaller for only pitches that are called strikes. This makes intuitive sense: called strikes should tend to be closer to the plate on average.

Sometimes, we may not be worried about this, as this range helps to distinguish the points in the plot a little better. However, when doing visualizations we often compare things to one another. Therefore, we want things to be exactly the same, except for the difference we are trying to highlight. Otherwise, the way people look at the plots could be biased. This is true not only when comparing across plots, but also within a single plot. Therefore, we also need to consider our PNG file 'width' and 'height' to ensure that a foot on the X axis is drawn the same length as a foot on the Y axis. If we stretch the image too tall, this could make it look like vertical feet are actually longer than horizontal feet (in fact, this is the case in the plot above!). Of course, we know this isn't how it should look in real life! What to do?

Luckily, R is very flexible and also allows us to control the axes in our plots. For the simple "plot()" function, this is straightforward. For other functions downloaded from the CRAN site (I'll get into this later), this issue is not quite as straightforward. But let's start simple. While I like the above dimensions for a Called Strike only plot, let's put this next to a Called Ball only plot with the same dimensions.

Below I do this using the "xlim=c(,)" and "ylim=c(,)" options for our plotting function. In the "c(,)" portion, set the minimum and maximum values for each axis separated by a comma. Generally, I like to use -2 to 2 for the horizontal axis and 0 to 5 for the vertical axis in my heat maps. In this data and for a simple scatter plot, these dimensions won't cover every single pitch, but they will get pretty close. If you are unsatisfied with the dimensions below, fiddle around with your own...just keep in mind that I have the PNG device set up to make the axes have the same width between tick marks.
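
To judge whether a given pair of limits is wide enough, one option is to count how many pitches fall inside them. A minimal sketch with made-up locations (the vectors below are hypothetical; with the real data you would use shaun$px and shaun$pz):

```r
# Made-up pitch locations in feet (hypothetical values)
px <- c(-2.4, -1.0, 0.2, 1.9, 2.1)
pz <- c( 0.5,  2.5, -0.3, 4.9, 3.0)

# Candidate axis limits, matching the xlim= and ylim= values discussed above
x_limits <- c(-2, 2)
y_limits <- c(0, 5)

# TRUE for each pitch that would be drawn inside the plotting window
inside <- px >= x_limits[1] & px <= x_limits[2] &
          pz >= y_limits[1] & pz <= y_limits[2]

# Share of pitches that the chosen limits cover
share_inside <- mean(inside)
share_inside
```

Points outside the limits are silently left off the plot, so a low share is a hint to widen "xlim" or "ylim".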


##create two plots with the same dimensions
png(file="location4.png", height=700, width=1080)

par(mfrow=c(1,2))


#subset once so that col and pch line up with the pitches actually plotted
called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


#same idea for called balls: subset first, then plot
ball <- shaun[shaun$pitch_result=="Ball",]
plot(ball$pz ~ ball$px, main="Shaun Marcum Called Ball Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=ball$pitch_type, pch=c(as.numeric(ball$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)
dev.off()


In the code above, I already manually calibrated the PNG dimensions so that one foot covers the same distance horizontally and vertically in the visual, given the data range we used (also, for a single plot with these dimensions, you'll want to use "height=675" and "width=540"). This is one of the reasons to be sure to use a graphical device: you can save the size and be sure that everything is comparable later on. Also, remember to set your legend somewhere that it won't overlay your data. In the end, the visuals above look pretty much like we would expect, don't they? Called strikes cluster around the middle, with called balls making a big circle around them.
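
As an alternative to calibrating the PNG size by hand, base R's "plot()" accepts an "asp=" argument (passed on to "plot.window()") that fixes the ratio of y units to x units. A minimal sketch with made-up points; "pdf(NULL)" is used only so the example draws without writing a file, and you would swap in "png()" as in the code above.

```r
# Made-up pitch locations in feet (hypothetical values)
px <- c(-1.2, -0.4, 0.0, 0.5, 1.1)
pz <- c( 1.6,  2.8, 2.2, 3.4, 1.9)

pdf(NULL)  # throwaway device; replace with png(...) to save a file

# asp=1 draws one foot on the x axis the same length as one foot on the y axis.
# R widens one of the axis ranges as needed, so xlim and ylim act as minimums.
plot(pz ~ px, xlim=c(-2,2), ylim=c(0,5), asp=1,
     xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)")

usr <- par("usr")  # axis extents in data units: x1, x2, y1, y2
pin <- par("pin")  # plot region size in inches: width, height

# Feet per inch along each axis; with asp=1 these should match
x_ft_per_in <- (usr[2] - usr[1]) / pin[1]
y_ft_per_in <- (usr[4] - usr[3]) / pin[2]

invisible(dev.off())
```

The trade-off is that "asp=1" may show more of one axis than you asked for, whereas the hand-calibrated PNG keeps the limits exactly as given.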

Returning to just our "Called Strikes" plot, it is always instructive to put some sort of indication of a strike zone on the plot. For just drawing a simple zone, I like to use the book zone. For drawing onto a plot in the window, we can use "lines()" or "abline()" as a beginner step. However, the "rect()" function is a bit more convenient.

For the strike zone, we want to draw 4 different lines: one for the top of the zone, one for the bottom, and one for each side of the plate. Googling a bit will tell you that the width of the plate is 17 inches, or 1.4167 feet. Halfway across we have 0.708333 feet. Therefore we'll want to draw a line at -0.708333 and 0.708333 (the distance to the left and right of the center of the plate where the strike zone ends). The code below uses 0.708335, a difference far too small to see on the plot. In addition, we'll want to draw a line at the top and bottom of the zone, which here I'll indicate simply using the average of each for the entire data set.
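
The arithmetic above can be left to R instead of typing the decimals by hand. A minimal sketch, where the sz_top and sz_bot vectors are made-up stand-ins for shaun$sz_top and shaun$sz_bot, and "na.rm=TRUE" is a precaution assumed here in case some rows are missing:

```r
plate_width_ft <- 17 / 12         # 17 inches expressed in feet
half_plate <- plate_width_ft / 2  # distance from the center to either edge

# Made-up values standing in for shaun$sz_top and shaun$sz_bot
sz_top <- c(3.4, 3.5, NA, 3.3)
sz_bot <- c(1.5, 1.6, 1.6, NA)

# Average top and bottom of the zone, skipping missing values
zone_top <- mean(sz_top, na.rm=TRUE)
zone_bot <- mean(sz_bot, na.rm=TRUE)

# The four edges of the book zone, rounded for display
round(c(left=-half_plate, right=half_plate, bottom=zone_bot, top=zone_top), 4)
```

Storing the edges in variables like these also means the plotting calls that follow do not need to repeat the numbers.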


Let's begin with the "abline()" function. Used with "h=" or "v=", it tells R to draw a straight horizontal or vertical line across your entire plot. After plotting your pitches, you can draw these lines right onto the plot using this function (as well as the others I will go over). Below is some code that will do this, and it also divides the entire plot up into 9 boxes. I also include the line parameters "lty=" and "lwd=". These specify the line type and line width, respectively. The default for lty is "solid", while the default for lwd is 1. We can do a number of line types, including dotted and dashed. Of course, we can again use the color parameter for lines.

##add in a strike zone using "abline()" first
png(file="location5.png", height=675, width=540)

#subset once so that col and pch line up with the pitches actually plotted
called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


abline(h=mean(shaun$sz_bot), col="black", lty="dashed", lwd=2)

abline(h=mean(shaun$sz_top), col="black", lty="dashed", lwd=2)
abline(v=-0.708335, col="black", lty="dashed", lwd=2)

abline(v=0.708335, col="black", lty="dashed", lwd=2)


dev.off()
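
To see what the "lty=" and "lwd=" options look like side by side, a small preview plot helps. The sketch below draws each of base R's six named line types with "abline()" at increasing widths; "pdf(NULL)" is used only so the example draws without writing a file, and the layout choices are arbitrary.

```r
# The six named line types base R understands (besides "blank")
line_types <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

pdf(NULL)  # throwaway device; replace with png(...) to save a file

# Empty canvas: type="n" sets up the axes without drawing any points
plot(c(0, 1), c(0, length(line_types) + 1), type="n",
     xlab="", ylab="", yaxt="n", main="Line types and widths")

for (i in seq_along(line_types)) {
  # One horizontal line per type; lwd grows from 1 to 3.5 for comparison
  abline(h=i, lty=line_types[i], lwd=1 + (i - 1) / 2)
  text(0.5, i + 0.3, line_types[i])
}

invisible(dev.off())
```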



Notice above that for the lines drawn horizontally across the plot, I use the "abline(h=)" version, while vertical lines are drawn with the "abline(v=)". You can already see that the strike zone umpires in MLB call tends to stretch outside the 'book' zone by a few inches. Sometimes, we may not want the strike zone lines to extend throughout the entire plot. For this, we can make use of the "lines()" function just as we do with "abline()". However we'll need a few more specifications included here.

For this function, you'll need to specify not just where the line should be drawn vertically or horizontally, but also the endpoints for the line to stop at. We'll make use of the "c()" command here. For example, the first "lines()" code line says, "Draw a vertical line at 0.708335 on the x-axis from the average bottom strike zone to the average top strike zone on the y-axis and make it black, dashed, and a width of 2".

##use the "lines()" specification
png(file="location6.png", height=675, width=540)

#subset once so that col and pch line up with the pitches actually plotted
called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))


legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


lines(c(0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, -0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_bot)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_top), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)

dev.off()
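
Because "lines()" connects every point it is given in order, the four calls above can also be written as one call that traces the zone and returns to its starting corner. A minimal sketch; the helper name "zone_path" and the zone heights 1.6 and 3.4 are hypothetical stand-ins for the averages computed from the data.

```r
# Hypothetical helper: x and y coordinates that trace the zone as a closed path
zone_path <- function(half_plate, zone_bot, zone_top) {
  list(
    x = c(-half_plate, half_plate, half_plate, -half_plate, -half_plate),
    y = c(zone_bot, zone_bot, zone_top, zone_top, zone_bot)
  )
}

# 17/24 feet is half of the 17-inch plate; the heights are made-up values
path <- zone_path(17 / 24, 1.6, 3.4)

pdf(NULL)  # throwaway device; replace with png(...) to save a file

# Empty plot with the same limits used throughout the post
plot(0, 0, type="n", xlim=c(-2,2), ylim=c(0,5),
     xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)")

# Five points, four segments: bottom, right side, top, left side
lines(path$x, path$y, col="black", lty="dashed", lwd=2)

invisible(dev.off())
```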



While the above code works great, it's not all that efficient. It takes a while to make sure we're drawing everything correctly. Luckily, R also allows us to simply draw a rectangle. Ultimately, the "rect()" function also allows us to fill it with color (and transparent color). However, I'll leave dealing with color and color palettes for another day. Here, we can use it to draw the strike zone in a single line of code. The parameters for the rectangle are similar to how we specified the lines above, but here the four numbers are, in order, the left, bottom, right, and top edges of the box (together they pin down two opposite corners). The other arguments are as I described above, now applied to the border of the rectangle, with "border="black"" saying that we want a black border. I don't post another plot, as it looks exactly the same as the previous one.

##now add in a strike zone using "rect()"
png(file="location7.png", height=675, width=540)


#subset once so that col and pch line up with the pitches actually plotted
called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)", xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)


rect(-0.708335, mean(shaun$sz_bot), 0.708335, mean(shaun$sz_top), border="black", lty="dashed", lwd=2)


dev.off()
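
Since the order of the four numbers in "rect()" is easy to mix up, it can be safer to name them. The sketch below uses the function's documented argument names (xleft, ybottom, xright, ytop); the zone heights 1.6 and 3.4 are made-up stand-ins for the averages from the data, and "pdf(NULL)" is used only so the example draws without writing a file.

```r
half_plate <- 17 / 24  # half of the 17-inch plate, in feet

# Named edges of the zone; the heights are hypothetical values
box <- list(xleft=-half_plate, ybottom=1.6, xright=half_plate, ytop=3.4)

pdf(NULL)  # throwaway device; replace with png(...) to save a file

# Empty plot with the same limits used throughout the post
plot(0, 0, type="n", xlim=c(-2,2), ylim=c(0,5),
     xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)")

# do.call() passes the named edges plus the border styling to rect()
do.call(rect, c(box, list(border="black", lty="dashed", lwd=2)))

invisible(dev.off())

# Quick check of the zone's size in feet
zone_width  <- box$xright - box$xleft
zone_height <- box$ytop - box$ybottom
```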



At this point, I have discussed most of what I had hoped with base graphics and scatter plots. There are a plethora of options I haven't discussed, and I have not gotten to the "lattice" package or the "ggplot2" package which can significantly improve plotting in R. Lattice is particularly useful for looking at multiple plots at once. Those are a bit down the road for now, and I am still a beginner with ggplot2.

Next time, I'll get to making lines using time series type data. From there, my hope is to get into some actual statistical analysis in R and some non-parametric methods (like loess or density estimation). Below, I've posted a pretty version of the code:

#########################################

####Line and ABlines, Points, Shapes, and Custom Axes
#########################################

#setting directory and opening Shaun Marcum 2010 Pitch F/X data
setwd("c:/Users/bmmillsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
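#stringsAsFactors=TRUE keeps pitch_type a factor (R 4.0 and later read text columns as character by default)
marcum <- read.csv(file="marcum10.csv", header=TRUE, stringsAsFactors=TRUE)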
head(marcum)


#subset for main pitches
marcfx <- subset(marcum, marcum$start_speed > 0)
shaun <- marcfx[marcfx$pitch_type %in% c("FF", "FC", "FT", "CH", "CU"),]


##create graphs to review from last time
png(file="ReviewMarcum.png", height=850, width=1500)
par(mfrow=c(1,2))

plot(shaun$end_speed ~ shaun$start_speed, type="n", main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type)
text(shaun$start_speed, shaun$end_speed, shaun$pitch_type, col=as.numeric(shaun$pitch_type), cex=2)

plot(shaun$end_speed ~ shaun$start_speed, main="Shaun Marcum Pitch Speed",
xlab="Speed Out of Hand (mph)", ylab="Speed Crossing Plate (mph)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3), cex=2)
legend(68, 84, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=2)

dev.off()

##create a basic pitch location plot
png(file="location1.png", height=675, width=540)
plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", cex=2, pch=1)
dev.off()

##create a legend and use text
png(file="location2.png", height=675, width=1080)
par(mfrow=c(1,2))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)", xlab="Horizontal Location (ft.)",
ylab="Vertical Location (ft.)", type="n")
text(shaun$px, shaun$pz, shaun$pitch_type, col=as.numeric(shaun$pitch_type))

plot(shaun$pz ~ shaun$px, main="Shaun Marcum Pitch Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=shaun$pitch_type, pch=c(as.numeric(shaun$pitch_type) + 3))
legend(-2.95, 0.1, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##condition the above on being a called strike
png(file="location3.png", height=675, width=540)

called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3))
legend(-1.4, 3.9, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##create two plots with the same dimensions
png(file="location4.png", height=700, width=1080)
par(mfrow=c(1,2))

called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

ball <- shaun[shaun$pitch_result=="Ball",]
plot(ball$pz ~ ball$px, main="Shaun Marcum Called Ball Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=ball$pitch_type, pch=c(as.numeric(ball$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

dev.off()


##add in a strike zone using "abline()" first
png(file="location5.png", height=675, width=540)

called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

abline(h=mean(shaun$sz_bot), col="black", lty="dashed", lwd=2)
abline(h=mean(shaun$sz_top), col="black", lty="dashed", lwd=2)
abline(v=-0.708335, col="black", lty="dashed", lwd=2)
abline(v=0.708335, col="black", lty="dashed", lwd=2)

dev.off()

##use the "lines()" specification
png(file="location6.png", height=675, width=540)

called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))
legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

lines(c(0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, -0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_bot), mean(shaun$sz_bot)), col="black", lty="dashed", lwd=2)
lines(c(-0.708335, 0.708335), c(mean(shaun$sz_top), mean(shaun$sz_top)), col="black", lty="dashed", lwd=2)

dev.off()


##now add in a strike zone using "rect()"
png(file="location7.png", height=675, width=540)

called <- shaun[shaun$pitch_result=="Called Strike",]
plot(called$pz ~ called$px, main="Shaun Marcum Called Strike Location (Umpire View)",
xlab="Horizontal Location (ft.)", ylab="Vertical Location (ft.)", col=called$pitch_type, pch=c(as.numeric(called$pitch_type) + 3), xlim=c(-2,2), ylim=c(0,5))

legend(-2, 5, c("Change-Up", "Curveball", "Cutter", "Four-Seam", "Two-Seam"), pch=c(5,6,8,9,10), col=c(2, 3, 5, 6, 7), cex=1)

rect(-0.708335, mean(shaun$sz_bot), 0.708335, mean(shaun$sz_top), border="black", lty="dashed", lwd=2)

dev.off()


