Thursday, March 31, 2011

Gauging Interest: 2012 Keeper League

So I know it's a bit early--and I'm probably nuts to start up a new keeper league as commissioner in the last semester of my dissertation writing next year--but this is something I've wanted to do for a while.

BUT, I'd love to gauge some interest in a keeper league that will start up in 2012. I'm looking for very serious players who are willing to invest about $350 into the league each year (don't worry, if you have the worst team in the league, it's likely you won't lose more than $120). The reason the fee is so high is that you pay your own auction salary.

As of now, I'm envisioning this as an 8x8 Head-to-Head Each Category league with a $300 cap and minor league players. 6 Keepers per year (plus minor league keepers), Rule 5 drafts, contracts, the works. Sessions will be 2 weeks long, rather than one, and there will be playoffs. 20 teams. You get paid a certain amount for each category win throughout the season (so you're paid marginally based on overall record).

I have linked my current league constitution here.
(Warning: It's a doozy: 8,500 words on 16 pages.)

I'd do things through League Safe, assuming they can handle the complex payout structure. Otherwise, I'll figure something out. The rules are currently not particularly negotiable (with a couple exceptions). Why? Because this constitution is based off a 6-year running league that has run into numerous problems. Each rule in the constitution is there for a specific and important reason that I have had experience with. That doesn't mean that I won't entertain suggestions, but once the league starts, rules cannot be revisited until the next offseason.

Anyone interested please stick your email in the comments with your fantasy experience, expertise, or current field of employment (I'd like to get a mix of strategies) and/or shoot me an email (bmmillsy at umich dot edu).

Data Quality in Pitch F/X

This post stems from a discussion with Mike Fast about quality of the "sz_top" and "sz_bot" variables in Pitch F/X data. I had been using these to designate the strike zone for my calculations in past posts. I want to thank Mike for being generous with his time to answer some of my questions and keep me from publicly writing something stupid.

I was aware that the lines drawn for the Top and Bottom of the zone were somewhat inaccurate. However, one thing I did not count on was that this variation would systematically bias findings in the data across years. As a whole, we would normally expect these measurement errors to be random (for a given player, not across players). In theory, random measurement errors are totally fine. While they make the data noisy, they should not bias our measurements, and with really big data they should be mostly ignorable when we do certain calculations.
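To make the random-versus-systematic distinction concrete, here is a minimal simulation sketch. The "true" top of the zone (3.5 feet), the noise level, and the 0.1-foot shift are all made-up numbers for illustration, not values from the Pitch F/X data.

```r
##simulate random vs. systematic measurement error (all values hypothetical)
set.seed(42)
n <- 100000
true_top <- 3.5    ##assumed "true" top of the zone, in feet

##random error: noisy, but centered on the truth
random_err <- true_top + rnorm(n, mean = 0, sd = 0.1)

##systematic error: the same kind of noise plus a constant 0.1 ft shift
systematic_err <- true_top + 0.1 + rnorm(n, mean = 0, sd = 0.1)

##the random version averages out; the shifted version does not
mean(random_err) - true_top
mean(systematic_err) - true_top
```

With a large sample the first difference lands very close to zero, while the second stays near the 0.1-foot shift no matter how much data you add.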

But over time, this just doesn't seem to be the case. This is the main reason I took down the data from my last post (I'll update it as soon as I can and repost it). The problem seems to stem from a correlation between where the top and bottom of the zone are designated and the percentage of pitches WITHIN the zone that are also called strikes. That's no surprise, and normally I wouldn't worry too much about it: we'd expect it to be simply noise, with some uniform change inside and outside the zone if we change the size of the strike zone.

However, the interesting part is that it seems to have a minimal effect--if any--on the pitches correctly called Balls that are actually outside the zone. I'm still not sure why this is the case. We'd expect that fixing the zone would similarly affect the percentage of correctly called pitches both within and outside the zone (after all, any that are no longer 'outside' the zone MUST be 'within' the zone--though less on the 'outside zone' data because there are more pitches outside the zone than within the zone). The only thing I can think of is that it's a sample size issue: there are many more pitches outside the rulebook zone than inside the zone (just under 3 times as many). But I can't imagine this accounts for such a huge change in one and almost no change in the other.

With that said, I thought I would provide some data for those looking to mess with these variables in the Pitch F/X data. In the file linked here, I have calculated the average Top and Bottom of zone for each player in each year, along with the standard deviation. The data are in both feet and in inches. Below, I also show the range of values for sz_top for Bobby Abreu in 2007, 2009 and 2010 (I skip 2008 for now). Finally, I give a distribution of standard deviations for the measurement by player from 2007 through 2010. From the looks of things, something was changed in mid-2007 about how they designated the top of the zone (notice the bimodal distribution).
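For anyone who wants to build a similar summary themselves, here is a rough sketch of the by-player, by-year calculation. The data frame and its column names ("batter", "year") are hypothetical stand-ins with toy values; only "sz_top" is the actual Pitch F/X variable name.

```r
##toy pitch-level data (values are made up for illustration)
pitches <- data.frame(
  batter = rep(c("A", "B"), each = 6),
  year   = rep(rep(c(2007, 2009), each = 3), times = 2),
  sz_top = c(3.4, 3.5, 3.6, 3.3, 3.3, 3.3,
             3.0, 3.2, 3.4, 3.1, 3.1, 3.4)
)

##mean and standard deviation of sz_top for each player in each year
zone_mean <- aggregate(sz_top ~ batter + year, data = pitches, FUN = mean)
zone_sd   <- aggregate(sz_top ~ batter + year, data = pitches, FUN = sd)
names(zone_mean)[3] <- "top_mean_ft"
names(zone_sd)[3]   <- "top_sd_ft"

zone <- merge(zone_mean, zone_sd, by = c("batter", "year"))

##the same numbers in inches
zone$top_mean_in <- zone$top_mean_ft * 12
zone$top_sd_in   <- zone$top_sd_ft * 12
zone
```

The same pattern works for "sz_bot" by swapping the variable name in the two aggregate() calls.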

Anyway, just a heads up. Like I said, I'm still not clear on why this is systematically changing the Within Zone tabulations but NOT the Outside Zone tabulations. I'll post the file once I figure out what is going on.



Saturday, March 26, 2011

Umpire Strike Call Percent, In and Out of Rulebook Zone

I've finished up some preliminary tabulations for umpire calls within and outside the rulebook zone. Because it's a fairly large table, I'm not going to present it directly on this page. However, you can download the file here.

Keep in mind that these are based on the rulebook zone. The numbers say nothing except how well the umpires conform to that zone. Umpires tend to have their own zones, which are likely well known by the players in the game. Some zones are shifted outside, most stretch a bit beyond the edges of the plate, and so on. Combining these numbers with the visuals in my previous post is your best bet for understanding where the "Incorrect" calls are coming from. Most likely, these are just outside the book zone but within the 2-foot wide zone.

We really don't know WHY the zone tends to extend beyond the plate for umpires (well, maybe someone does, I don't know though). One suggestion is that Pitch F/X measures the center of the ball, so there is about 1.5 inches of ball on either side of that point. That extends the zone 1.5 inches beyond the plate on each side, assuming an umpire calls a strike if ANY portion of the ball touches the black.
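Here is a quick sketch of what that suggestion does to the horizontal edges of the zone. I'm assuming a 17-inch plate, the roughly 1.5-inch ball radius mentioned above, and a horizontal location measured in feet from the center of the plate (named "px" here); the locations themselves are made up.

```r
##horizontal zone edges: center of ball vs. any part of ball (assumed values)
plate_width_in <- 17     ##home plate is 17 inches wide
ball_radius_in <- 1.5    ##approximate radius used in the text

half_plate_ft <- (plate_width_in / 2) / 12
half_touch_ft <- (plate_width_in / 2 + ball_radius_in) / 12

##is the CENTER of the ball over the plate?
center_over_plate <- function(px) abs(px) <= half_plate_ft

##does ANY part of the ball cross the plate?
any_part_over_plate <- function(px) abs(px) <= half_touch_ft

px <- c(0, 0.70, 0.80, 0.90)   ##hypothetical locations, in feet
data.frame(px, center = center_over_plate(px), any_part = any_part_over_plate(px))
```

The pitch at 0.80 feet is the interesting one: its center is off the plate, but the edge of the ball still clips it.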

The rest of it could simply be a perception issue. The ump looks from the center of the plate toward the outside of the plate. Because anything on the corners is viewed at an angle, the umpire makes some sort of guess based on visual cues as to whether it went directly over the plate (they don't have a perfect bird's eye view of every pitch). The question then becomes: how should we evaluate them? The book zone, or a predictable zone for each umpire? I'll leave that question for another day and for someone else to answer.

There is a separate worksheet for each year (2007 to 2010). I might add some other info in the next couple days to the file like # of pitches within and outside the zone, and counts of each designation in the file. There are some extra umpires in the file with no data, and many of these are guys that got some assignments in spring training. The data should be only regular season games. Lastly, keep in mind that the correctly called balls and incorrectly called strikes do not add up to 100%. This is because I left out Pitch Outs and Intentional Balls from the cross-tabulations.

If you use them for any write-ups anywhere, I'd appreciate a cite or link back. At the least just let me know, because I'd like to see what people do with the data just for my own curiosity.

UPDATE: DATA IS NOW AVAILABLE HERE.

Thursday, March 24, 2011

Umpire Strike Zones

Recently, I've been working on a new post for FBJ. Hopefully, that will be ready to go tomorrow, but with the publication of Jeff Zimmerman's umpire projections today, I thought I'd post some stuff here. Jeff makes some cool plots and has some cross-tabulations of umpire strike call percentage for a few years. However, it seems like something went wrong. If you're curious, go over there and also check out the comments.

Below is a quick table of umpires that were behind the plate for at least 5,000 plate appearances from 2007 through 2010 (for which Pitch F/X data is available). From the looks of things, the umpire can have just over a two-run effect on the outcome of the game due to his strike zone (ADDENDUM: MGL correctly points out in the comments that my language is imprecise, and the assumption that the noise is evened out is too strong. I agree he is correct. I should have said that the difference in the data is a bit over 2 runs, NOT that the EFFECT was a little over 2 runs. His suggestion is that the effect is about 0.6 runs. I'll see what other info I can get out of the data.). Of course, we're assuming that umpires are randomly assigned and that the quality of the pitching and hitting evens out over the 5,000 plate appearances, which is a pretty strong assumption. But even if the range of the effect was only a single run, I think this would be pretty significant. The data below is for 2007 through 2010.

FirstName LastName Games PA Strikeout% OBP SLG AVG RunsPerGame
Jerry Crawford 87 6834 16.56% 0.3459 0.4268 0.2639 10.17
Angel Campos 84 6466 18.33% 0.3361 0.4191 0.2658 9.92
Gerry Davis 140 10822 16.68% 0.3354 0.4250 0.2635 9.89
Tim Welke 127 9755 18.60% 0.3324 0.4215 0.2636 9.83
Chad Fairchild 131 10262 18.01% 0.3343 0.4181 0.2617 9.82
Jim Reynolds 130 10079 18.25% 0.3372 0.4234 0.2690 9.74
Tim McClelland 144 11090 16.47% 0.3418 0.4168 0.2660 9.72
Tim Tschida 135 10528 17.31% 0.3413 0.4167 0.2678 9.69
Larry Vanover 133 10202 17.70% 0.3312 0.4153 0.2617 9.68
Sam Holbrook 139 10618 17.48% 0.3345 0.4280 0.2628 9.68
Bill Welke 132 10248 17.94% 0.3357 0.4153 0.2690 9.64
Mike Reilly 138 10705 17.91% 0.3410 0.4241 0.2666 9.62
Randy Marsh 93 7050 15.26% 0.3435 0.4173 0.2671 9.52
Alfonso Marquez 103 8103 16.46% 0.3380 0.4093 0.2609 9.50
Scott Barry 110 8366 16.91% 0.3365 0.4206 0.2608 9.48
Tim Timmons 134 10349 17.61% 0.3314 0.4173 0.2650 9.48
Paul Schrieber 110 8678 16.73% 0.3450 0.4100 0.2610 9.46
Brian Knight 128 9760 17.01% 0.3368 0.4200 0.2646 9.46
Jerry Meals 138 10596 17.57% 0.3322 0.4190 0.2617 9.44
Adrian Johnson 120 9338 17.56% 0.3376 0.4147 0.2601 9.39
Dana DeMuth 141 10871 17.66% 0.3330 0.4060 0.2599 9.38
Brian Gorman 139 10599 17.81% 0.3312 0.4233 0.2657 9.37
CB Bucknor 138 10771 17.44% 0.3361 0.4121 0.2669 9.34
Chuck Meriwether 105 8079 17.45% 0.3296 0.4058 0.2608 9.31
Ed Hickox 105 7955 17.88% 0.3243 0.3943 0.2513 9.31
Eric Cooper 133 10174 17.75% 0.3293 0.4119 0.2643 9.31
Tony Randazzo 102 7881 17.64% 0.3283 0.4246 0.2646 9.29
Marvin Hudson 136 10702 17.81% 0.3336 0.4028 0.2592 9.29
Charlie Reliford 75 5699 17.70% 0.3226 0.3980 0.2558 9.24
Wally Bell 142 10937 18.20% 0.3274 0.4198 0.2593 9.24
Lance Barksdale 139 10545 17.52% 0.3323 0.4062 0.2552 9.24
Greg Gibson 135 10583 17.00% 0.3311 0.4046 0.2568 9.23
John Hirschbeck 81 6167 17.97% 0.3256 0.4106 0.2585 9.21
Dan Iassogna 138 10521 18.40% 0.3345 0.4112 0.2609 9.20
Todd Tichenor 85 6480 17.02% 0.3375 0.4040 0.2628 9.19
Derryl Cousins 139 10809 17.73% 0.3262 0.3952 0.2496 9.18
James Hoye 147 11464 17.81% 0.3295 0.4014 0.2572 9.15
Joe West 142 11016 17.27% 0.3281 0.4067 0.2538 9.14
Jim Joyce 131 10070 16.74% 0.3341 0.4036 0.2599 9.14
Dale Scott 142 10816 18.14% 0.3325 0.4143 0.2623 9.13
Marty Foster 121 9343 18.41% 0.3285 0.4101 0.2584 9.12
Ted Barrett 141 10802 17.79% 0.3263 0.4078 0.2568 9.11
Mike Everitt 143 11021 18.10% 0.3279 0.4114 0.2569 9.09
Kerwin Danley 109 8248 17.34% 0.3359 0.4069 0.2633 9.08
Fieldin Culbreth 142 10848 17.16% 0.3311 0.4175 0.2603 9.01
Tom Hallion 138 10428 18.37% 0.3251 0.4121 0.2561 9.01
Brian Runge 120 9048 18.39% 0.3238 0.4149 0.2590 8.99
Laz Diaz 139 10683 18.41% 0.3234 0.4069 0.2560 8.99
Bruce Dreckman 123 9573 17.05% 0.3290 0.4013 0.2579 8.98
Paul Nauert 137 10471 17.85% 0.3262 0.4146 0.2602 8.98
Gary Darling 131 9874 18.14% 0.3289 0.4100 0.2621 8.96
Mike DiMuro 109 8386 18.28% 0.3219 0.3997 0.2515 8.95
Mark Wegner 133 10173 18.34% 0.3279 0.3991 0.2518 8.94
Phil Cuzzi 138 10492 18.76% 0.3252 0.4067 0.2582 8.93
Angel Hernandez 141 10650 17.29% 0.3279 0.3962 0.2557 8.90
Ed Rapuano 140 10689 17.55% 0.3293 0.4072 0.2579 8.89
Bob Davidson 140 10803 17.40% 0.3307 0.3924 0.2576 8.86
Mike Winters 133 9904 18.35% 0.3302 0.4070 0.2620 8.86
Rob Drake 146 11091 18.86% 0.3231 0.4019 0.2515 8.85
Jim Wolf 133 10133 18.01% 0.3313 0.4078 0.2604 8.83
Hunter Wendelstedt 140 10625 17.37% 0.3258 0.4021 0.2558 8.81
Bill Miller 142 10852 18.69% 0.3186 0.4026 0.2534 8.77
Brian O'Nora 124 9305 17.69% 0.3221 0.4100 0.2571 8.77
Ron Kulpa 130 10016 18.24% 0.3286 0.4033 0.2578 8.76
Jerry Layne 118 9071 17.43% 0.3313 0.4008 0.2525 8.71
Mark Carlson 107 7971 18.15% 0.3266 0.4053 0.2565 8.67
Jeff Kellogg 143 10784 17.23% 0.3291 0.4101 0.2563 8.66
Paul Emmel 134 10107 18.77% 0.3195 0.3924 0.2537 8.65
Chris Guccione 148 11205 17.72% 0.3303 0.3999 0.2578 8.64
Jeff Nelson 123 9399 18.24% 0.3248 0.3997 0.2523 8.63
Gary Cederstrom 138 10387 18.07% 0.3292 0.4031 0.2583 8.62
Doug Eddings 140 10530 18.64% 0.3237 0.4112 0.2596 8.56
Andy Fletcher 117 8930 18.91% 0.3221 0.3852 0.2491 8.20
Mike Estabrook 83 6265 18.13% 0.3200 0.3848 0.2559 7.95
Bill Hohn 91 6618 16.88% 0.3234 0.3965 0.2505 7.91
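
For reference, the "bit over 2 runs" figure is just the gap between the top and bottom of the Runs Per Game column. A minimal check, typing in only the first two and last two rows of the table above:

```r
##runs per game for the two highest and two lowest umpires in the table
rpg <- c("Jerry Crawford" = 10.17, "Angel Campos" = 9.92,
         "Mike Estabrook" = 7.95, "Bill Hohn" = 7.91)

##spread between the highest and lowest umpire
spread <- max(rpg) - min(rpg)
round(spread, 2)

names(rpg)[which.max(rpg)]
names(rpg)[which.min(rpg)]
```

That difference of about 2.26 runs describes the data only. As noted in the addendum, it should not be read as the size of the umpire's effect.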


Anyway, Jeff's post was more about strike calling percentage than anything else. His tables seem strange, and if they're telling me what I think they're telling me, then I don't think they're correct. For example, of all pitches called strikes by the umpire in 2010, I have about 65% of those falling within the RULEBOOK strike zone (that means the edges of the plate, NOT the 2-foot wide zone commonly used for the zone).

PRELIMINARY DATA HAS BEEN REMOVED BECAUSE I'VE SEEN IT ABUSED IN CERTAIN PLACES. PLEASE SEE LATEST VERSION OF DATABASE!

Below, I show a table of a number of things. The first 3 columns show the percentage of pitches within the rulebook strike zone CORRECTLY called a strike. Similarly, the next 3 columns show the percentage that each umpire CORRECTLY calls a ball when it is truly outside the strike zone. I do this for all batters, RHB, and then LHB.


Next, I also tally up the INCORRECT ball and strike calls. So these are the percentages that each umpire calls a Strike on a pitch that is actually OUTSIDE the rulebook zone OR calls a Ball on a pitch that is truly WITHIN the rulebook zone. Again, keep in mind I use the rulebook zone, rather than the standard 2-foot wide zone:

PRELIMINARY DATA HAS BEEN REMOVED BECAUSE I'VE SEEN IT ABUSED IN CERTAIN PLACES. PLEASE SEE LATEST VERSION OF DATABASE!
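
Since the tables themselves are gone, here is a small sketch of how this kind of cross-tabulation can be built. The ten pitches are toy data and the column names ("in_zone", "call") are hypothetical; this is not the removed umpire data.

```r
##toy called pitches: inside the rulebook zone or not, and the umpire's call
calls <- data.frame(
  in_zone = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  call    = c("Strike", "Strike", "Strike", "Ball", "Ball",
              "Ball", "Ball", "Ball", "Strike", "Ball")
)

##row percentages: within each zone designation, how was the pitch called?
tab <- table(in_zone = calls$in_zone, call = calls$call)
pct <- round(100 * prop.table(tab, margin = 1), 1)
pct
```

The TRUE row gives the correct strike and incorrect ball percentages within the zone, and the FALSE row gives the correct ball and incorrect strike percentages outside it. In this toy version each row sums to 100% because there are no other pitch designations.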

I was in the process of also recording the total number of pitches called by each umpire to put it in perspective, but did not have time before posting this. I'll add that stuff later on. I think it's pretty obvious that Barrett doesn't have a perfect call percentage with LHB up to bat.

Anyway, I'll have more on this later. For now, look at the zones below from 2010 for all of the umpires in video format (yeah, yeah, I re-posted it but it sure makes sense to have it in this post as well).

NOTE: I fixed the videos. I was made aware that no one could see them because of Facebook privacy settings. Please let me know if there is still a problem. DUH!

Another Update: I added pitch counts for 2010 to the data tables above so as to keep from drawing big conclusions from small sample sizes. When comparing RHB to LHB, remember that it's pretty common to have the LHB zone shifted outside. Because I have used the BOOK ZONE to gauge 'correctness' of the call, these will be skewed a bit. Also, I am working on getting the tables a bit more manageable for Blogger, which continues to disappoint me with its formatting capabilities.





Tuesday, March 22, 2011

sab-R-metrics Sidetrack: Bubble Plots

While I had mentioned in my last post that I would cover logistic regression next, I decided that a quick interlude on bubble plots would be fun. Bubble plots have become pretty popular recently, especially with all of the Visualization Challenges I've seen around the internet (by the way, I think people in the sabermetric world have a great chance to win some of these, despite the fact that they're generally not baseball data). Ultimately, bubble plots are a good way to present a third dimension on a graph.

Today, I'll talk about doing some basic bubble plots using some Red Sox and Yankees data on attendance and wins over time (click here for the "soxyanks.csv" data link). If you remember my quick tutorial on plotting time series data, I showed how to track wins and attendance over time. However, we often want to include the most information possible on our plots, and that often means presenting a third (or fourth) variable. This makes the 2-dimensional world of plotting more challenging, and that is where bubbles come in (Side Note: It is also why heat maps are so extensively used for Pitch F/X data!).

Okay, so what do the bubbles tell us? Generally, the size of the bubble is meant to represent that third dimension. For wins and attendance over time, it's not straightforward to track these on the same plot. You could normalize them so that they're on the same scale and then plot them together, but this is a difficult comparison over time when something like attendance is growing. Of course, this is a common time series issue that I'm not going to get into on this site: you could take a first difference approach or some other more complicated model, do some smoothing, go into the frequency domain, and so on. But you don't want to hear about unit roots, random walks, and the like. You're here for baseball and fun...right? If you normalize the two variables--just using standard z-scores--you'll end up with something like this:

Bleh. Assuming we think the above plot is useful and want to compare two teams, we probably have to make side-by-side plots. It's easier to compare sometimes when things are on the same plot. So we can represent something like winning using bubble size at each year, with attendance on the y-axis. Let's load in the data and start thinking about our variables and just plot Yankees and Red Sox attendance on the same time plot at first:

##set working directory and load data

setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")


##load data

ball <- read.csv(file="soxyanks.csv", h=T)


head(ball)


##attendance time plot

plot(ball$yank.att ~ ball$year, xlab="Year", ylab="Yankees vs. Red Sox Attendance",
main="Average Attendance Per Game", col="darkblue", type="l", lwd=3)

lines(ball$bos.att ~ ball$year, col="darkred", lwd=3)


legend(1900, 54000, legend=c("Yankee", "Red Sox"), fill=c("darkblue", "darkred"))


Pretty easy to see the general trend in attendance over time, with the usual spikes. However, this doesn't give us much information about the wins of each team over time. We could make a separate plot to compare wins over time for each team. Or, we can represent this new dimension using bubbles at each time point, where the size of the bubble represents the winning percentage of each team in each year.

There are a number of ways to do this in R, and I'll begin with a simple one: using the "cex=" argument to set point size based on some variable. There are some shortcomings with this method, but I'll talk about that later. Beginning with just the Yankees, let's plot some points in addition to our lines (keep in mind this is a starter point--this plot will be ugly):

##plot yankees attendance and wins using "cex="

plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="darkgrey", main="Yankees Wins & Attendance Over Time", xlab="Year", ylab="Average Game Attendance")


lines(yank.att ~ year, data=ball, lwd=2, col="darkblue")

legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="darkgrey", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n")




The legend in the above plot is a bit complicated and is unfortunately the best I can do with this code. Later in this post, I'll show another way to do these based on some code in this tutorial. Honestly, I think my legend is a bit ugly and I'm pretty sure that the "ggplot2" package has a better way. Also notice that I scale the bubbles by the cube of win percent. Normally, I wouldn't recommend doing this; however, because of the small range of win percents, this tends to give more useful size ranges for plotting. If you want to do a simple linear transformation, you can multiply the win percents by a constant instead...or use wins (which is problematic since teams have not played the same number of games for the entire time period). The reason the cubic scaling can become a problem is that we generally want the area of each bubble to be proportional to the win percent. I'll talk about this in a few paragraphs below, but will first talk about some color issues.
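
To see how much the cubic scaling exaggerates things, here is a small sketch comparing it to an area-proportional alternative. The square-root version and its constant of 6 are my own illustration, not something used in the plots in this post.

```r
##compare bubble scalings for a few example win percentages
w <- c(0.350, 0.500, 0.650)

cex_cubic <- 20 * w^3       ##scaling used in the plot above
cex_area  <- 6 * sqrt(w)    ##hypothetical alternative: area proportional to w

##cex sets the linear size of a point, so the drawn area goes with cex^2
area_ratio_cubic <- (cex_cubic[3] / cex_cubic[1])^2
area_ratio_area  <- (cex_area[3] / cex_area[1])^2

area_ratio_cubic   ##.650 bubble vs. .350 bubble under cubic scaling
area_ratio_area    ##same comparison under square-root scaling
w[3] / w[1]        ##the actual ratio of win percentages
```

Under the cubic scaling, the .650 bubble covers roughly 41 times the area of the .350 bubble, even though the win percentage is less than twice as large. The square-root version keeps the area ratio equal to the win percent ratio, at the cost of bubbles that all look about the same size.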

Unfortunately, the bubbles all mesh together in the plot. And I'll use this portion of the tutorial as a lesson in the RGB color scale in R, along with how to work with transparent colors. The RGB scale stands for Red-Green-Blue. It's just like that guy with the insanely deep voice talking about the new Sharp televisions (except they add yellow). So, while we can use the names of colors (and just general numbers for colors), we can also use the RGB scale to make our own colors.

I'll just start with a simple way to work with the color scale. When using this scale, you will need to input an 8-digit hexadecimal string, preceded by "#", in the form of:

col="#00000000"

The first two digits will tell how much Red to put into the color (on a hexadecimal scale from 00 to FF). The second two digits do the same for Green, and the third pair of digits do this for Blue. Finally, the last pair of digits will tell R how transparent you want your color to be. For lots of transparency, you set this number low. For less transparency, you set it high. We can use this to our advantage in the bubble plots so that we can see the outline of each bubble if they overlap. So let's rework the Yankee plot above, but make some transparent colors:
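
If typing the digits by hand feels error prone, base R's rgb() and col2rgb() functions convert between the string and the underlying numbers. A small sketch, where each pair of digits is a hexadecimal value (00 to FF, which is 0 to 255 in decimal):

```r
##build the transparent grey used below from decimal values
##hex 99 is 153 in decimal, and hex 50 is 80
grey_half <- rgb(153, 153, 153, alpha = 80, maxColorValue = 255)
grey_half

##go the other way: break a color string into its components
col2rgb("#99999950", alpha = TRUE)

##convert a single pair of hex digits to decimal
strtoi("99", base = 16L)
strtoi("FF", base = 16L)
```

The rgb() call returns the string "#99999950", so either form can be passed to "col=".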

##now do Yankee plot with transparent colors

plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees Wins & Attendance Over Time", xlab="Year", ylab="Average Game Attendance")


lines(yank.att ~ year, data=ball, lwd=2, col="darkblue")


legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n")




This looks a little better, as you can see the outline of each bubble in the overlapping portions with other bubbles. You can see that the Yankees had a rough decade in the 1970's in both attendance and winning. Their attendance seemed to drop below what a normal trend would suggest in these years, and there seems to be a good chance that this was due to their sub-par on-field performance (remember, we're just speculating here). By this, you can see some advantage to including bubbles for this type of data.

Now, let's go ahead and add the Red Sox data to this plot. I altered the key just a little bit, but still not to my liking:

##have both Yankees and Red Sox on same plot

plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees vs. Red Sox Wins and Attendance Over Time", xlab="Year", ylab="Average Game Attendance")


points(bos.att ~ year, data=ball, pch=16, cex=20*bos.win^3, col="#99000050")


lines(yank.att ~ year, lwd=2, col="darkblue", data=ball)


lines(bos.att ~ year, lwd=2, col="darkred", data=ball)


legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950", pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=2, bty="n")



Here we can see the demise of the Red Sox in the 1920's, as their performance was so bad we can barely see their win bubbles. Red Sox attendance was low at those points, and we see this happen again not long after the WWII attendance bump. Then, when the Yankees start sucking in the 70's, we see the Red Sox attendance rebound a bit as the team improves a little. See how the bubbles help us to tell a story over time.

It's always important to think about the shortcomings of these plots. Obviously, the bubbles are not growing in a linear fashion, and this can be misleading in some cases. In addition, things are a bit crowded. That's not even mentioning that some bubbles tend to be too small, while others are too large. These aren't the prettiest plots in the world, but they're a decent start. I encourage you to try out different data and ways of working with the bubbles on your own.

So, let's switch gears now to some other types of data along with another method of creating bubble plots.

Perhaps we're interested in team home runs, stolen bases, and walks on the same plot. In other words, let's see which teams are more like Adam Dunn and which are more like Juan Pierre. First, go ahead and load in the "teamsdata.csv" file from a previous tutorial.

##read in new data

teams <- read.csv(file="teamsdata.csv", h=T)

head(teams)

For this portion of the tutorial, I'll be using the "symbols()" function, which plots shapes with borders in a plot. Aesthetically, these are prettier. But we'll have to think about a few things before we begin to plot. I am going to take these explanations directly from this fantastic tutorial.

The first thing we'll have to think about is how the sizes of the shapes are determined. By using "symbols()" to draw circles, we are creating shapes using the radius. But we may want the area of the circle to represent the third variable, rather than the radius. Additionally, we'll want to think about which variables we want on the axes, and which one should be used for the size of the bubbles. Sometimes this requires some playing around in R until you get your favorite visual. I go ahead and use stolen bases as the sizer for the bubbles, because it has a large range and bubble sizes vary nicely across teams, and convert my sizes to area.

##make use of area instead of radius for sizing

teams$radius <- sqrt(teams$SB/pi)

head(teams)


##try doing this a different way using team hitting data

symbols(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], circles=teams$radius[teams$Year==2010], inches=0.5, fg="darkblue", bg="#99000070", main="Team Home Runs, Walks & Stolen Bases", xlab="Home Runs", ylab="Walks")

text(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], teams$Tm[teams$Year==2010], cex=0.6, col="black")



Now, the above code needs some explanation. Obviously, I only use teams from 2010. Within the symbols function, we first type what we want on the x-axis, followed by the y-axis, just as in standard, non-formula plot code. Then we specify which symbol we want with the "circles=" argument, which also tells R to size the circles by the radius we created from our SB. Using this, the area of each circle will be proportional to the number of stolen bases by the team.
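
A quick way to convince yourself that the square-root conversion does what we want, using a few made-up stolen base totals:

```r
##check that radius = sqrt(SB/pi) gives circles whose area tracks SB
sb <- c(50, 100, 200)        ##hypothetical stolen base totals

radius <- sqrt(sb / pi)
area <- pi * radius^2

all.equal(area, sb)          ##areas recover the original totals

##four times the steals means twice the radius, not four times
radius[3] / radius[1]
```

If we had passed the raw SB totals as the radius instead, a team with four times the steals would get a circle with sixteen times the area.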

The "inches=" argument tells R how large to draw the biggest circle; the rest are scaled relative to it. Increasing it will make them larger, decreasing it will make them smaller. "fg=" and "bg=" tell R which colors we want the circles to be bordered with and filled with, respectively. I use the RGB scheme to make a transparent red color for filling the circles.

Unfortunately, the "pt.cex=" command in the "legend()" function does not size the points in the same way that symbols does. Similarly, I'm having some trouble creating a legend just plotting single circles in the top right of the plot (the scale is all off). If anyone has any suggestions, let me know. I'd love to hear it.

Qualitatively, though, we can see the outliers in the data. Tampa Bay doesn't have a lot of power (though, a bit above average), but runs a lot and walks a lot. Toronto on the other hand doesn't really steal, doesn't really walk, but mashed a bunch of HR in 2010. The Mariners and Astros were relatively useless in the power and walk department.

As usual, I have the R-code for this post below:

#############################
################Sidetrack: Bubble Plots and Transparent Colors (RGB Scale)
#############################

##set working directory and load data
setwd("c:/Users/Millsy/Dropbox/Blog Stuff/sab-R-metrics")

ball <- read.csv(file="soxyanks.csv", h=T)
head(ball)

##"boring plot"

ball$yank.att.z <- (ball$yank.att - mean(ball$yank.att))/sd(ball$yank.att)
ball$yank.win.z <- (ball$yank.win - mean(ball$yank.win))/sd(ball$yank.win)
head(ball)

png(file="boringplot.png", height=500, width=650)
plot(ball$yank.att.z ~ ball$year, xlab="Year", ylab="Normalized Win Percent & Attendance",
main="Yankee Wins and Attendance Across Time", col="darkblue", type="l", lwd=3, ylim=c(-3,3))
lines(ball$yank.win.z ~ ball$year, col="darkgray", lwd=3)
legend(1900, 3, legend=c("Yankee Attendance", "Yankee Wins"), fill=c("darkblue", "darkgrey"))
dev.off()


##attendance time plot
png(file="simpletime.png", height=500, width=650)
plot(ball$yank.att ~ ball$year, xlab="Year", ylab="Yankees vs. Red Sox Attendance",
main="Average Attendance Per Game", col="darkblue", type="l", lwd=3)
lines(ball$bos.att ~ ball$year, col="darkred", lwd=3)
legend(1900, 54000, legend=c("Yankee", "Red Sox"), fill=c("darkblue", "darkred"))
dev.off()

##yankees only bubble plot
png(file="yanksonly.png", height=650, width=1000)
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="darkgrey", main="Yankees Wins & Attendance Over Time",
xlab="Year", ylab="Average Game Attendance")
lines(yank.att ~ year, data=ball, lwd=2, col="darkblue")
legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="darkgrey",
pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n")
dev.off()


##now do Yankee plot with transparent colors
png(file="yankstransparent.png", height=650, width=1000)
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees Wins & Attendance Over Time",
xlab="Year", ylab="Average Game Attendance")
lines(yank.att ~ year, data=ball, lwd=2, col="darkblue")
legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950",
pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=c(.6,.7,.8,1,1.25,1.5), bty="n")
dev.off()


##create bubble plot comparing yankees and red sox
png(file="YanksSoxBubble.png", height=650, width=1000)
plot(yank.att ~ year, data=ball, pch=16, cex=20*yank.win^3, col="#99999950", main="Yankees vs. Red Sox Wins and Attendance Over Time",
xlab="Year", ylab="Average Game Attendance")
points(bos.att ~ year, data=ball, pch=16, cex=20*bos.win^3, col="#99000050")
lines(yank.att ~ year, lwd=2, col="darkblue", data=ball)
lines(bos.att ~ year, lwd=2, col="darkred", data=ball)
legend(1900, 54000, legend=c(".250 W%", ".350 W%", ".450 W%", ".550 W%", ".650 W%", ".750 W%"), col="#99999950",
pch=16, pt.cex=c(20*.25^3, 20*.35^3, 20*.45^3, 20*.55^3, 20*.65^3, 20*.75^3), cex=1.6, bty="n")
dev.off()


#####Now use team hitting data

##read in new data
teams <- read.csv(file="teamsdata.csv", h=T)

##make use of area instead of radius for sizing
teams$radius <- sqrt(teams$SB/pi)
head(teams)

##try doing this a different way using team hitting data
png(file="teamhittingbubble.png", height=650, width=1000)
symbols(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], circles=teams$radius[teams$Year==2010],
inches=0.5, fg="darkblue", bg="#99000070", main="Team Home Runs, Walks & Stolen Bases", xlab="Home Runs",
ylab="Walks")
text(teams$HR[teams$Year==2010], teams$BB[teams$Year==2010], teams$Tm[teams$Year==2010], cex=0.6, col="black")
dev.off()

Umpire Strike Zones in 2010

I've been working on a new R program that grabs batter-pitcher-umpire level data and creates heat maps for given parameters. My ultimate goal is to create my own function and tool to grab any heat map I'm interested in with a single line of code (sourcing the script, of course). This can be done pretty easily, and below I've presented my first attempt at the function in a movie format.

For the heat maps presented here, I used the 'mgcv' package in R, which runs a binomial GAM model using cross-validation for the smoothing parameter. This is an important inclusion in writing a program to automate the creation of heat maps, as the variability, range of values, and sample size for pitches is different depending on the player or umpire being modeled. Using cross-validation, we can be sure to use some sort of optimal smoothing parameter given the data at hand for each individual umpire. This version of the GAM model actually uses smoothing splines, rather than a loess function, to smooth. The ultimate result is pretty much the same though.
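
For those who want to try something similar, here is a rough sketch of the general approach on simulated data. This is not the actual program behind the videos: the pitch locations, the "true" zone, and the variable names ("px", "pz", "strike") are all made up for illustration, and I just use the default smoothing parameter selection in "gam()".

```r
##minimal binomial GAM heat map on simulated called pitches
library(mgcv)

set.seed(1)
n <- 3000
px <- runif(n, -2, 2)       ##horizontal location in feet (simulated)
pz <- runif(n, 0.5, 4.5)    ##vertical location in feet (simulated)

##made-up "true" zone: strikes likely inside this box, unlikely outside
in_box <- abs(px) < 0.83 & pz > 1.6 & pz < 3.4
strike <- rbinom(n, size = 1, prob = ifelse(in_box, 0.9, 0.05))
pitches <- data.frame(px, pz, strike)

##binomial GAM with a two-dimensional smooth of pitch location
fit <- gam(strike ~ s(px, pz), family = binomial, data = pitches)

##predict called strike probability over a grid and draw the heat map
grid <- expand.grid(px = seq(-2, 2, length.out = 50),
                    pz = seq(0.5, 4.5, length.out = 50))
grid$p <- as.numeric(predict(fit, newdata = grid, type = "response"))

image(unique(grid$px), unique(grid$pz), matrix(grid$p, nrow = 50),
      xlab = "Horizontal Location (ft)", ylab = "Height (ft)",
      main = "Simulated Called Strike Probability")
```

Swapping in real called pitches for a single umpire in place of the simulated data frame is the obvious next step, and wrapping the fit and plot in a function is what makes it possible to loop over every umpire.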

Anyway, check out the videos below. As my next step, I'm working on swing rates, run values, swinging strike rates, home run rates, ball-in-play rates, etc. for players. These are a little trickier given the smaller sample sizes for players, and hence I will likely need to use a standard Gaussian loess function even for binomial data, as there are some serious problems with a GAM and small samples. I've done this already by umpire, by count. I'm not happy with the result of the loess for binomial strike zone calls, as the smoothing stretches way too far and the sample sizes are very small even for this method. They give the general idea of the relative strike zone changes by count (as J-Doug has been writing about at Beyond the Boxscore), but the visual is just misleading with respect to the actual strike zone size.

I've got a few ideas for this stuff, which I may advertise a bit later because I'll need some help to implement any of them. For now, enjoy the little slide shows below. Sorry I didn't provide each PNG file for your own inspection, but there are 78 umpires included in the data set (I removed some with extremely small sample sizes from 2010). Of course, I'm always happy to contribute some visuals to your website if you are interested in these.

In the videos below, the order of the umpires should be the same. Therefore, if you quickly click each one right after the other, they should start at about the same time and you can view RHB and LHB zones for the same umpire at the same time as it scrolls through.

I apologize for the crappy resolution in the videos. Apparently when it was converted it really messed with the quality of the images.

ANOTHER UPDATE: Thanks to the ability to embed a video from Facebook, I was able to improve on the resolution. Hooray for Facebook!