Monday, February 28, 2011

Spring Break, Woo!

I had hoped to add another edition of my sab-R-metrics series before heading off for my spring break, but unfortunately it will have to wait. I'm headed off to Napa Valley starting tomorrow and will not return until Monday night. Therefore, no new sab-R-metrics until next week.

I'm pretty excited to continue the series, covering some actual analysis of data from here on out using some cool techniques. I recently got a new laptop that has gotten me even more excited about implementing R for Pitch F/X data. With 8GB of RAM and an Intel Core i5, I'm able to load in over a million pitches at once. No small feat for R, which holds data in the active memory (one of its drawbacks). Revolutions is working on improving R for big data, but at this point it is limited to somewhat basic regression analysis. Unless you know how to run parallel processing, R is not yet optimal for processing big data sets.

BUT, the new computer seems to be doing well and I'm hoping to show it off in the coming months on this site. If anyone knows some fun things to do in the Napa Valley, Sacramento and San Francisco area, please tell me in a comment. I'd love to hear about things I should NOT miss while out there.

I'll Try to Be Gentle...

This morning, I was checking out Fangraphs fantasy and ran across an article by Zach Sanders discussing fantasy value over replacement concepts. It was quite a familiar post, and I could have sworn I had seen it before in almost exactly the same form. I love Fangraphs. Love, love it.

But I think someone had to call foul on this one (and Eriq did). Certainly it could be a coincidence, but this isn't the first time I've seen ideas from FBJ almost literally copied and pasted at other sites without attribution (Eriq and I have talked about this before). The Hardball Times posted an article regarding xADP about a week after we mentioned a forthcoming article at FBJ that would investigate this issue (with the same name). There have been other instances I won't mention, and of course it all could simply be honest independent thinking at the same time. After all, all of us baseball stat and fantasy nerds are usually thinking along the same lines.

Yes, certainly there are coincidences. And certainly I'm flattered when someone uses my work here at POS (yeah, didn't think out my acronym very well) and FBJ and gives attribution for it. In fact, whenever I see a link back to me, I try to thank the person for doing so, either in the comments or through email. Ricky Zanker beat me to the punch on doing an R tutorial series for sabermetrics about a week after I mentioned it here. BUT, he also has consistently referred back to my site (and I to his). Similar work in these areas is greatly complementary, and I welcome it! So here is an open thank you to: Dave Allen, Gas House Graphs, Harry Pavlidis, Bay City Ball, J-Doug, Jason Rosenberg and anyone else who has referenced the fun I have here at my site.

But these fantasy posts are particularly egregious. There needs to be at least some due diligence in attribution of ideas or--if it really is original and independently thought up work--a search for similar studies across the internet.

The editors at these sites need to be much more careful. I'm all for free press and free information being sent in seconds across the interwebs. But correct attribution IS an issue I worry about at times. Whether or not it's more apparent on the web than in print is another question (and a tough one to answer). In fact, I've had suspicions about academic articles that come eerily close to some preliminary sharing on my blog of cool stuff I'm working on. I no longer share any academic work here...

So let's all just do our due diligence on this one, okay?


Addendum: I received a thoughtful email from Zach Sanders (who wrote the Fangraphs article), and I appreciate his reaching out about the issue in this post. I would like to state that my intention was not to point fingers at anyone specifically, but to bring attention to an issue that can arise in a world of blogging, journalistic writing, and analysis of similar topics. Below is Zach's email to me (posted with his permission):

"
> Brian and Eriq,

>
>
> Someone pointed me to the articles on Prince of Slides and Fantasy
> Baseball Junkie that mention my post today on FanGraphs.
>
>
> I would like to state that I did not know of your article before
> hand, and certainly would never copy another writer's work. I take
> any accusations of this very seriously, so I thought it was
> important to reach out to you and touch base.
>
>
> If you have any questions, let me know.

............

>Brian,
>
>Thanks for understanding, and I completely get that you weren't trying to attack me. This is a >very sensitive issue, especially around the blossoming semi-anonymous internet writing world, >and with your past experience I can sympathize with your thoughts.
>
>Feel free to use my email as an addendum to the post.
>
>-Zach Sanders"


I enjoy Zach's fantasy pieces and love Fangraphs. I only mean for this post to raise awareness about doing your research as thoroughly as you can. I was not, and am not, accusing anyone of copying anything. I appreciate that Zach has been polite and gracious in our communication.

Wednesday, February 23, 2011

sab-R-metrics: Basic Applied Regression (OLS)

Today, I'll again be using a new data set that can be found here at my website (called 'leagueoutcomes.csv'). The data set includes the standings results of the 2009 season for MLB along with average game attendance by team. I'll use this to go over some basic regression techniques and tools in R. Hopefully this tutorial will help those with some more statistical background. Those looking to use fun data to learn things like Logistic Regression, Probits, and Non-Parametric smoothing methods should use this to get acquainted with the R fitting procedure and come back later for those tutorials.

Before doing any analysis, it's always a good idea to look at the data and make sure what we're doing makes sense. For this, we can create histograms and summarize the data in order to look at the distribution (strictly speaking, standard regression assumes normally distributed errors rather than normal data, but for our purposes I won't worry too much about this here). Let's load the data and check out some of its properties. All of the code should be familiar from previous sab-R-metrics installments:

##set working directory, load data, and snoop around
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")
league <- read.csv(file="leagueoutcomes.csv", h=T)

head(league)
summary(league)

hist(league$MLB.attend, col="steelblue", main="Game Attendance")
hist(league$MLB.win, col="pink", main="Win Percent")

plot(league$MLB.attend ~ league$MLB.win, main="Team Attendance and Winning", xlab="W%", ylab="Attendance Per Game", col="steelblue", pch=16)
text(league$MLB.win, league$MLB.attend, league$MLB.team, pos=1, col="darkgrey")



As you can see in the scatter plot, teams with better performance have more fans in attendance. Nothing surprising, with the Yankees at the top-right and the Pirates at the bottom-left. We can also see that the data tend to be a bit skewed for something like attendance (see the histograms), which is not surprising considering it is bounded at 0. A log transform may be appropriate here, but we'll ignore this for now to focus on the functionality of R as a regression tool. In addition, there is also some censoring going on, since attendance cannot exceed stadium capacity (this is especially true in a league like the NFL, where sellouts are a regular occurrence). I won't be getting into much about the assumptions behind regression. There won't be any linear algebra or calculus here, and I assume that those reading this have some idea of what regression is and when to use it. Of course, I'll throw in some assumptions of my own (and some "watch outs", like always). Again, I will ignore the censoring for now; perhaps I'll visit censored and truncated regression on another day (which I find easier to fit in STATA anyway).
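
If you want to see what that log transform looks like, here is a quick sketch using the same data set (optional exploration, not part of the main example):

##optional: eyeball a log transform of attendance
hist(log(league$MLB.attend), col="steelblue", main="Log of Game Attendance")

##regression on the log scale: the coefficient is now roughly a
##proportional change in attendance per unit change in W%
fit.log <- lm(log(MLB.attend) ~ MLB.win, data=league)
summary(fit.log)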

Let's begin with a basic regression using a single predictor (OLS). In R, we can run a regression using the command "lm()", which of course stands for "linear model". Just like when plotting a Y variable against an X variable, we write an equation within this function. I'm interested in the relationship between winning and attendance per game. So, first let's calculate a correlation, then use those variables in a regression of Attendance on Win Percent for MLB.

##correlation and regression of attendance and winning
cor(league$MLB.attend, league$MLB.win)

fit.mlb <- lm(league$MLB.attend ~ league$MLB.win)
summary(fit.mlb)

As you can see, winning seems to be a significant predictor of attendance in MLB...no surprise there. The correlation is about 0.68, while the regression coefficient suggests that going from a win percent of .000 to 1.000 would result in about 83,436 additional fans per game. Of course, the intercept doesn't make much sense in our example (and neither does an 83,000-fan increase, with most stadiums having a max capacity of around 40,000 fans).

This is a result of going beyond the data. No team had either a 0.000 W% or attendance of 0. Extrapolating beyond the data is bad juju for the most part. When there really are sellouts, I'd suggest toying around with a Tobit model, too. You can always force a regression to go through the origin or transform the data in some way, but there are some issues to deal with there.
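
For the curious, forcing the line through the origin is a one-character change to the formula. This is shown purely for illustration; I'm not endorsing it for these data:

##regression through the origin (intercept suppressed)--illustration only
fit.origin <- lm(league$MLB.attend ~ league$MLB.win - 1)
summary(fit.origin)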

So that the regression coefficient makes a little more sense, let's put Win Percent on a scale of 0 to 1000, rather than 0 to 1. Just do a quick transform of your data and rerun your regression with the transformed predictor variable:

##transform win percent

league$MLB.win.new <- league$MLB.win*1000

fit.mlb.2 <- lm(league$MLB.attend ~ league$MLB.win.new)
summary(fit.mlb.2)


Now, each point of win percent (0.001) is associated with an increase of about 83 fans per game. Let's go a little further with the interpretation. We know there are 162 games in a season, so each win is worth about 0.006 in win percent (about 6 points on our new scale). It seems, then, that on average an MLB team can expect about 512 extra fans per game for each additional win.
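
If you'd rather let R do that arithmetic, you can pull the slope right out of the fitted object (a small sketch; the [2] index assumes the model above with a single predictor):

##fans per game associated with one win in a 162-game season
b <- coef(fit.mlb.2)[2]   #slope: about 83 fans per point of win percent (0-1000 scale)
b*(1000/162)              #one win is 1000/162, or about 6.17 points: roughly 512 fans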

But be careful! This does not mean that winning causes attendance to increase. In the long run, it very well may be the case that attendance and the demand for winning result in these teams having a higher win percent. I believe I talked about this last time, but the relationship gets murky. If you're interested in academic work trying to tease out the causal effect, check out the VAR analysis by Davis in IJSF (2008). But that's just the start of it. There are plenty of omitted variables in our simple regression--not to mention that we're aggregating across cities with very different demographics--so making any strong inferences is not a very good idea. And don't forget that things change over time: this analysis is only from the 2009 season.

Okay...enough with the standard "yeah, buts". They could go on forever.

One of the nice features of R is that each fitted model can be saved as an object. Notice that I assigned my regressions to the names "fit.mlb" and "fit.mlb.2". To summarize (i.e., to get a regression table like you would see in SPSS), we just use the "summary()" function with the object name. You can also use "print()" or the function "display()" from the "arm" package, which Gelman and Hill recommend in their multilevel models book (much recommended for a surface-level review of regression in the early chapters, and of course a great applied text for multilevel modeling). Really, it's up to you and what you think provides the information you want. We can also use this object-oriented language to our advantage when we want to plot our regression.
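
Since the fitted model is just an object, you can also grab individual pieces of it directly. Here is a small sketch (the "arm" lines are optional and assume you have installed that package):

##poking at the model object directly
coef(fit.mlb)      #just the intercept and slope
confint(fit.mlb)   #95% confidence intervals for the coefficients

##optional: Gelman and Hill's compact summary
#library(arm)
#display(fit.mlb)

But first, let's predict our attendance data.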

Sometimes, we want to use regression not only to estimate a coefficient on some predictor variable, but also to predict future or unobserved data. Now, forecasting with regression is an entirely separate topic (talk about lots of "yeah, buts"), but it's usually instructive to get our predicted values for the sake of understanding the error distribution from our standard OLS model. To do this, we use the appropriately named "predict()" function, which can be used for a number of other models (such as classification trees) built with R. To easily keep track of which prediction is which, appending them to the end of the data set can be useful. We can also use this to calculate errors simply by subtracting the actual value from the predicted (or the predicted from the actual value). For those who like shortcuts, you can just use the "resid()" function, with the model object within the parentheses.

I'll just use the original "fit.mlb" for this so that the plots are more intuitive later on (on the scale we usually think of Win % in). Lastly, I'll show some code to plot the errors to check for systematic patterns in our errors (which we do not want). If you're familiar with regression, you know that you want to see the errors in the last plot be generally random around 0 across the range of the x-values (W% here).

##predict and append fitted values and calculate errors
league$MLB.pred <- predict(fit.mlb, league, type="response")
league$MLB.err <- league$MLB.pred - league$MLB.attend

##or just simply use the "resid()" function (note that resid() returns
##actual minus predicted, so the signs will be flipped relative to MLB.err)
residuals.mlb <- resid(fit.mlb)

##plot errors to check for systematic patterns
plot(league$MLB.err ~ league$MLB.win, main="Error Distribution (MLB)", xlab="Team W%", ylab="Residual", pch=16, col="steelblue")
abline(h=0, lty="dashed", col="darkblue")
text(league$MLB.win, league$MLB.err, league$MLB.team, pos=1, col="darkgrey", cex=.7)


In the above plot, we can see that the Marlins' predicted attendance, based on team quality, is well above their actual attendance (in other words, they had crappy attendance considering their relatively decent win percent), while the Mets' prediction is well below their actual attendance (they draw many more fans than the model says they should, given their terrible quality). Of course, there are some omitted variables in this regression. Overall, there doesn't seem to be too much of a pattern in the residuals. The only problem I see is that the teams at the edges of the win percent distribution tend to be systematically under-predicted. It's really tough to tell a pattern with such a small data set, though. We'll assume everything is just dandy for now, despite the fact that it may not be.

There are a number of other diagnostic options in R for before and after regression, including Quantile-Quantile plots ("qqnorm()" with "qqline()" for checking residuals against the normal distribution, or "qqplot()" for comparing two samples). I could go on forever about this stuff, but I'll leave it up to you to decide what is useful for you to learn, given your data and interests.
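
As a quick taste (a minimal sketch using only base R), the following checks the residuals from our model; calling "plot()" on an "lm" object produces a standard set of diagnostic plots:

##quick residual diagnostics for fit.mlb
qqnorm(resid(fit.mlb))
qqline(resid(fit.mlb))

##the standard lm diagnostic plots on one device
par(mfrow=c(2,2))
plot(fit.mlb)
par(mfrow=c(1,1))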

Now that we have fitted values and a regression model object for each league, let's plot our regression line in the scatterplot that we created early on for MLB. This is quite easy using the functions that we've already learned, and for adding a regression line, all that is needed is the "abline()" function. With multiple regression, this won't necessarily work, though, since we have to tell R which predictor variable to plot the line for. I'll cover multiple regression next time. Below I have the scatter plot with the added regression line, along with some simple coding:

###adding regression line to scatter plot
plot(league$MLB.attend ~ league$MLB.win, main="MLB Attendance and Winning",
xlab="W%", ylab="Attendance Per Game", col="steelblue", pch=16, cex=1.4, xlim=c(.35, .65), ylim=c(15000, 50000))
text(league$MLB.win, league$MLB.attend, league$MLB.team, pos=1, col="darkgrey", cex=.7)
abline(fit.mlb, lty="solid", col="black")



******SIDE TRACK******

Before I get to plotting the confidence and prediction intervals, I'm going to sidetrack a bit and talk about installing packages. The R network has a package for just about anything you want to do. Thus far, we have only used the base version of R, but there are plenty of free add-ons. For our plots, I want you to go ahead and download a package that will easily plot prediction intervals from your regression.

To begin, go up to the menu at the top of your R window and click:

Packages ==> Install Package(s)

From there, choose a location near you and click OK. Next, you'll be prompted with a big list of packages. Usually, you want to know what each one is before downloading, and you can find this at R's main website. However, just go ahead and download the "HH" package.

R should automatically find a place for this package on your computer so that it can source it later on. In your R script, type:

library(HH)

This will have the package ready to go in R. Be sure to use this command whenever you want to use a function that isn't in base R.
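
If you'd rather skip the menus, the same thing can be done straight from the console with standard base R commands:

##install once, then load in any future session
install.packages("HH")
library(HH)

Now, let's put this to use.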

******END SIDE TRACK******

Here, we'll use the "ci.plot()" function, which is relatively straightforward. Use the "help()" function for any added flexibility you desire. I just discovered this function, and it's an easy way to plot confidence intervals and prediction intervals for your data. If you want full flexibility in your plot, you'll want to do this manually (again, I'd recommend Gelman and Hill's book for learning the basics of calculating these intervals yourself). I'm not going to provide the code here for doing it manually, as the function below works relatively well for quick plotting purposes. In general, I don't like the formatting of the legend, etc. However, it's so quick and easy!

##using ci.plot for regression line plot with 95% confidence and prediction intervals
library(HH)

ci.plot(fit.mlb, xlab="Win %", ylab="Game Attendance", main="ci.plot() Version of OLS Model Plot",
col=c("steelblue"), pch=16, cex=1, xlim=c(.35,.65), ylim=c(15000,50000))



You can see that there is more uncertainty near the edges of the range of our X-values. That is expected, as there are fewer data points at the extremes. Notice the upward trend line as well. This confirms our intuition, our correlational analysis, and our simple regression coefficient for win percent: winning and attendance are positively related.

The "lm()" function works for multiple regression as well. Unfortunately, the plots become much more complicated. For this, I'll introduce a scatterplot matrix in my next post, along with a multiple regression that includes an interaction term and a factor variable.

I'll cover logistic and probit regression in R using "glm()" following the multiple regression example, and then finally get to loess smoothing using the "loess()" function and related tools like "gam()" (all very handy in sabermetric analysis). For those unfamiliar with the basics of statistics and regression analysis, check out the many free tools online or the R books available at Amazon and elsewhere. I'll again assume that those reading this have some idea of why you would use these tools.

As usual, I have my pretty R code below:

Pretty-R is offline right now. I will post it here when it's back up and running.

Wednesday, February 16, 2011

Non-Sports, Non-R Potpourri

Today, I'm just here to relay a few thoughts on 3 interesting topics I've run across lately. Here it goes. Brain dump time.

1. Why is there a graduate student union ("GEO") at the University of Michigan?

I'll begin here by saying I'm not a fan of unions, and I'm also not a fan of receiving snotty emails when I ask the GEO officer here a question. Last fall, my department asked me to teach a new experimental class (I am not required to teach at all). I said sure, as I needed the experience, and this was under the assumption that I would be paid the same rate (which, really, is a pay cut, since I was still expected to do all my other work). However, despite the fact that my pay did NOT depend on my teaching a class, I was required to pay $157 in union dues. The GEO has a contract with the University that all Grad Student Instructors must pay dues, whether they are part of the union or not. Okay, fine. But when I emailed the rep to tell them that I had no interest in representation and that I technically was not funded as a GSI, I got a snotty email response. I refused to give them any information related to my pay or my account at that point. They responded by automatically deducting the dues from my paycheck with no notice.

So the above is just to tell you that I'm already biased against the GEO and its functionality. But honestly, from a completely neutral standpoint, it makes little sense to have a union for graduate students. The only reason I can see the requirement for paying dues being important is that, with short-term employees, there is not much incentive to increase pay and benefits for the future...there is no future employment with the organization. However, let's think a little deeper as to why anyone would have bothered in the first place.

When I was accepted into the PhD program here, I was sent a letter detailing my pay and benefits. I could have said "No", but I took on the low pay for being able to build a career that I love (yea, I get to sit in my office and think about baseball). If the money wasn't enough, then the University likely would not be able to get talented graduate students to sacrifice the time and money while here. Research production would go down (there are PLENTY of labs where the only work coming out of them is by graduate students with the adviser on the paper for name recognition).

But there's more. Not only could students turn down the offer, but for the most part (and especially at a place like Michigan) graduate students are very skilled, multi-talented people who would draw interest elsewhere. Auto workers have a union (the UAW) because they have a single skill that is not very marketable elsewhere. That's not to say it's not an important skill--it's one I certainly do not have--but it is a skill which others can learn. Major League Baseball players also have a union because playing baseball is not a very marketable skill outside of...well...baseball. There is little alternative for baseball players to earn the type of MRP they produce in baseball. Maybe selling cars at a used car dealership would net them $50K a year? Banding together and having a union makes sense here--especially in the face of hilariously obvious collusion by owners.

But this isn't the case for most graduate students. Grad students have more general skills--hard work, motivation, mathematical ability, managerial/leadership potential--that could earn money elsewhere. Guess what. If grad students weren't getting enough money, they'd just go elsewhere. Remember that there is a sacrifice now for increased utility in the future (I emphasize utility because college faculty positions aren't exactly big paydays).

There's another problem with justifying a union for graduate students: it reduces flexibility. Technically, I'm only supposed to be working 20 hours a week on research, and I'm paid accordingly. I have worked some consulting jobs in 'spare time' for extra money, but if I were part of the GEO, they could technically limit this, since my work was with professors here. Additional research projects are limited so that graduate students aren't "worked too hard". Graders are only allotted a certain number of hours for which they can be paid--whether this is the union's doing, I am not sure. But in general, a union reduces the flexibility to work extra or work hard, because--I guess--it looks bad for the lazy members of the union.

I don't think I know a single person that supports the grad student union. Honestly, those running it must have some self-important view on what they are actually doing in the world. If that offends anyone that I know, I apologize and I invite you to tell me why I am wrong. But striking (i.e. not teaching classes) in the face of $50 a year raises or whatever certainly seems like a slap in the face to what students are here for in the first place: advancing education.


2. Are 'local currencies' really successful?

A friend of mine on Facebook recently posted this link asking for support of a movement to start a "Bnote" in Baltimore. Now, I think it is a very interesting concept, one which has apparently picked up a lot of steam in the U.S. and in Europe. I want to leave Europe and the EU aside for now, as they have some very different views on price and money concepts than the U.S. I'm curious what others think of having Disney Bucks for a city. The claim is that it keeps money inside the city, rewarding those who use them by giving them 11 Bnotes for every $10 US. Stores accept Bnotes at face value, so the idea is that you get an extra $1 of spending power. Since they can only be spent within the city, the money 'doesn't leave'. But I'm extremely skeptical, for a number of reasons. The first is that this $1 incentive will likely be circumvented by store owners simply raising prices by some weighted amount of expected Bnotes to expected US$$, ultimately reducing any sort of 'deal'. I mean, do small businesses make a 10% profit to cover that extra $1?

Otherwise, here are some concerns/interests I have with local currency for inducing economic growth:

A. With something like this, transaction costs are greatly increased--enough that they very well could offset any economic growth (a speculation).

B. The 10% incentive for exchange may be enough for some people to use them, but I think there is a serious problem: stores within the community very likely don't pull a 10% profit on sales. If they're accepting $10 for a book that could be sold for $11, with a cost of $10.01, then it's a problem.

C. Any claim of regional economic strengthening should be tempered by the fact that communities that do this likely would have improved anyway, due to the activism within the region that made it happen. Therefore, the question is "Are Bnotes a better alternative to other strategies?", rather than "Do cities with community currency improve?". And that's the research I'd like to read (not the Googled research by Local Newspaper X).

D. There is in fact a disincentive to use Bnotes: the loss of flexibility.

E. There is a real cost (outside transactions costs) to exchanging $$ for Bnotes. These include: walking to the Bnote exchange, standing in line, carrying cash rather than using a card...

F. Given the vastly decreased use of material cash in the US, are Bnotes available electronically? If not, then this only increases the costs to using them (carrying them, running out of cash).

G. The people using Bnotes likely already live in the region and spend their money there. The people working (but not living) in the region are still paid in dollars and take that outside the region. Unless the Bnotes are specifically used for Business-to-Business transactions to keep these $$ within the city, I see little change. However, restricting B2B transactions to within-city for ANY employer would seem an extremely inefficient way to do business. Inefficiency is okay if people are willing to pay the cost and forego goods they want from outside the area. But the question is: what are they willing to forego?



G is the biggest concern of mine. Just like with taxation, we have to understand what people view as the optimal 'values' in order to assess their willingness to give up efficiency. My guess is simply this: the people using Bnotes are those who would have spent their money within the city limits anyway. Therefore, these local businesses are giving up $1 per $10. But I'm curious to hear what those with experience in this area have to say. Please keep the rhetoric to a minimum.


3. Should teachers be able to anonymously post concerns and frustrations on a blog?

A hat tip to Tango for this link. (Another note: I'm not anonymous here.) As someone who did teach a class here at Michigan, I must say my experience was much the same as the teacher's in this article. In fact--as I commented at The Book Blog--I had students complain about being graded down on an essay question because they answered it in bullet-point form. I'll also note that the bullets weren't in complete sentences. I'll also note that one student came into my office to tell me (word for word) that my "Test is bullshit". Lastly, I'll note that the average grade was a B, not some ridiculous F-average.

I'll begin by saying that I've noticed this general attitude at U of M. I couldn't care less about the engagement complaints--that actually is something up to the teacher, and something I did not do well in my first run at teaching. I'm okay with that, and I look to improve in the future. My focus, however, is the 'taking responsibility' aspect of student attitudes, a lack that is pervasive across all levels of education. I've never experienced it more than I have here at U of M.

Interestingly, there really is a sense of false entitlement that fills the air here. It begins with describing someone like Rich Rodriguez as "Not a Michigan Man". It applies both to academics and to sports. People from Michigan see U of M as an Ivy League school and laugh at those attending Michigan State. Those from the east coast laugh at this. Of course, the truth is that the quality is somewhere in the middle of these two extremes. It's also true that Michigan State, on the whole, is a good school. There are plenty of extremely smart people both there and here.

But the attitude seems to be: "Well, I got into Michigan, so obviously I'm a straight-A student. If you don't give me an A, then there must be something wrong with you." But then what's the point? Shouldn't we differentiate students from one another, given that they all go to Michigan? Of course we should. Unfortunately, the students don't see it this way and ultimately get offended that anyone would dare to give them a B+. If that's the lowest grade, then it's simple grade inflation. All that needs to be done is to communicate to all teachers that a B+ is now equal to an F. Since faculty are essentially the employees of the students (they pay tuition, after all), it seems reasonable to have some level of grade inflation without sacrificing information, in order to give the customers what they pay for: a degree.

Grade inflation is one thing, and it can be a problem. But it's not a problem if you can still differentiate students (honestly, the problem is sometimes overblown--really it's an issue of relative information loss, and that information can be preserved without giving out F's). However, I'm not convinced that the current grade distribution at Michigan gives enough information about the spread of abilities.

There are a lot of very smart kids here, which means the generalization about entitlement certainly does not apply to everyone, or even nearly everyone. But if a teacher vents frustration about it anonymously, then what's the problem? My question is how students found out who was writing the blog. That one is on the teacher's own stupidity, and it's only going to backfire when she is back in the classroom. Should she lose her job over this? No. Should she be surprised that parents and kids are pissed off that she's calling them lazy morons? No. Both sides need to take some accountability. But free speech is free speech, as long as it's not libel. My guess is that the teacher's frustrations are fairly warranted, so I doubt the suggestion that it's all made up.

Okay, rants over. I'd love reasoned comments. Hopefully, I'll have another sab-R-metrics piece up next week. I'll be out of town this weekend for a wedding food and cake tasting. Mmmmm.

Sunday, February 13, 2011

sab-R-metrics: Displaying Line Plots and Time Series Data

It's been a while since I've had the chance to add anything here. Last time, I left everyone with some scatter plots and some customization tools for your graphics. This week will be a bit more brief than the last few tutorials: what I'd like to do is show you how to display line graphs for time series data. For this, I'll be using New York Yankees attendance and win percent from 1903 through 2010. I grabbed the data from Sports Business Data, and you can find it HERE at my website (along with the data from the other tutorials here at the site).

Go ahead and set your working directory to the one you prefer and put the data file in that folder. We'll just name our data "yanks":

##load data (already set working directory)

yanks <- read.csv(file="NYYapg.csv", h=T)
head(yanks)

You can see here that there are three (3) columns of data in the file: year, win, att. These are straightforward: 'year' is just the year of that season, 'win' is the Yankees' win percent in that season, and 'att' is the average per game attendance for the Yankees' home games that year.

Now, we could always go back and try some scatter plots with this new data. Say, for instance, we're interested in the relationship between winning and attendance; we could simply plot:

#plot relationship of winning and attendance

png(file="winatt.png", width=600, height=500)

plot(yanks$att ~ yanks$win, xlab="Yankee Win Percent", ylab="Yankee Per Game Attendance", main="Yankee Win Percent vs. Attendance", pch=16, col="darkblue")

dev.off()


Now this gives us an okay look at the relationship, and it tells us pretty much what we'd expect: winning teams tend to get more fans (see the upward slope of the points from left to right). On the other hand, the relationship is a bit of a mess: attendance was just generally lower early in Yankee history than it is now, thanks to generally increasing demand for baseball over the years. And that's not to mention the fact that the data are capped at the sellout point, which may mask some of the relationship in this simple look.

Also, there is some ambiguity about the direction of causality between winning and attendance. Of course people come to the game to see the team win, but the team also wins through investment, and in the long run the causality can be seen as going the other way around. Each of these is a serious econometric issue in sports economics and sport management that I don't plan to get into on this website.

Back to R. Here, I want to get into plotting lines. The best way to start is to use time series data. So instead of plotting these two variables, let's separately plot each one across time. This is pretty simple, and we'll begin again with the scatter plot of each across time:

##plot variables as function of time
png(file="winattbyyear.png", height=800, width=1400)

par(mfrow=c(2,1))

plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", pch=16)

plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", pch=16)

dev.off()


Now this gives us a nice little picture of how attendance has increased over time. Obviously, average Yankee attendance wasn't the same in 1921 as in 2009, despite nearly identical winning percentages. This is why you have to be careful when looking at relationships over time. But we can do better than putting points on the plot like this. Looking at the win percent plot, it's difficult to gauge any patterns or cycles in the Yankees' winning. Usually a line plot is a better way to go about this. There are two easy ways to do it, and I'll start by adjusting our "plot" function:

##draw line plots of each over time
png(file="winattbyyearLINE.png", height=800, width=1400)

par(mfrow=c(2,1))

plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", type="l", lwd=2)

plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", type="l", lwd=2)

dev.off()


As you can see in the code above, I identified the plot type with 'type="l"' (that's a lowercase L) to tell R that I want it to make a line out of the data. In addition, I used "lwd=2" to make the lines a bit thicker (the default is 1). However, R has a nice, easy way to plot time series data without having to specify a line plot. We can simply use "plot.ts", and R will understand what we're doing. The code below should generate the same plots using this new function. You can see that there is no formula for the plotting here, just the dependent variable (the one you want to plot across time). The disadvantage, however, is that you need to customize the x-axis for it to show the year tick marks instead of the observation numbers (I'll show one way around this just after the code). I prefer to use the simple 'plot' function with type="l" to avoid this, but sometimes you may want to customize that axis anyway. It's really up to you and what you are comfortable with.

###draw line plots of each over time using plot.ts
png(file="winattbyyearTS.png", height=800, width=1400)

par(mfrow=c(2,1))

plot.ts(yanks$att, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", lwd=2)

plot.ts(yanks$win, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", lwd=2)

dev.off()
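
One way around the axis issue, by the way, is to hand "plot.ts" a real time series object that starts at the first year in the data. This is a minimal sketch, and it assumes the seasons are sorted by year with none missing:

##give plot.ts proper year labels by building a ts object first
att.ts <- ts(yanks$att, start=yanks$year[1])   #annual series starting in 1903
plot.ts(att.ts, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", lwd=2)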


One thing to remember is that we can also first plot the points, then the lines on the plot using "lines()" after you do your scatter plot. Depending on your objectives, this may be helpful to identify where each year is on the line (remember "cex=" tells R what size to make your points, while "pch=" tells it what types of points you'd like to use).

##Points and lines
png(file="ptsandlines.png", height=800, width=1400)

par(mfrow=c(2,1))

plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", pch=16, cex=1.5)
lines(yanks$att ~ yanks$year, lwd=2, col="darkgrey")

plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", pch=16, cex=1.5)
lines(yanks$win ~ yanks$year, lwd=2, col="darkblue")

dev.off()


So that's what I have for the beginners today. I know it's a bit short and not much in addition to last time, but I'm swamped today. Next time, I'm going to get into plotting a regression (and loess) line on these plots as well as cover some more color options like backgrounds and transparent colors. Using a regression line, we can get a better idea of the association on a scatter plot than simply looking at the points, while a loess regression can help to identify patterns in data (and can be very useful in visual inspection of time series data). As usual, I post the pretty R code below for this post:


######################
###Line Plots and Time Series Plots
######################

#setting working directory and loading the Yankees attendance data
setwd("c:/Users/Brian/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

#load data
yanks <- read.csv(file="NYYapg.csv", h=T)
head(yanks)

#plot relationship of winning and attendance

png(file="winatt.png", width=600, height=500)
plot(yanks$att ~ yanks$win, xlab="Yankee Win Percent", ylab="Yankee Per Game Attendance",
main="Yankee Win Percent vs. Attendance", pch=16, col="darkblue")
dev.off()

##plot variables as function of time
png(file="winattbyyear.png", height=800, width=1400)

par(mfrow=c(2,1))
plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", pch=16)
plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", pch=16)

dev.off()

###draw line plots of each over time
png(file="winattbyyearLINE.png", height=800, width=1400)

par(mfrow=c(2,1))
plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", type="l", lwd=2)
plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", , type="l", lwd=2)

dev.off()

###draw line plots of each over time using plot.ts
png(file="winattbyyearTS.png", height=800, width=1400)

par(mfrow=c(2,1))
plot.ts(yanks$att, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", lwd=2)
plot.ts(yanks$win, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", lwd=2)

dev.off()

##Points and lines
png(file="ptsandlines.png", height=800, width=1400)

par(mfrow=c(2,1))
plot(yanks$att ~ yanks$year, xlab="Year", ylab="Yankee Per Game Attendance",
main="Yankee Attendance Across Time", col="darkblue", pch=16, cex=1.5)
lines(yanks$att ~ yanks$year, lwd=2, col="darkgrey")
plot(yanks$win ~ yanks$year, xlab="Year", ylab="Yankee Win Percent",
main="Yankee Success Across Time", col="darkgray", pch=16, cex=1.5)
lines(yanks$win ~ yanks$year, lwd=2, col="darkblue")

dev.off()

Created by Pretty R at inside-R.org

Tuesday, February 8, 2011

Fantasy Ball Junkie: Designing Your Perfect League

I have a new post up at Fantasy Ball Junkie. I basically discuss things to look out for when designing your own custom league. It's important to remember that there is no universally perfect league. You design your own league to cater to what you want out of fantasy baseball. But that doesn't mean you should disregard the consequences of implementing certain rules structures. In the article, I briefly talk about caveats and consequences of certain rules in order to guide you in your quest to design your own perfect fantasy league.

Wednesday, February 2, 2011

Fixing Up smoothScatter Heat Maps

A while back, I posted an article using the smoothScatter function in R that builds a color representation of density for scatter plots. When I first found the function, I was extremely excited because it's a very easy and automated way to make a heat map! Unfortunately, the more I messed with the function, the more annoying it became. But that's not to say it doesn't produce very very pretty pictures.

I've had a lot of inquiries about this function lately, as Harry Pavlidis at THT, Dave Allen at Fangraphs, and Chris Quick at Bay City Ball have implemented it in recent articles based on my original code. However, there is one real problem with the function: it automatically chooses the plotting range for the data.

Now, it absolutely should pick how far out to extrapolate a kernel smoother (it's generally not a good idea to ever go outside the bounds of the data). However, the ability to control the plotting is a bit wonky. In this case, the function chooses the axes in a way that is often off-center or not comparable across data sets with different ranges--and comparable axes are exactly what you need when plotting Pitch F/X data. I've tried using the xlim and ylim options, but this unfortunately makes things worse: if you use these within smoothScatter, it just leaves a bunch of white background beyond where the function chooses to smooth the data. See below for the problems we can run into:

Whitespace:

Off-center:


Chris and others have inquired about this, and I found a few fixes...none of which are great, but I don't think there are any other options.


Option 1:
Create a color palette in which the color representing the lowest density is white.

For this, when you indicate the colors to be used for smoothing with colramp=colorRampPalette(c("col1", "col2", ...)), your first entry should be "white". Choose carefully, as a white background usually works best with a single color or a group of similar colors (i.e. Red only, Blue only, Red/Orange). This works okay, but I don't think it looks quite as nice as having a darker background. A darker background really makes things 'pop'. Below I have a Bruce Froemming "Called Strike" and "Called Ball" pitch density map by location using this white background with an all-red palette:
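
Here is a minimal sketch of the sort of call that produces a plot like that. The data frame "froemming" and its location columns "px" and "pz" are hypothetical stand-ins for your own Pitch F/X data:

##Option 1: white as the lowest-density color in the palette
smoothScatter(froemming$px, froemming$pz,
colramp=colorRampPalette(c("white", "pink", "red", "darkred")),
nbin=500, bandwidth=0.25,
xlab="Horizontal Location", ylab="Vertical Location",
main="Called Pitch Density")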



Option 2:
Use par(bg="") just before smoothScatter.

This option works as long as you set the "bg" color (the background) to the first color in your smoothing palette. This way, you can set your axes the way you want, and everything that was white before will just be filled in as essentially zero density. Unfortunately, this also colors the background beyond your axes, behind your plot title and axis labels. This is certainly not optimal, but if you use the right colors it may not turn out too badly. Notice how dark things look even with a Red palette:
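
A sketch of Option 2, again with the hypothetical "froemming" data; the key is that the "bg" color matches the first color of the palette:

##Option 2: set the device background to the lowest-density color
par(bg="darkred")
smoothScatter(froemming$px, froemming$pz,
colramp=colorRampPalette(c("darkred", "red", "orange", "yellow")),
bandwidth=0.25, xlim=c(-2, 2), ylim=c(0, 5),
xlab="Horizontal Location", ylab="Vertical Location",
main="Called Pitch Density")
par(bg="white")   #reset the background when you're done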



Option 3:
Use rect() to draw filled rectangles in the ranges where the function does not fill when you custom-set your axes.

This is the most flexible option. Unfortunately, it involves some guess-and-check to be sure you don't overlap your rectangles on top of areas where there is some pitch density. This isn't as easy as it sounds, and sometimes it is impossible (especially if there is some density of pitches near the edges of what smoothScatter chose to plot). For this method, we use the "rect()" function, indicating "col=" to set the fill color and "border=" to make the border the same color. See if you can tell where the rectangles begin at the edges of the plot below. In some places you can see evidence of a line that overlapped where I would rather it didn't:
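
And a sketch of Option 3; the rectangle coordinates below are made-up values that you would find by guess-and-check on your own plot:

##Option 3: patch the unsmoothed margins with matching rectangles
low.col <- "darkred"   #same color as the low-density end of the palette
smoothScatter(froemming$px, froemming$pz,
colramp=colorRampPalette(c(low.col, "red", "orange", "yellow")),
bandwidth=0.25, xlim=c(-2, 2), ylim=c(0, 5))

##fill the strips the smoother did not reach (coordinates found by eye)
rect(-2.2, 4.3, 2.2, 5.2, col=low.col, border=low.col)   #top strip
rect(-2.2, -0.2, 2.2, 0.4, col=low.col, border=low.col)  #bottom strip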


Finally, an extra suggestion: use the "bandwidth=" option in your smoothScatter plots. I had not bothered with this on my first run with the function, and by default it uses an automatic bandwidth chosen by the "bkde2D" function it calls (from the KernSmooth package). For the data I've worked with, 0.20 or 0.25 works relatively well. Of course, what the optimal smoothing really is depends on your data and what you want out of the plot.

That's all I've got for now. I wanted to get this up to help people out a little bit, but I have to get back to my work (they expect me to finish this dissertation at some point, I guess). I really think this function makes some of the best-looking heat maps out there; I just wish there was a little more customization possible with it. Good luck!

And for good measure, here is my original color scheme that I really love. Just not sure I like the background of everything to be so dark:



Addition: See the comment section for another suggestion by Dave Armstrong. His solution is far easier. I had tried this before, but ran into problems when I forgot to include "add=T" to the parameters within smoothScatter. There are still distinct edges to the image, though, and I'm going to try and see if I can fix things up within the function myself. (Don't expect too much from me on that part!)

Addition 2: Dave beat me to the punch and fixed up some inner workings of the function. I want to thank him for his help. This is why using R for research and analysis is great: there is a huge support system everywhere! And there is always something new to learn.

Tuesday, February 1, 2011

Neyer at SB Nation

Rob Neyer has his first article up at SB Nation. I'm posting this here not because I am a huge Neyer fan, but because of the statement it makes about the changing landscape of media. In fact, I've never read a Neyer article in full--not because I don't find him intriguing or intelligent, but largely because I haven't had the time to do so.

No, the significance here is that Neyer is a member of the BBWAA who left the most well-known sports media company in the world to write for a website boasting to be all about the fan perspective. His first article explains his thoughts about an "Us vs. Them" mentality he experienced as a writer for what people like to call the mainstream media. I think this is the shift that is significant.

Neyer's fame in baseball circles--at least the way I know of him--seems to stem from his following among sabermetrically oriented minds. In a world of changing media and information sharing, Neyer has taken a step that I imagine many writers will scoff at. While the move of one writer from ESPN to another large sports network, SB Nation, isn't in itself a huge deal, it seems as though this may be a symbol of the changing role of professional writers and of the newspaper and magazine column as it was once known. Media and information have been changing for a while, and this move is another example of that fact.

Will more jump ship? Does a place like SB Nation have the resources to attract more writers away? And what impact does this have on the makeup of a site like SB Nation, a site claiming to be "Pro Quality. Fan Perspective"? I'm certainly not suggesting that Rob Neyer isn't a fan and student of sport, but the line for SB Nation seems to become a bit blurred--not in a bad way by any means, but in a way that really seems to shift the focus of writing from "telling" to "discussing", not only for fans but for writers with authority in their respective fields. I think that's good for everyone, and it seems as though the right person was picked to cross this line.

I am interested to see what SB Nation and Neyer have planned for this marriage.