Thursday, December 1, 2011

A Book on Umpire Performance

I ran across this today when trying to find a paper on Google Scholar. While it sounds interesting, there is no use of pitch location data as far as I can tell. Mostly, it seems to report ball and strike percentages and ultimate game outcomes, with a profile for each umpire. Certainly interesting to see, though I provided much of this for free in some of my previous posts (with pitch location information included). For those, see HERE and HERE and HERE.

Anyway, thought some visitors may be interested.

Wednesday, November 30, 2011

New MLB CBA: Owners win!

I haven't had much to say about the MLB CBA that was recently agreed upon. Really, the whole thing seems a bit silly, and I'm not totally sure how some of the things will play out. Whatever the result, I don't think it's nearly as big a deal as some seem to be howling about. What should be howled about is the screw job on the young players, not so much the implications for professional baseball.

It will probably give slightly more incentive for players to go to college, I guess. But that only depends on the return to going to college, as they'll be subject to negotiations with only a single team the next time they're drafted. If so, then so what? College players generally make it to MLB a bit quicker, so they may still arrive at the same time (or just slightly later) they would have otherwise. But ultimately, those players will end up in MLB. How many multi-sport players are we really talking about? I can't imagine this is significant at all to the total talent in the league. Here are a few thoughts.

International Players
The idea that large market teams don't invest heavily in international players, and that this cap will only affect the small market teams seems ridiculous to me. If anything, I think the cap gives smaller teams more incentive to invest in international talent and training.

Before, teams really had to worry about investing heavily in training only to watch international players sign with a team that had more money. Now, the return to training is higher (lower bonuses) and the probability that the player will sign with the team that trained them is higher (the market has fewer huge offers).

This could have the effect of redistributing talent across all teams, certainly. But this, just like before, is a free agent market for these players. Yes, with more uncertainty and less leverage for the players, but a competitive market nonetheless. Before making a final judgment here, I would have to see the distribution of international spending by team. I suspect it is not highly negatively correlated with market size, meaning the effect is just cheaper talent.

This restriction probably helps all owners, unless it creates a situation in which big market teams decide to only target 'sure thing' players, and allocate all of their cap to a single player trained by another team. Not sure how reasonable of a worry that is, but I'd be interested in someone enlightening me on the data here. Again, the losers here are the international players and the bonuses they're paid to feed their families.

Slotting in the Draft
First, I think it's a bit silly that if a pick isn't signed, those dollars can't go toward signing another player. This is pretty classic "screw the guys with no representation" at work. The owners' interest in the slotting is pretty obvious: they get cheap talent even cheaper. The veterans, I think, were a bit misguided on this one. They probably assume that the money not going to the draft picks will be reallocated to them in the free agent market. As Lee Corso would say, "Not so fast my friend!" (As a note, this thought was sparked by a question from Sky Kalkman on Twitter, so I'm going to try and lay it out fully here.)

Outside of the impact on the likelihood of signing picks, the slotting really has no consequences in the free agent market (if teams don't sign their picks, they might want to spend some money to replace the talent they would have otherwise had). Assuming that the large majority of draft picks are signed (they really don't have a more lucrative option, do they?), the teams are getting the same amount of talent they had before, but for less money. So there's a large surplus in the draft. Surplus has to go somewhere, right!?!

Right!...into the owners' pockets (or into another, more lucrative investment). Major League Baseball has no requirement on the percentage of revenues that must go toward MLB salaries, outside of the minimum salary requirements to individual players. Even if it did, I'm not sure that this would cover rookie signing bonuses. But let's assume it did have this, and it included both draft bonuses and salaries. Then teams would have to make up that spending somewhere else. Depending on the stipulations of the minimums on payroll and bonuses, it very well could go into the free agent market. But this isn't the case. The teams have the same talent they did before, but for cheaper. Would buying more talent make sense?

It could, under certain conditions. But I don't think these hold. First, it would require that their marginal revenue be above the marginal cost. In other words, teams spend up to some point where these two are equal. But they're already doing this, under the assumption of profit maximization.

Unless the marginal revenue for an additional win increases or the marginal cost decreases, there isn't an incentive to spend more. Yes, the total cost (average cost of a win) decreased for them due to the slotting on the whole. But, in the competitive market for free agents, there is no reason to believe that the cost for one more win has gone down (in fact, if you believe more money is being reallocated there, it would increase!). As for marginal revenue, why would that increase? If the CBA increased interest in the sport as a whole, then maybe it would a tiny bit. But I doubt that's the case.

Here's an example. Let's say the Rays spent $15 million on rookies last year. At first glance, a veteran might say "Hey, this year they only have to spend $2 million. Then they'll spend the other $13 million on us! Woohoo!"

But--taking into account the uncertainty of draft and veteran talent in the future (let's just talk in expected values) and signing all picks--they have the same talent they did last year. To buy more talent, they have to enter the competitive free agent market. Last year, let's say an additional expected win for Tampa would increase revenues by $3.9 million (no reason to think this changed from last year due to the new rules). Similarly, the market price for a free agent is at $4 million (and why would this decrease now?). It isn't rational to spend $4 million for the additional $3.9 million in revenues. Sure, they could get from 88 to 89 expected wins for cheaper than they could last year. But it would decrease overall profits to reallocate that money under these new rules. Therefore, it's not likely that they'll go into the free agent market and spend that money by choice.

Now, there is an exception to this. If, under collective bargaining, the agreement was to slot draft picks only if the minimum salary was increased, then some of the surplus goes to the players. It is bargaining after all, and we can't assume that all the vets are stupid. In fact, the MLB minimum and 40-man roster minimum salaries were increased by about 16% each. But this minimum still does not go into calculating the marginal cost of an additional win. In, say, a WAR equation, you just add it to the total for each player: Salary = $414,000*1.16 + $B*WAR. Therefore, it's a new fixed cost of operation in MLB (a new intercept for the equation). The amount you pay for marginal wins is independent of the minimum salary in this case. The total additional cost to owners is ($480,000 - $414,000) * 25 players * 30 teams = $49,500,000, plus the 40-man increase of ($78,250 - $67,300) * 15 * 30 = $4,927,500. Include guys getting sent up and down, and we'll put it at a cool $55 million.

According to Jim Callis of Baseball America, teams spent $192 million on bonuses in the first 10 Rounds of the 2010 MLB draft (I think I'm reporting this correctly). That means that the owners are left with a surplus of $192 - $55 = $137 million total. So, here both the owners (+$137 million) and players (+$55 million) gain from the new agreement. Other than the agreed upon increase in minimum salary, there isn't any reason to believe that owners will reallocate this money saved in the draft to free agents.
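To keep the arithmetic straight, here is the back-of-the-envelope math from the last two paragraphs as a quick R sketch (these are just the figures quoted above, rounded the same way):

##back-of-the-envelope math on the new minimums and the draft surplus
mlb.min.increase <- (480000 - 414000) * 25 * 30    # about $49.5 million
roster.increase  <- (78250 - 67300) * 15 * 30      # about $4.9 million
players.gain     <- 55e6                           # rounded up for call-ups
draft.bonuses    <- 192e6                          # Callis' 2010 top-10-round figure
owners.surplus   <- draft.bonuses - players.gain   # about $137 million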

Certainly these results change under revenue or win maximization. But for North American sports, profit maximization is usually assumed to best describe owners.

As always, I welcome (and enjoy) any thoughts and criticisms or clarifications about things I've misunderstood here.

ADDENDUM: I realized that I didn't subtract from the $192 the allocation that still goes to the draft picks. Woops! Should be same conclusion, just less share than $137 million for the owners. Assuming they spend $2 million on average, they'd be looking at more like a $77 million surplus in their pockets. At $3 million, you're at $47 million, and so on. Previously, they were at a $6.4 million average total per team in the first 10 Rounds. Not sure what the exact slotting will be, but if we assume it cuts this in half, then we're near the $50 million mark and the players and owners may have split this up near 50-50. This likely means that, in terms of percentage increase, the lower-level players (Craig Counsells of the world) are sitting pretty.

Monday, November 28, 2011

Soared in Value? Probably Not As Much as Indicated

Freakonomics reports on this Pittsburgh Post-Gazette article claiming that Steelers personal seat licenses have increased in value by as much as 1,400 percent. While I won't deny that they have probably increased in value, this is an overstatement. Here's why.

They compare the current secondary market prices to those posted by the Steelers in 2001. The first mistake here is not understanding that most teams price to sell out (in the inelastic portion of demand). This is also likely true with PSLs. There are a few reasons for this, one being that they maximize profits not just on ticket sales, but also on concessions and memorabilia within the stadium. Second, there is some economic theory suggesting that sports teams and restaurants keep prices low, as their product depends on other people also liking and consuming it (and supply is fixed). There could also be a backlash if the team gets lots of public funding and then turns around and soaks up lots of consumer surplus with PSLs. They want to keep some good will, though this is hard to show in practice. Whatever the reason, it invalidates the direct comparison of the Steelers' prices for PSLs and the prices on the secondary market.

In 2001, when the Steelers sold their PSLs for Heinz, it is VERY likely that the market would have paid significantly more than what they were going for. One could verify this by looking at 2001 sales of PSLs on the secondary market. While this still made them a great investment for the savvy ticket seller, it means that the demand for PSLs likely did not increase fourteen-fold in the past 10 years. One would have to compare the secondary market for these then to the secondary market for them now.

The second thing they didn't do (at least they make no indication of it) was adjust for inflation. This again makes the numeric comparison invalid. 2011 dollars are worth less than 2001 dollars. Let's use 2010 dollars as an example. They're worth about 81% of what 2001 dollars were. So you have to first discount this amount before making a real comparison of value or changes in demand.
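As a rough illustration of what that adjustment alone does to the headline number (the 81% figure is approximate, and the 15x multiple just restates the 1,400 percent claim):

##deflating the nominal increase: a 1,400% increase means the price is 15x
nominal.multiple <- 15
deflator <- 0.81                                   # 2010 dollars per 2001 dollar
real.multiple <- nominal.multiple * deflator       # about 12.2x
real.pct.increase <- (real.multiple - 1) * 100     # roughly 1,115%, not 1,400%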

All in all, it's likely that the Steelers have seen some great financial return on their success over the past 10 years. However, it's just not nearly as large as the Pittsburgh Post-Gazette seems to want to indicate. I guess those two losses to the Ravens this year left them looking for something positive to report on (*wink* *wink*).

Saturday, October 29, 2011

Maximizing Sabermetric Visual Content: Smooth Comparisons and Leveraging Color

A recent post by Mike Fast got me thinking a bit more about color. For most people, color is a secondary concern at best. But I am here to tell you it should be a primary concern in your statistical presentations. This is especially true when analyzing the strike zone.

Before you begin reading this, please read Mike's excellent post over at Baseball Prospectus. Then, go ahead and read this article at Praiseball Bospectus (linked not because of its title--I am very glad Dave Allen was born--but because it really does highlight some issues with things you'll find around the net).

Okay, now that you have read that, here are my additional comments. First, heat maps should be approached with caution. This is true whether you are smoothing or simply breaking the zone up into smaller areas. Mike covers this well, but I will take it a little further with smoothing.

When you use a smoothing technique, you really need to understand what it is doing. I'm not going to fully describe loess techniques (or smoothing splines, or kernel density functions, etc.). There are plenty of resources online. Often, the degree of smoothing is up to the researcher. However, it is almost always the case in baseball analysis that we want to compare one smoothed representation or heat map to another one. This is where things get tricky. You'll need to make sure you are not oversmoothing (not wiggly enough) or undersmoothing (too wiggly).

Sample size is of course the first issue. If you are going to present BABIP by pitch location for a single batter or pitcher, you are likely going to need to regress the data a lot. Pitch data are extremely noisy. Secondly, you really need to account for the fact that the batter CHOSE to swing at those pitches. The pitches that a batter swings at are a distribution nested within all pitches thrown. Then, the pitches that are contacted are yet another subset of this distribution. Sometimes, this is no big deal. Other times, it is an extremely big deal.

Let's ignore the second issue above and focus first on sample size and comparisons. Next, let's restrict ourselves to evaluating the likelihood of a hit on contacted pitches. Let's say one batter has made contact with about 1600 pitches, while another has made contact with about 250 pitches in our sample. Let's just ignore the 'regression to the mean' issue here as well. You know what, let's make it the same batter in both cases, with the second being a random sample of the first bunch. If we use exactly the same smoothing parameters for each (with no restriction for the distribution being binomial, which technically it should be--more thoughts on this issue here) we will get the following (extremely rough, and somewhat ugly--keep in mind I am not regressing here, just showing what happens with the different sample sizes) comparison below:

Because I have not restricted the data to be between 0 and 1, just assume the white splotches are where the probability of a hit on a ball in play is 0% (i.e. white==really really cold zone--I will leave aside the VERY IMPORTANT issue of ensuring the same color scaling on the sidebar for another post!). You can see above that, even though we're looking at the same player, the maps are very different. There are likely many problems here, as we would not expect pitches low and down the middle (remember, this is all within the strike zone) to be almost a 0% chance of a hit. Why? Well, the player above is Albert Pujols. Plus, when we look at the full data on balls in play, we see that the probability is closer to what we would expect (though, according to this data, still a cold zone).

You can also see that one plot shows a hot zone on the outer half, while the subset shows hot zones up and in as well as at the bottom of the zone. This is a result of having very little data in these areas, and it is ultimately overweighted with the given smoothing parameter. If Pujols gets one hit out of two pitches at the knees, it reports his BABIP to be .500 if we do not smooth enough or weight it along with other player data. Of course, we wouldn't expect him to have a .500 BABIP in the future on these pitches. Throw him 1000 pitches there to swing at, and he is really not likely to get 500 hits.

So, with the same smoothing parameters, these plots really are not comparable to one another.
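For concreteness, the comparison above amounts to something like this (a rough sketch only; the data frame and column names--bip, px, pz, hit--are placeholders for my own data, and the span is an arbitrary choice):

##fit the SAME loess smoother to the full sample and to a small subsample
full  <- bip                                   # ~1600 contacted pitches
small <- bip[sample(nrow(bip), 250), ]         # random subset for the same batter

fit.full  <- loess(hit ~ px * pz, data = full,  span = 0.5, degree = 2)
fit.small <- loess(hit ~ px * pz, data = small, span = 0.5, degree = 2)

##predict both fits over the same grid of locations and compare the surfaces
grid <- expand.grid(px = seq(-1, 1, length = 50), pz = seq(1.5, 4, length = 50))
surf.full  <- predict(fit.full,  newdata = grid)
surf.small <- predict(fit.small, newdata = grid)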

Now, we could reconsider the smoothing parameter for the smaller data set (probably a good idea!). However, the problem is that we don't know at what point of smoothing we're overfitting or underfitting. You can imagine the problem is much more difficult when we are comparing two players against one another.

One way to attack this issue is to use a generalized cross-validation technique (this can be done with the "mgcv" package in R). Using this method, I have found that we need a large sample size of pitches. The method really breaks down for the small subset; however, it allows not only for a binomial representation of the data (rather than smoothing it with an assumed Gaussian distribution), but also for optimizing the smoothing parameter so we can compare across different sample sizes and distributions of pitches and BABIP.
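For those curious, a minimal sketch of what that looks like with mgcv (again, the data frame and column names are placeholders, and the basis dimension k is just one reasonable choice):

library(mgcv)

##binomial GAM over pitch location; the smoothing parameter is chosen by GCV
fit <- gam(hit ~ s(px, pz, k = 50), family = binomial(link = "logit"),
           data = bip, method = "GCV.Cp")

##predict the fitted probability surface over a grid for plotting
grid <- expand.grid(px = seq(-1, 1, length = 50), pz = seq(1.5, 4, length = 50))
grid$p <- predict(fit, newdata = grid, type = "response")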

Okay, I could go on for a loooooong time and get really "mathy" with the considerations I mention above. However, I'll just point everyone toward the book on Generalized Additive Modeling by Simon Wood (2006). It is honestly one of the best resources I have ever come across in statistics, but to implement this with Pitch F/X you generally need a pretty large data set. One needs to be careful and be sure to fully understand all of the options that can be implemented. Using this method with the right type of data, you can ultimately create something like this (a strike zone map):

Before I get too far off on a tangent, let's return to the initial point of this post: COLOR. For this, I'll stick with strike zone maps.

The first question is: Why in the hell would we want to use color anyway?
The answer is: It can help to communicate muddy scatterplots more easily.

For example, below we have three scatterplots: Called Balls, Called Strikes, and the two combined. It is easy to tell where the definite strikes and definite balls are, but when we overlap the two plots, the strike call likelihood becomes nearly uninterpretable at the edges.


Another consideration--noted by J-Doug--is color blindness. The Green-to-Red plots for BABIP are likely a poor choice (as are the scatter plots shown above!). Many people (about 8% of males) are unable to discern greens and reds, so using these within the same image is a bad idea. One way to evaluate your colors is to see if they are interpretable in black and white. Let's check out the strike zone plot in black and white:


With the colors I use, it seems that for someone with complete color blindness, I have failed this test. However, with some knowledge of a strike zone, this person would be able to understand that the dark within the zone is high strike probability, while the dark outside the zone is low strike probability. They are also able to find that spot where the strike probabilities are changing the most (but this likely isn't satisfactory). Which brings me to my next consideration...

Color is an important factor in your visual, depending on what you want to highlight to the reader. In the first heat maps, we may want to be able to read the smoothing across the strike zone at a very granular level. However, for the strike zone map above, we may be more interested in the place where the likelihood of a pitch being called a ball becomes higher than the likelihood of the pitch being called a strike (here, the yellowish-whitish band).

When the interest is in gradual changes across a heat map, I find it a good idea to use a single color. This way, there is no "breakpoint" from red-to-blue or from green-to-yellow; the same color just gets lighter and lighter as you go. Below I have an example of using a single color for determining densities of called strike locations (i.e. where they are thrown least or most). Here, I use a "red-to-white" palette and then switch it to "white-to-red".
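Single-hue ramps like this are easy to build in R; here is a quick sketch (the matrix z just stands in for whatever density or probability surface is being plotted):

##a single-hue ramp from white to red (reverse the vector for red-to-white)
pal <- colorRampPalette(c("white", "red"))(100)

##example: plot a placeholder surface with the single-hue palette
z <- matrix(runif(2500), 50, 50)
image(seq(-1, 1, length = 50), seq(1.5, 4, length = 50), z, col = pal,
      xlab = "Horizontal Location", ylab = "Vertical Location")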


I would love to get comments on others' opinions about and experiences with using color. I have not gone too in-depth, but I hope to follow up with a number of examples of color use for the same image and how this can allow highlighting certain areas of a visual. Also, with feedback, we can try and develop a consensus on what the optimal choices are for the majority.

Thursday, October 27, 2011

Article at JQAS: Baseball Hall of Fame Voting

The newest issue of the Journal of Quantitative Analysis in Sports has been published today, and it features a number of interesting articles. In honor of shameless self-promotion, I would like to highlight the following article:

Using Tree Ensembles to Analyze National Baseball Hall of Fame Voting Patterns: An Application to Discrimination in BBWAA Voting
Brian M. Mills and Steven Salaga

The link above should be un-gated. If it is not, please let me know and I can share the article. This is my first first-author academic publication, so go easy on me (and Steve). If you read my recent post about our joint poster at the 2011 Joint Statistical Meetings in Miami, this analysis should sound rather familiar. Please place questions or feedback in the comments if you have them, or feel free to shoot me an email.

We actually began this work a while back as a class project, and decided to turn it into an academic paper with some guidance and encouragement from our adviser. A paper using the same technique came out last year (Frieman, 2010), which gave us a chance to add to this work by including pitcher predictions and extending the work to the economic literature on discrimination in Hall of Fame voting. Our work differs somewhat from Frieman, and this is explained within the paper. In fact, a (very) preliminary version of the work was on this website a while back; however, after the Frieman paper was published, I was a bit worried about getting scooped even further (no foul play there--we just happened to be doing a very similar analysis at the same time).

Of course, R was used exclusively for the analysis. Also, you may note that some familiar names are cited. These include Cy Morong, Bill James, Jayson Stark, Peter Gammons, Tom Verducci, Chris Jaffe and, yes, Tom Tango (related to the Tim Raines site, of course).

If you have criticisms, please present them respectfully and keep in mind that we don't think this analysis (or ANY analysis) is the last word on any issue. Also keep in mind that future predictions are based only on statistics as of 2009 (without career projections), so they predict future induction under the assumption of retirement after the 2009 season. But it was a lot of fun, and it shows some promising results for using the technique in sports prediction. There is a lot of Hall of Fame voting literature out there, and this is another addition to it. Hopefully we can have a comprehensive model for hockey players soon, too.

Tuesday, October 25, 2011

Sabermetrics Meets R Meetup

I just ran across this post at Big Computing. On November 14th, there will be an R User meet-up in Washington, DC (Tyson's Corner) led by Mike Driscoll about using R for sabermetric analysis (linked here). I will actually be home in Maryland for a couple weeks, and likely in DC on that Monday so there's a good chance that I will try and stop by this meet-up. If anyone else is in the area and would like to come by, let me know. I always enjoy meeting fellow statistics/sports dorks. I imagine this will be a great extension to the tutorials that I have had here, coming from someone with much more expertise in statistics and statistical programming than I.

Hat Tip: Kirk Mettler

Thursday, October 13, 2011

Insane Musings on Realignment

I was back in Maryland this past weekend for a wedding and visiting my fiancee's family. Her father is a massive G-Town fan and graduate, and has been on the admissions board and academic advisory committee there. He drives 2 hours each way to go to all of the basketball games. I get blasted for not screaming and cheering when I go. But it's all in good fun.

He's disgusted by the inability of the Big East to hang on to its big name schools in recent years, and worries that Georgetown is going to have difficulty recruiting without the big name FBS schools in the conference.

This got me thinking: football is definitely a big winner, but there are a lot of basketball fans out there, too. Smaller alumni bases make it difficult to estimate a television contract, but I would not be surprised to see basketball-only schools (and perhaps Notre Dame non-football) realigning to form their own national basketball mid-major powerhouse conference. There are endless possibilities, but I see the following fitting together nicely in a conference like this (or, you could also just realign to a Catholic basketball conference with many of them):

Georgetown
Notre Dame
Villanova
St. Johns
Providence
Gonzaga
Depaul
Xavier
Marquette
Butler
St. Mary's
Temple
Duquesne
Old Dominion
Creighton
Memphis

And possibly:
Davidson
Seton Hall
George Washington
George Mason
Richmond (at the suggestion of Brian in the comments)

Obviously, this depends on whether or not schools like UConn, Louisville and West Virginia have enough clout to pull in significant conference revenue on the basketball side (perhaps basketball and football get some kind of package deal for the conference?). But I wouldn't be surprised to see something like this happen. Realigning so that there is still high quality competition within the conference could help all of these schools recruit. Notre Dame would likely be joining a BCS-level conference. Georgetown is obviously the big wild card on whether or not something like this happens, and they may have too much pride to stray from the BCS-type schools. I really don't know. I think it would be fun to watch, though.

Then again, maybe (probably) it's a silly idea. What say you?

Thursday, September 29, 2011

Crediting the Rise of "Data Science" to Sabermetrics

As a graduate student in Sport Management, Statistics and Economics, I am quite interested in the emerging "Data Scientist" profession. My current programming skills are mostly limited to statistical programming in R, Stata and SPSS (I am trying to begin dabbling in SAS and Matlab more). I wish I had more skills with Python, C, SQL, Perl, Access and the like in order to scrape data myself more efficiently. I can do some basic SQL queries and read Perl script to understand *what* it's doing, but starting from scratch with these things would require a bit more free time than I have at this point in time.

I could really become more efficient in my R programming (something I continue to work on), and given the popularity of SAS outside of academia, it would be good to get familiar with advanced programming there. Unfortunately, I have never had a formal computer programming class. Most of my statistical programming has come from my own fiddling and from learning statistics in classes here at Michigan. Don't get me wrong. I think I have a relatively unique and useful skill set, but there's always lots to learn and there are many other places exhibiting skills that I just don't have. And definitions of "data scientist" often include significant database management ability. I have some skills here, but they are not anywhere near those of a formally trained computer scientist or IT/data architect.

Anyway, the point of this post is to redirect readers to this presentation by Harlan Harris who talks about what "data science" really is. Why link it here? Well on the final page, Harris says the following:

"Sabermetrics was a trigger for widespread growth. Demonstrated wider applicability of stats methods, and drew attention from business."

A pretty strong quote, and one that I do agree with in some sense. Interestingly, sports organizations have been among the slowest to adopt these changes in technology and the ability to dig into data. Harris suggests here, I think, that other businesses caught onto sabermetrics before the teams the analysis was directed toward did. Pretty interesting stuff! I think the combination of open source programming and the rise of blogging was the real culprit here. However, sabermetrics provided talented people with a way to apply data science to something fun and interesting. In this sense, it made it easy to communicate stories about the usefulness of data analysis in everyday business decisions.

So here's my question to those doing analysis with sports data: Would you consider yourself a "data scientist"? And if so, do you feel that full-on "hacking" skills are required to consider oneself as such? Certainly they're a plus, but can two heads (a stat-based person and a Perl-to-SQL scraper) come together and both be data scientists? Leave me something in the comments if you'd like!

Friday, September 23, 2011

IJSF Sports Economics Research Rankings

A recent paper by Jose Manuel Sanchez Santos and Pablo Castellanos Garcia in the International Journal of Sport Finance puts forth rankings of Sports Economics papers and Sports Economists. They create an index for this ranking (please refer to the paper if you're interested). Of course, there are lots of familiar names on there, but what I wanted to highlight here was the dominance (in a self-interested light, of course) of the University of Michigan Sport Management Program in the field of Sports Economics, Sport Finance and Development. Based on the rankings, we have the #1 (Stefan Szymanski), #3 (Rodney Fort), #27 (Mark Rosentraub) and #57 (Jason Winfree) academic sports economists in the world. They are all within the department. Quite a powerhouse we have here :-)

The University of Alberta comes in with Brad Humphreys (#4) and Dan Mason (#7), but they are technically in different departments there. I have had the pleasure of meeting Dr. Mason as well as another ranked economist in the paper, Joel Maxcy (who is now at Temple). I am happy to say that I have had some email contact with both Young Hoon Lee (who has helped me immensely in the econometrics programming in my dissertation) as well as JC Bradbury.

Other familiar names abound on the list, and I look forward to meeting #21 Andrew Zimbalist in November when he comes to speak about Title IX. These rankings are always a fun exercise, but they aren't necessarily any sort of end-all ranking of the 'best' researchers out there. However, I think there is little doubt that this is a headquarters for sports economics. Each of the professors listed above is very different, which gives us great diversity as well.

I have benefitted immensely from the structure of the department here at Michigan (as well as other departments). Much of this was luck, as I arrived at the right time when serious evolution of the faculty and program was taking place. There is no doubt that--for the quantitatively and economically inclined sport fan--this is the place to be. For those interested in other aspects of Sport Management, we have some pretty powerful faculty as well. It's really been quite a thrill to bump elbows with many of those on this list, and it's been an honor to study here in the department for going on 5 years!

Friday, September 9, 2011

Fail Post: Failure in Baseball Knowledge

A couple weeks ago on the plane back to Ann Arbor, I decided to open up Sky Mall and found the following:


I actually laughed out loud on the plane. Let's treat this as a Highlights Magazine game where you circle all the things wrong with this picture. You'd think that a well-known company like Steiner could do a little more research before putting this joke of an ad in a magazine.

Let's begin with the heading for this area: Future Stars. No complaints about Troy Tulowitzki, and Austin Jackson is reasonable. But Tulo isn't a star of the future, he's a star now. Chase Headley pushes the limit of naming someone a "Future Superstar". But I could live with that.

Answer key below:


Rick Porcello, Tigers ace? Nearly-37-year-old R.A. Dickey a future star? Jeff Francoeur, future star and ultimate clutch hitter? Hmmmm.

Wednesday, September 7, 2011

Link to StatDNA Guest Post

The post is officially up on the StatDNA blog. Go check it out.

As I said in my previous post, this is a very rough and preliminary model. This is why my work was not any sort of formal entry, just some fun with some great data.

I used a Vector Generalized Additive Proportional Odds Model to evaluate the change in win probability for each event listed in the StatDNA data, given the spatial location and time left in the game (as well as the score). Things turned out pretty well for this rough version, and the WPA rankings are pretty close to what the EA Sports Index reports at the EPL website. Because I haven't finished the model, I won't release all of the players' WPA from last year. However, I do mention that players expected to be near the top of the list are there.
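For the curious, the model was fit along these lines using the VGAM package (a rough sketch only; the variable names here are placeholders, not the actual StatDNA fields):

library(VGAM)

##proportional odds (cumulative logit) model with smooths for event location
##and game time; parallel = TRUE imposes the proportional odds restriction
fit <- vgam(ordered(outcome) ~ s(x) + s(y) + s(minute) + score.diff,
            family = cumulative(parallel = TRUE), data = events)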

The most interesting players to me were Wayne Rooney--who finished lower than one might expect--and the up and coming goalie Tim Krul. Given that I'm more of a baseball guy, I was pretty happy with the way these things turned out. A lot of people love Krul, and this analysis seems to support that love.

Anyway, go check it out over there. Below are some fun visualizations which you may find similar to my umpire heat maps or Fangraphs Win Expectancy graphs (which you'll find at the link as well). All in all it was a lot of fun, and I'd like to thank StatDNA for letting me get dirty with the data. If you are interested in soccer, I'd definitely suggest checking them out!





Thursday, September 1, 2011

Forthcoming Guest Post at StatDNA

For those few of you that frequent this blog, you've probably noticed a scarce amount of posting lately. I've been working on a number of things, including finishing my dissertation. My adviser tells me I need to learn how to say "No" when people ask me about working on new projects, but as of yet I have not learned this well enough. Unfortunately, this has meant saying "No" a bit more to blogging.

Nevertheless, one of the projects I was working on had to do with the StatDNA competition advertised here. Dave Allen and I had planned on having some fun and putting some things together (along with some possible guidance from Soccernomics author and new Michigan Sport Management arrival, Stefan Szymanski), but alas all of us were a bit crunched on time.

Because of that, I wrote up a simpler blog post on some fiddling I had been doing with the StatDNA data (which is pretty awesome). While it did not qualify as a contest entry, the StatDNA blog will be posting it up along with the contest entrants. I'll wait for them to post, but as a preview, it is the beginning of developing a sort of Wins Created metric while accounting for the spatial location of events in the game.

There is still much work to do--and this was only the preliminary model--but I found it a lot of fun and Jaeson Rosenfeld found it interesting enough to include on the blog. Once it is officially posted, I will be sure to link things here. Congratulations to the winner, Sarah Rudd, and her paper titled "Modeling Possessions in Soccer Using Markov Chains"...a paper that is likely way over my head. I look forward to reading it, though!

Tuesday, August 9, 2011

Clarifications About the JSM Poster

David Smith--from Revolutions--referred me to a criticism at Reddit regarding the poster my fellow grad student and I presented at JSM last week. This comment made me want to clarify what we are attempting in the analysis.

We ARE NOT attempting to find the most deserving players or the best players. We are attempting to use simple statistics to model the voting behavior and decision rules of those making the induction decisions. Many involved in baseball would argue that WAR is the best measure of overall player performance. I'd likely agree. But how many BBWAA voters make inductions based on that statistic (at least prior to, say, 2005)?

This is the idea we are presenting here: Hall of Fame voters are simplistic in nature when it comes to their voting. That doesn't mean they won't change, but it means that they will vote based on the information they have available. This likely includes Goals and Assists. We include Plus-Minus, but find it to be essentially useless in classification, which is probably a good thing: it shows that our model is making the decision rules correctly for this metric.

Now, I do think the thought about normalizing things like goals and assists is a valid one. It is something we are working on, but in baseball we have generally found that aggregate milestones are most predictive of Hall induction. For example, using ERA+ did not improve upon the model with ERA. I'm not saying that it's the best way to go, but it seems to be the way the decision rules are made. I will double check this version of the model for hockey, of course.

Lastly, there was concern over including All-Star games in the analysis. Because there are other reasons for voting a player into the Hall--for example, "integrity" is used specifically in the baseball induction requirements--the ASG totals are included in order to control for the popularity and general well-liked-ness (is that a word?) of a player. We do not include it simply because we think it's a great measure of the best players, and there is certainly noise when it comes to ASG participation. The same goes for Stanley Cup wins. But a player like Phil Rizzuto almost surely was inducted into the baseball HOF thanks to his appearance on so many World Series teams. It seems that some players are voted in based on their prominence in the media and on good teams. Again, I make no judgment as to whether or not that's the correct way to go.

I hope this clears up any confusion. Hopefully we will have a working version of the paper out in the coming months.

Monday, August 8, 2011

Request for Data (NHL Attendance)

This is a pleading, begging request for some help in collection of some data. I am working on a project looking at franchise-level hockey attendance for a chapter of my dissertation but for the life of me can't find certain years for certain teams. If anyone has the data below, I would be forever grateful to have your assistance. I need season-level attendance data by franchise.

I will even give you a mention in the acknowledgements of my dissertation so that you can live forever in print version in the dusty U of M Kinesiology dissertation library!

Anyway, below is what is needed. If you have anything, please let me know (bmmillsy AT umich DOT edu):

Boston Bruins: 1967-1971

Chicago Blackhawks: 1967-1972 and 1975-1983

Montreal Canadiens: 1967-1972 and 1975-1988

New York Rangers: 1967-1972 and 1975-1988

Toronto Maple Leafs: 1967-1972 and 1975-1987

And if you happen to run across it, any attendance data from before 1963, but that's not totally necessary (just always nice to have extra data). If anyone knows WHY these data are missing from just about everywhere possible, I'd also be interested in hearing that.

Thanks!

Friday, August 5, 2011

More on JSM

While my time at the 2011 Joint Statistical Meetings was short--I unfortunately missed some presentations I would have liked to attend--it was a great experience. The collection of academics and professionals is very different from the other conferences that I have attended (like Sport Management and Tourism conferences), and the interest in the methods themselves at JSM really forced me to be on my toes.

While there, I got the chance to put some faces with the names I have seen around the blogosphere. It was a pleasure to meet both Phil Birnbaum--of Sabermetric Research Blog--and David Smith--VP of Marketing at Revolution Analytics and author of the Revolutions Blog. David asked about sharing my poster (joint with fellow graduate student Steve Salaga) investigating Hockey Hall of Fame induction using the R package "randomForest". While 'machine learning' can sound intimidating to some, Random Forests are actually quite a simple method: bootstrap many classification trees, select variables at random for each split, and use the out-of-bag sample for each tree so that over-fitting is kept to a minimum. And what better way to implement it than with sports data!?!
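As a rough illustration of how simple this is to run in R (the data frame and column names below are made up for the example, not our actual data set):

library(randomForest)

##classify Hall of Fame induction from simple career statistics; each tree is
##grown on a bootstrap sample, and the out-of-bag observations act as a hold-out
rf <- randomForest(factor(inducted) ~ goals + assists + points + games + all.star.games,
                   data = skaters, ntree = 1000, importance = TRUE)

##which statistics drive the classification?
varImpPlot(rf)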

**********
As a side note, this is not the first time we have implemented randomForest for sports data. Steve and I have a forthcoming paper in the Journal of Quantitative Analysis in Sports identifying patterns in BBWAA voting for the Baseball Hall of Fame. Our paper is similar to a recent work by Frieman (2011) in the same journal, but we add pitchers and a discussion of exclusions based on race. As a whole, there does not seem to be any negative effect of being a minority when it comes to BBWAA voting--at least according to the method we use.
**********

So back to the Hockey Hall of Fame. For both this poster and the baseball paper, it is important to note that we are not attempting to gauge who should be in the Hall of Fame based on their performance as a player. Rather, we are attempting to gauge how well each player aligns with the views of the Hall of Fame Voting Committee and whether or not they were 'snubbed' based on how the committee would be predicted to vote. If the committee is terrible at gauging the best players, then our model will be as well. We are simply interested in the voting behavior and committee preferences, and not who the best players really are. This is an important distinction in attempting to find any exclusions based on qualitative variables like race or language, rather than attempting to rank the best players in the game.

We only include simple statistics--as we predict committee members to focus on these mostly--and goalies are not included in the analysis. Unfortunately, statistics for goalies are few and far between and the NHL has not kept Save Percentage for long enough to include in any worthwhile prediction model for goalies. Therefore, only skaters are included. We separate forwards and defensemen, but the only significant difference is the importance of Assists (they're higher for defensemen).

For example, classifying baseball player inductions on WAR or Win Shares gives us who probably should be the guys in the Hall based on their on-field performance. However, BBWAA voters do not necessarily use this metric when voting. Therefore, we want to train our data to what BBWAA voters do pay attention to. The same goes for hockey. The most important statistics for classifying players are what you would expect, and they are also presented using the Random Forest's "Variable Importance" metric.

This also allowed us to qualitatively evaluate the decision rule boundaries built by the forest and assess the possibility of certain players being discriminated against based on language. There is a line of (conflicting) economic literature--mostly in the 1980s and 1990s--that has made claims of language-based discrimination in the labor market for hockey, so we found the Hall of Fame voting to be another good test of this. Long story short, however, there does not seem to be anything systematic going on. But we leave that up to the reader, as we present each of the players near the boundaries of the decision rules from the forest.

For those interested in the full analysis, you'll have to wait for the paper. As always, there are further considerations for this sort of investigation, not the least of which include testing the RF algorithm against other classification techniques (like neural networks, discriminant analysis, simple classification trees, and others). We'll have to address those as well as other great comments from those that stopped by at the conference. However, a detailed summary of the current version is in THIS POSTER that we presented at JSM.

Thanks to all of those who stopped by. The conference was a great experience and I hope to return next year!

Friday, July 29, 2011

Joint Statistical Meetings in Miami

I am headed off to Miami for the 2011 Joint Statistical Meetings on Sunday. I'll have a poster to present with a fellow graduate student and look forward to experiencing a new conference with a very different bunch than I normally interact with professionally (though, closer to those I interact with online). If you're going to be attending, stop by the Section in Sports Contributed Poster Session and see our poster. The poster investigates Hockey Hall of Fame voting patterns (skaters only) and the possibility of language-based bias. Long story short is that we don't find much, but there is more to do and that does not necessarily mean nothing is happening.

While the meetings are for statistics in all disciplines, there is a lot on sports there. Phil Birnbaum will be presenting some of his findings with respect to race and strike calling (and there is an additional poster on the topic) and Shane Jensen will be giving a roundtable talk on fielding metrics. Check out the full sports program here.

Wednesday, July 20, 2011

Non-Sports Link: Libertarians and Progressives Can be Friends

A great article from the author of the Bleeding Heart Libertarians blog. I'll admit that there isn't a better word to describe my political views than "libertarian". I'm certainly not Milton Friedman or Jeffrey Miron--both of whom I admire and respect greatly--but I can't consider myself "Bleeding Heart" either. Maybe many of my issues are with the extreme--and unfortunately often uninformed--left-leaning folks I had to deal with at a very liberal undergraduate institution. But I often despair that when people think "libertarian" (often generalized to "Economists"), they think of someone with no values and little empathy. Matt Zwolinski does a great job of addressing this, and I especially like the following quote:


"These are my reasons for thinking that progressives should have greater confidence in free markets and civil society to realize their values, and less confidence in government regulation. But even if progressives are not convinced by that claim, I hope they are convinced by another one: namely, that political disagreement does not always, or even usually, imply an irreconcilable conflict of fundamental values. Progressives and libertarians should realize that they share many more values in common than they probably think, and that their different political prescriptions are less the product of an epic battle of good vs. evil and more a function of reasonable disagreement regarding how to prioritize and realize their common goals. Even if disagreement persists, bearing this point in mind should make that disagreement a more civil and productive one."

Libertarianism and moral values are not mutually exclusive. The economic prescriptions of a strictly libertarian viewpoint are an invaluable starting point on which to base policy. Once we have that cost-benefit analysis and an understanding of the efficiency of a free market, we must turn to the values of the society and find the best balance of both in order to foster both economic and societal growth. As Zwolinski says, "Good intentions, even when they exist, are not enough."

Link: http://dailycaller.com/2011/07/08/seven-reasons-progressives-should-be-more-libertarian/


Hat Tip: ECON Jeff Blog (see sidebar)

Thursday, July 14, 2011

Sam Fuld, Bob Carpenter, and Statistical Inference Blog

Here is a quick post responding to a request by Bob Carpenter at one of my favorite nerd blogs: Statistical Modeling, Causal Inference and Social Science. While a lot of the Bayesian theory is out of my league, Dr. Gelman really makes you think about some applied statistical problems in social science.

Anyway, the request was for a quick scatter plot (I'm not going to go nuts and pull out Bugs code for some Bayesian Hierarchical Model or anything like that here!) of batter performance and ability to foul balls off in given counts (I could also do base-out states, but I'll keep it simple for now).

Luckily, I had R up and running with my Pitch F/X database already loaded. Of course, a full analysis would require understanding where the pitches being fouled off are thrown (along with velocity and pitch type), but then it gets a bit complicated. Anyway, here we go. I'll start with a quick table of averages for the percentage of pitches fouled off in each count.
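Computing the table is only a line or two (a sketch; I'm assuming a data frame pitches with columns balls, strikes, and a pitch description field des):

##percentage of pitches fouled off in each count (balls x strikes)
foul.rate <- with(pitches, tapply(des == "Foul", list(balls, strikes), mean))
round(100 * foul.rate, 2)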

            0 Strikes    1 Strike    2 Strikes
0 Balls       10.37%      17.61%      19.20%
1 Ball        15.55%      20.46%      22.44%
2 Balls       15.39%      23.30%      26.00%
3 Balls        2.41%      21.48%      29.91%

From this, we can glean that guys don't foul the ball much in 3-0 counts. This could be because they see easier pitches to hit and/or they're taking the pitch very often. Probably a combination of both. Keep in mind that these numbers are also biased. We don't see the same batters the same number of times in these different counts. Now for foul percent plotted against wOBA:

If anything, there's a slight downward trend here (as found before at Baseball Analysts, linked in the post above). And finally, foul percentage plotted against wOBA for each count. Here, I removed outliers (defined as 2 standard deviations above the average foul rate), as they should mostly be players who did not get nearly enough at bats for the foul rates to matter. This didn't work perfectly and there are some obvious anomalies likely due to low plate appearance totals, but I think we get a decent look at things. Also, the lower censoring (at 0) makes it more difficult to pick up a pattern in the plots. In addition, the plot includes player-seasons, not just players. So someone like Pujols will be in here 4 times (2007 through 2010):


It might be instructive to look at these same plots only for pitches swung at (so players aren't penalized for being selective at the plate) and/or only on pitches near the edges of the strike zone (so we're just looking at pitches that the players are fighting off). The analysis here doesn't show too much going on, but that doesn't mean there's nothing there.

Below, I've done the latter, with the same plots from above. I define the edge as 8 inches from the center of the plate and/or below 1.8 feet or above 3.3 feet vertically. Of course, you can define the edge in a number of ways. This is rough, quick code and I didn't have time to get into too much detail today:
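For reference, that edge definition amounts to something like this (a sketch; px and pz are the standard Pitch F/X horizontal and vertical location fields, in feet):

##pitches on the 'edge': more than 8 inches off the center of the plate,
##or below 1.8 feet / above 3.3 feet vertically
edge <- subset(pitches, abs(px) > 8/12 | pz < 1.8 | pz > 3.3)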


Keep in mind this is only for Pitch F/X data. That means some of 2007, and all of the 2008 through 2010 regular seasons. I try to wait until the end of the season to update my database each year. I imagine this would be more interesting with even more years of data (like from Retrosheet, as mentioned in the linked blog post). I think Dan Turkenkopf is going to try this out, as he says in the comments. Perhaps I'll extend this later on to the swinging only as well.

Finally, one other thing to look at is whether pitchers really do get frustrated after a long string of foul balls and get burned throwing a pitch down the middle. There is probably some skill in between fouling pitches off and flat-out missing them, if only because a better batter likely makes contact more often. But in terms of purposefully trying to foul a pitch off--at least from my own experience playing baseball--I have doubts that guys go up there looking to 'spoil' pitches. To foul a pitch off, you have to make sure it doesn't hit the bat squarely, otherwise it would go into play. It's hard to believe that in and of itself would be a repeatable skill. By just edging the bat to the ball, you've got a good chance of missing it, too.

This is by no means a deep analysis, and I didn't do any sort of fantastic job at cleaning it up beforehand. Just some fun crosstabs and scatter plots.

Any thoughts from those of you reading this????

Tuesday, July 5, 2011

Forgot to Announce This

Though I'm late on this, I've been in the habit of announcing presentations of things I have been working on recently. At the WEAI conference, I am a co-author on two presentations (for one of which I put together the majority of the analysis). Unfortunately, I was unable to get funding for WEAI because I am attending a bunch of other conferences this summer, including the Joint Statistical Meetings in Miami at the beginning of August. Anyway, here are the recent presentations (they were given by Dr. Rodney Fort and Dr. Jason Winfree, respectively). You can get the full Western Economic Association International conference program right here.

Attendance Time Series and Outcome Uncertainty in the NBA, NFL, and NHL
Brian Mills and Rodney Fort

Discrimination Among MLB Umpires
Scott Tainsky, Brian Mills and Jason Winfree

The first paper simply looks at the long-run stationarity of attendance in the three leagues and assesses--at a very simple level--the influence of competitive balance (playoff, game and consecutive season uncertainty) on these attendance levels. This is part of my dissertation, and there are a number of issues to be dealt with (not the least being the censoring issue for NFL sellouts). I think this paper might bore most of the readers here--unless you're really into Lagrange Multiplier statistics for a unit root with breakpoints.

I imagine that the latter paper would be of more interest to those here. I can't divulge the entire paper (or much of it really), but we tend to find that there is very little going on in the strike-calling data with respect to umpire race. The data go back through 1996 (I think), and I update the study with some Pitch F/X analysis. There's much to do, though.

In addition to these recent presentations, my fellow graduate student Steve Salaga and I will be presenting on Language-Based Discrimination in NHL Hall of Fame Voting at the Joint Statistical Meetings. There is a whole section on sports statistics there, with a presentation by Shane Jensen on fielding metrics. It sounds like a lot of nerdy fun. For this paper, we implement a technique called Random Forests (spoiler alert, we don't find any evidence in the analysis of discriminatory behavior). This is a parallel analysis to our forthcoming paper on MLB Hall Voting Discrimination in the Journal of Quantitative Analysis in Sports. When I know the issue, I will link it here. If anyone is dying to read it, let me know.

Lastly, I would encourage anyone interested in sports statistics to attend the New England Symposium on Statistics in Sports. For those interested in soccer (futball, football), there is a soccer analytics competition being run by StatDNA. The winner gets a trip to the conference to present their paper and a $500 prize. I am currently working on some things with some people you may know, but I won't be mentioning anything until later on. It's been fun.

Okay, off to get some work done. Sorry that I have been somewhat MIA of late. Been really bogged down with a lot of different projects. Hope to get back to sab-R-metrics soon.

Wednesday, June 22, 2011

sab-R-metrics: Merging Data Sets

I am finally back from Greece and recovered from jet lag. Fortunately, I did not get tear gassed while in Athens, though there were riot police everywhere the whole time we visited. Today, I'm going to start getting my feet wet again with a shorter sab-R-metrics post to assure everyone I'm not too MIA.

Often times we have lots of data in different files that we want to link together. If you have the information in an SQL database, there are ways to match things up using R. However, I am no database management wizard and prefer to be able to look at my data in a full table format. Unfortunately, this causes problems when I want to make sure to have player names linked to the player ids in my Pitch F/X data. The issue is that the F/X data may have multiple instances or rows with the same player, while the player information file only has player ids and player names once (one per row). Doing this manually can take forever (sometimes almost literally), and we need a quick way to import player names to the correct rows. Pitch F/X tools like Joe Lefkowitz's already do this for you; however, if you have your own F/X database--or any other data with player ids that you would like to merge some data into--this tutorial should come in handy.

Luckily, R has a nice function, 'merge()', which allows for easy merging of files. While I used to use SPSS to do this, once I found the R version I knew I'd never go back. The SPSS version is pretty handy, but it is extremely slow for large files and the software is outrageously expensive.

First, I want you to download a file of 5,000 pitches here. Once you have it in the correct place, load it into R and take a look at it.

#set working directory
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

##load pitch file
pitches <- read.csv(file="PitchesMerging.csv", h=T)
head(pitches)

As you can see, there are no player names in this file. While you could go through and add them in manually--say in Excel or something like that--this would take way too long. To get an idea of the number of names to be imported just for this small pitch file, use the following code:

##give an idea of the amount of work that manually merging would take
length(pitches[,1])
length(unique(pitches$batter_id))
length(unique(pitches$pitcher_id))

The first line of code above tells us the number of rows in the data set--or the length of the first column in the data. This comes in handy for making sure R loaded the number of rows you expected to see. The second line again uses the 'length()' function, but adds a new function we have not seen yet: 'unique()'. This tells us how many different/unique batter ids there are in the data set. The third line does the same for pitcher ids. You can also use 'unique()' on its own, and R will print each of the player ids in the data file (you could also assign this list or vector to an object using the assignment operator '<-'), as in the short sketch below. The 'unique()' function will come in handy when we get into more advanced "for loops" later on.
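For example, here is a minimal sketch of storing the unique batter ids as a vector for later use (the object name 'batter.ids' is just for illustration):

##store the unique batter ids as an object and take a look at it
batter.ids <- unique(pitches$batter_id)
head(batter.ids)
length(batter.ids)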

As you can see, there are 286 unique batter ids and 113 unique pitcher ids, with plenty of repeats across the 5,000 observations in the data file. Doing this manually would take forever. Luckily, I have a file with player ids, player names, height and weight, birth dates, the first year played in pro ball, the first year in MLB, and the last year played in MLB. We'll use R to merge this into our pitch file so that we have player names and can account for the height and age of the player in analyses using the pitch data.

First, go ahead and download the file with player names and some other information here. Stick that into the same directory as the previous file and load it into R. As always, take a look at the file to make sure it loaded correctly:

##load player information file
players <- read.csv(file="detailedplayers.csv", h=T)
head(players)

Before doing any merging, we'll have to adjust some things in this file. For the 'merge()' function to work, you have to choose a variable that is contained in BOTH data sets to merge on. For our purposes, we'll use the id of the player. Unfortunately, the name of that variable is different in each file. This is an easy fix. While we're at it, it is probably a good idea to distinguish between the batter and pitcher names and information, since both will be displayed in each row. So, first things first: let's rename the variables. For this, we'll use another new function, 'colnames()'. The following code should rename everything the way we want, and we'll start by merging the new data for batters. Be sure not to omit the names of any columns or you will get an error:

##rename columns for batters
colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first", "b_mlb_played_first", "b_mlb_played_last")
head(players)

Always check to be sure things went correctly. There is actually an option in the 'merge()' function that handles clashing column names automatically: the "suffixes=" argument. On data sets with a large number of columns, this can save you time, but I found this to be a good time to introduce the "colnames()" function. A rough sketch of the "suffixes=" approach follows.
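As an aside only (not part of the tutorial flow): if you left the players file with its original column names and merged it on twice, "suffixes=" would tag the clashing batter and pitcher columns for you. The id column name "id" below is an assumption--check head(players) for the actual name in your file.

##sketch of the suffixes= option: merge the un-renamed players file on twice and let
##merge() tag the duplicate column names (".bat" for batter columns, ".pit" for pitcher columns)
merged <- merge(pitches, players, by.x="batter_id", by.y="id", all.x=T)
merged <- merge(merged, players, by.x="pitcher_id", by.y="id", all.x=T, suffixes=c(".bat", ".pit"))
head(merged)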

Now we have two files with a matching variable to merge on. It's time to use the 'merge()' function. The merge function asks first for an 'x' data set (the first one), and then a 'y' data set (the second one). It is important to remember the order you place them in the function, as you will also need to tell R that you want to keep all of the original pitches in this new merged data. To save space in R--once I know things are working right--I simply reassign the merged data set to the original name 'pitches'.

To ensure that R keeps every pitch in the file, we want to use the option "all.x=T" (or "all.y=T", depending on the order above). This tells R that the players data are just a lookup table for the pitch data, so all of the pitch data stay intact in the new merged table. Finally, we need to tell R which variable to match on using by="batter_id". Be sure to put the variable name in quotes. The following code should do this for us:

##do merge for batters
pitches <- merge(pitches, players, by="batter_id", all.x=T)
head(pitches)

Notice that the merge puts the "batter_id" variable in the first column of the new data set. That's okay, and you can always restructure your data if this bothers you. Now let's do the same for the pitchers in the pitch data. Don't forget to rename the variables in your player information table so that they don't overwrite the batter information, and so that it matches on pitcher id rather than batter id:

##rename columns for pitchers
colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first", "p_mlb_played_first", "p_mlb_played_last")
head(players)

##do merge for pitchers
pitches <- merge(pitches, players, by="pitcher_id", all.x=T)
head(pitches)
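
With both merges done, one scripted way to spot-check things is to pull the new name columns next to the at bat description for a given row (a minimal sketch; the column names are the ones we created above):

##spot-check the first row: merged names next to the at bat description
pitches[1, c("b_first", "b_last", "p_first", "p_last", "ab_des")]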

Now, looking at the data, my first row has the 69 inch, 180 pound Dustin Pedroia facing the lanky 72 inch, 160 pound Miguel Batista. For this pitch, Pedroia gets a hit. You can double-check that the players are correct by looking at the "ab_des" column, which gives a full description of what happened in the at bat. Sure enough, it says, "Dustin Pedroia singles on a line drive to left fielder Ryan Langerhans. J. Drew to 2nd." Things seem to have gone well here. Now you can save the new file, so you don't have to worry about merging again, with the following code:

##write new table
write.csv(pitches, file="mergedpitches.csv", row.names=F)
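
If you want to be extra careful, a quick way to confirm the save worked is to read the file back in and compare its dimensions to the merged data (a minimal check; the object name 'check' is just for illustration):

##read the saved file back in and make sure the dimensions match
check <- read.csv(file="mergedpitches.csv", h=T)
dim(check)
dim(pitches)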

Hopefully this will help out some of those looking to merge data together. This kind of linking is needed constantly with the different data sets around the web (Pitch F/X, Retrosheet, Baseball Reference, etc.), and for that you'll need a full mapping of all the player ids. I got mine from the Universal ID Project; here is a link at The Book Blog to last year's version (I can't find the most recent link).
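Just to illustrate (the file name "idmap.csv" and its column names are made up), a one-row-per-player crosswalk mapping one id system to another can be merged on in exactly the same way:

##hypothetical example: attach a second set of ids (e.g., Retrosheet ids) from a crosswalk file
idmap <- read.csv(file="idmap.csv", h=T)
colnames(idmap) <- c("batter_id", "b_retro_id")
pitches <- merge(pitches, idmap, by="batter_id", all.x=T)
head(pitches)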

In the end, R's functionality here is better than that of any other program I have come across. You always need to double-check the merged data to make sure nothing went wrong, and that is especially true with larger data sets. Ultimately, this can make life in R and baseball analytics about a million times easier--just be careful. There are a few things I didn't go over here (like the automatic sorting 'merge()' does on the matching variable), so you can always check out how to use the function yourself with the R command "help(merge)". Hope this helps!

Pretty R Code:
#############################
####Sidetrack for Merging of Data Tables
#############################

#set working directory
setwd("c:/Users/Millsy/Documents/My Dropbox/Blog Stuff/sab-R-metrics")

##load pitch file
pitches <- read.csv(file="PitchesMerging.csv", h=T)
head(pitches)

##give an idea of the amount of work that manually merging would take
length(pitches[,1])
length(unique(pitches$batter_id))
length(unique(pitches$pitcher_id))

##load player information file
players <- read.csv(file="detailedplayers.csv", h=T)
head(players)

##rename columns for batters
colnames(players) <- c("batter_id", "b_first", "b_last", "b_height", "b_weight", "b_birth_year", "b_pro_played_first", "b_mlb_played_first", "b_mlb_played_last")
head(players)

##do merge for batters
pitches <- merge(pitches, players, by="batter_id", all.x=T)
head(pitches)

##rename columns for pitchers
colnames(players) <- c("pitcher_id", "p_first", "p_last", "p_height", "p_weight", "p_birth_year", "p_pro_played_first", "p_mlb_played_first", "p_mlb_played_last")
head(players)

##do merge for pitchers
pitches <- merge(pitches, players, by="pitcher_id", all.x=T)
head(pitches)

##write new table
write.csv(pitches, file="mergedpitches.csv", row.names=F)

Tuesday, June 7, 2011

Off to Greece

After being back in the U.S. for two days, I'm headed off to Athens tomorrow for another conference. It is rather small, but I couldn't pass up the chance for some practice at presenting publicly and, well, going to Greece! Though, I was relatively impressed with the Richmond Street nightlife in London, Ontario.

If you're in the area (not likely, but I know there are some international readers here), stop by. It will be at the St. George Lycabettus Hotel in Athens (they really know how to do it in Greece!). The conference is put on by ATINER and the general topic is Tourism. I am again presenting with Dr. Mark Rosentraub. Below is the title of the presentation:

Measuring the Local Economic Benefits of Regional Assets: Opportunity Costs and the Best Use of Land for Regional Development

That also means there probably won't be any sab-R-metrics articles up until after I get back (I'll return on June 16th). Hopefully I can get on a roll after that, as I only have one more conference to go to in the summer (Joint Statistical Meetings in August in Miami Beach--the Sport Sections are highly recommended for you sports guys).