Thursday, September 29, 2011

Crediting the Rise of "Data Science" to Sabermetrics

As a graduate student in Sport Management, Statistics and Economics, I am quite interested in the emerging "Data Scientist" profession. My current programming skills are mostly limited to statistical programming in R, Stata and SPSS (I am trying to dabble more in SAS and Matlab). I wish I had more skills with Python, C, SQL, Perl, Access and the like in order to scrape data myself more efficiently. I can do some basic SQL queries and read a Perl script to understand *what* it's doing, but starting from scratch with these things would require a bit more free time than I have right now.

I could really become more efficient in my R programming (something I continue to work on), and given the popularity of SAS outside of academia, it would be good to get familiar with advanced programming there. Unfortunately, I have never had a formal computer programming class. Most of my statistical programming has come from my own fiddling and from learning statistics in classes here at Michigan. Don't get me wrong. I think I have a relatively unique and useful skill set, but there's always lots to learn, and there are many other people out there exhibiting skills that I just don't have. And definitions of "data scientist" often include significant database management ability. I have some skills here, but they are nowhere near those of a formally trained computer scientist or IT/data architect.

Anyway, the point of this post is to redirect readers to this presentation by Harlan Harris, who talks about what "data science" really is. Why link it here? Well, on the final page, Harris says the following:

"Sabermetrics was a trigger for widespread growth. Demonstrated wider applicability of stats methods, and drew attention from business."

A pretty strong quote, and one that I do agree with in some sense. Interestingly, sports have been one of the slowest industries to adapt to these changes in technology and the ability to get into data. Harris suggests here, I think, that other businesses caught on to sabermetrics before those the analysis was directed toward did. Pretty interesting stuff! I think the combination of open source programming and the rise of blogging was the real culprit here. However, sabermetrics provided talented people with a way to apply data science to something fun and interesting. In this sense, it made it easy to communicate stories about the usefulness of data analysis in everyday business decisions.

So here's my question to those doing analysis with sports data: Would you consider yourself a "data scientist"? And if so, do you feel that full-on "hacking" skills are required to consider oneself as such? Certainly they're a plus, but can two heads (a stat-based person and a Perl-to-SQL scraper) come together and both be data scientists? Leave me something in the comments if you'd like!

Friday, September 23, 2011

IJSF Sports Economics Research Rankings

A recent paper by Jose Manuel Sanchez Santos and Pablo Castellanos Garcia in the International Journal of Sport Finance puts forth rankings of Sports Economics papers and Sports Economists. They create an index for this ranking (please refer to the paper if you're interested). Of course, there are lots of familiar names on there, but what I wanted to highlight here was the dominance (in a self-interested light, of course) of the University of Michigan Sport Management Program in the field of Sports Economics, Sport Finance and Development. Based on the rankings, we have the #1 (Stefan Szymanski), #3 (Rodney Fort), #27 (Mark Rosentraub) and #57 (Jason Winfree) academic sports economists in the world. They are all within the department. Quite a powerhouse we have here :-)

The University of Alberta comes in with Brad Humphreys (#4) and Dan Mason (#7), but they are technically in different departments there. I have had the pleasure of meeting Dr. Mason as well as another ranked economist in the paper, Joel Maxcy (who is now at Temple). I am happy to say that I have had some email contact with both Young Hoon Lee (who has helped me immensely in the econometrics programming in my dissertation) as well as JC Bradbury.

Other familiar names abound on the list, and I look forward to meeting #21 Andrew Zimbalist in November when he comes to speak about Title IX. These rankings are always a fun exercise, but they aren't necessarily any sort of end-all list of the 'best' researchers out there. However, I think there is little doubt that this is a headquarters for sports economics. Each of the professors listed above is very different, which gives us great diversity as well.

I have benefitted immensely from the structure of the department here at Michigan (as well as other departments). Much of this was luck, as I arrived at the right time when serious evolution of the faculty and program was taking place. There is no doubt that--for the quantitatively and economically inclined sport fan--this is the place to be. For those interested in other aspects of Sport Management, we have some pretty powerful faculty as well. It's really been quite a thrill to rub elbows with many of those on this list, and it's been an honor to study here in the department for going on 5 years!

Friday, September 9, 2011

Fail Post: Failure in Baseball Knowledge

A couple weeks ago on the plane back to Ann Arbor, I decided to open up Sky Mall and found the following:


I actually laughed out loud on the plane. Let's treat this as a Highlights Magazine game where you circle all the things wrong with this picture. You'd think that a well-known company like Steiner could do a little more research before putting this joke of an ad in a magazine.

Let's begin with the heading for this area: Future Stars. No complaints about Troy Tulowitzki, and Austin Jackson is reasonable. But Tulo isn't a star of the future, he's a star now. Chase Headley pushes the limit of naming someone a "Future Superstar". But I could live with that.

Answer key below:


Rick Porcello, Tigers Ace? Nearly 37-year-old R.A. Dickey a future star? Jeff Francoeur, future star and ultimate clutch hitter? Hmmmm.

Wednesday, September 7, 2011

Link to StatDNA Guest Post

The post is officially up on the StatDNA blog. Go check it out.

As I said in my previous post, this is a very rough and preliminary model. This is why my work was not any sort of formal entry, just some fun with some great data.

I used a Vector Generalized Additive Proportional Odds Model to evaluate the change in win probability for each event listed in the StatDNA data, given the spatial location and time left in the game (as well as the score). Things turned out pretty well for this rough version, and the WPA rankings are pretty close to what the EA Sports Index reports at the EPL website. Because I haven't finished the model, I won't release all of the players' WPA from last year. However, I do mention that players expected to be near the top of the list are there.
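For anyone curious what "change in win probability" means mechanically, here is a minimal sketch in Python. It is not my actual model: the cutpoints and the `eta` values are made-up numbers, and a single number `eta` stands in for the smooth functions of location, time and score that the real model estimates. It only shows how a proportional odds setup turns a predictor into loss/draw/win probabilities, and how the win probability added by an event is the difference before and after it.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probs(eta, cutpoints=(-0.5, 0.5)):
    """Proportional odds model for an ordered outcome (loss < draw < win).

    P(Y <= j) = logistic(theta_j - eta), where theta_j are the cutpoints.
    The cutpoints here are hypothetical, not fitted values.
    """
    theta_loss, theta_draw = cutpoints
    p_loss = logistic(theta_loss - eta)
    p_loss_or_draw = logistic(theta_draw - eta)
    return {
        "loss": p_loss,
        "draw": p_loss_or_draw - p_loss,
        "win": 1.0 - p_loss_or_draw,
    }


def win_prob_added(eta_before, eta_after, cutpoints=(-0.5, 0.5)):
    """Change in win probability from the game state before an event to after it."""
    before = outcome_probs(eta_before, cutpoints)["win"]
    after = outcome_probs(eta_after, cutpoints)["win"]
    return after - before


if __name__ == "__main__":
    # Hypothetical event that moves the predictor from 0.0 to 0.8
    print(outcome_probs(0.0))
    print(win_prob_added(0.0, 0.8))
```

Summing these differences over every event a player is involved in would give a season-long WPA total, which is the kind of number the rankings are built from.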

The most interesting players to me were Wayne Rooney--who finished lower than one might expect--and the up-and-coming goalie Tim Krul. Given that I'm more of a baseball guy, I was pretty happy with the way these things turned out. A lot of people love Krul, and this analysis seems to support that love.

Anyway, go check it out over there. Below are some fun visualizations which you may find similar to my umpire heat maps or Fangraphs Win Expectancy graphs (which you'll find at the link as well). All in all it was a lot of fun, and I'd like to thank StatDNA for letting me get dirty with the data. If you are interested in soccer, I'd definitely suggest checking them out!





Thursday, September 1, 2011

Forthcoming Guest Post at StatDNA

For those few of you who frequent this blog, you've probably noticed how scarce the posting has been lately. I've been working on a number of things, including finishing my dissertation. My adviser tells me I need to learn how to say "No" when people ask me about working on new projects, but as of yet I have not learned this well enough. Unfortunately, this has meant saying "No" a bit more to blogging.

Nevertheless, one of the projects I was working on had to do with the StatDNA competition advertised here. Dave Allen and I had planned on having some fun and putting some things together (along with some possible guidance from Soccernomics author and new Michigan Sport Management arrival, Stefan Szymanski), but alas all of us were a bit crunched on time.

Because of that, I wrote up a simpler blog post on some fiddling I had been doing with the StatDNA data (which is pretty awesome). While it did not qualify as a contest entry, the StatDNA blog will be posting it along with the contest entrants. I'll wait for them to post, but as a preview, it is the beginning of developing a sort of Wins Created metric while accounting for the spatial location of events in the game.

There is still much work to do--and this was only the preliminary model--but I found it a lot of fun and Jaeson Rosenfeld found it interesting enough to include on the blog. Once it is officially posted, I will be sure to link things here. Congratulations to the winner, Sarah Rudd, and her paper titled "Modeling Possessions in Soccer Using Markov Chains"...a paper that is likely way over my head. I look forward to reading it, though!