Tuesday, October 8, 2013

Sab-R-Metrics Data Links

For those of you still visiting this website to practice your R skills using my Sab-R-Metrics series, I want to let you know that the data sets are no longer available at the links provided in each article.  I have moved my personal website off of the University of Michigan servers (finally) now that I am at Florida, and that site had hosted the data sets for the posts.  Unfortunately, I do not have time to go back through and re-link all of the data.

Nearly all of the data I used should be publicly accessible and you can modify the code for your own data formats.  If you are dying to get a hold of the original data files, please send me an email or note in this post what you are looking for.  I will see what I can do on a person by person basis.

While posting is sparse at this blog these days, it is still something I would like to get back to in the future.  Don't think I have completely abandoned my post.

Thursday, July 18, 2013

Advanced sab-R-metrics: Parallelization with the 'mgcv' Package

Carson Sievert (creator of the really neat pitchRx package) and Steamer Projections posed a question about reasonable run times of the mgcv package on large data in R yesterday, and I promised my Pitch F/X friends I would post here with a quick tip on speeding things up.  Most of this came from Google searching, so I do not claim to be the original creator of the code used below.

A while back, I posted on using Generalized Additive Models instead of the basic loess() function in R to fit Pitch F/X data, specifically with a binary dependent variable (for example, probability of some pitch being called a strike).  Something went weird with the gam package, so I switched over to the mgcv package, which has now provided the basis of analysis for my 2 most recent academic publications.  I like to fancy this as the best way to work with binary F/X data--but I am biased.  Note that the advantages of the mgcv package can also be leveraged in fitting other data distributions besides the binomial.  This includes the negative binomial distribution, which can be more helpful for count data that are overdispersed with lots of zeros (probably most of the other, non-binary data we want to model in baseball pitches).

One advantage of mgcv is that it uses generalized cross-validation in order to estimate the smoothing parameter.  Why is this important?  Well, because we have different sample sizes when making comparisons--for example, across umpire strike zones--and we also have different variances, we might not want to fit each one the same.  Additionally, smoothing by looking at the plot until it "looks good" can create biases.  Therefore, this allows a more objective way to fit the data.  I also like the ability to fit interactions of the vertical and horizontal location variables.  If you fit them separately and additively, you end up missing out on some of the asymmetries of the strike zone.  These ultimately tend to be pretty important, with the zone tipping down and away from the batter (See the Appendix for comparison code, see below for picture of tilt; figure from this paper).

One thing that I did note on Twitter is that for the binary models, a larger sample size tends to be necessary.  My rule of thumb--by no means a hard line--is that about 5,000 pitches are necessary to make a reasonable model of the strike zone.  This is close to the bare minimum without having things go haywire and look funky like the example below, but depending on the data you might be able to fit a model with fewer observations.

Also, if you know a good way to integrate regression to the mean in some sort of Bayesian fashion in these models, that might help (simply some weighted average of all umpire calls and the pitches called by Umpire X that does not have enough experience yet).
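One very simple version of that weighted average, offered only as a sketch and not as a proper Bayesian model, shrinks an umpire's predicted strike probability toward the all-umpire prediction, with a weight that grows with the number of pitches he has called.  The function name shrink_to_league() and the prior strength n0 = 5000 (borrowed from the rule of thumb above) are assumptions made up for illustration.

```r
# Hypothetical helper: blend an umpire-specific strike probability with the
# all-umpire probability, weighting by the umpire's sample size.
#   p_ump    : predicted strike probability from the umpire's own model
#   p_league : predicted strike probability from the all-umpire model
#   n_ump    : number of called pitches observed for this umpire
#   n0       : assumed "prior strength" in pitches (a modeling choice)
shrink_to_league <- function(p_ump, p_league, n_ump, n0 = 5000) {
  w <- n_ump / (n_ump + n0)   # weight placed on the umpire's own data
  w * p_ump + (1 - w) * p_league
}

# An umpire with only 1,000 called pitches is pulled most of the way
# toward the all-umpire estimate
shrink_to_league(p_ump = 0.70, p_league = 0.50, n_ump = 1000)
```

With n_ump equal to n0 the two estimates get equal weight, so n0 controls how quickly an umpire "earns" his own strike zone.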

Because R tends to work on a single thread, instead of using all the cores on your computer, the models can become rather cumbersome.  Believe me, I know.  For a while, I was fitting models with 1.3 million pitches, 125 dummy fixed effects, and some 30 other control variables at a time for this paper.  It took anywhere from 1-3 hours, depending on whether my computer felt happy that day--and I kept forgetting to include a variable here, change something there, etc.

OK, so parallelization.  It's actually incredibly easy in the mgcv package.  You first want to know if your computer has multiple cores, and if so, how many.  You can do this through the code below (note that I first load all the necessary packages for what I want to do):

###load libraries
library(mgcv)
library(parallel)
###see if you have multiple cores
detectCores()
###indicate number of cores used for parallel processing
if (detectCores()>1) {
  cl <- makeCluster(detectCores()-1)
} else cl <- NULL
###print the cluster object to see how many cores it uses
cl

The 'cl' object is the cluster that mgcv will use, and printing it tells you how many cores you will be using.  Note that this leaves one of your cores ready for processing other things.  You can use all of them, but it could end up keeping you from being able to do anything else on your computer while your model is running.  You can also use fewer.  Simply change '-1' to '-2' in the makeCluster() line, or whatever you want to do.  From here, mgcv has a single argument for using multiple cores.  You'll want to pass 'cl' as the cluster to use.
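As a small sketch of the bookkeeping around the cluster (not part of the original code), the snippet below wraps the core count in a helper and shuts the workers down when the modeling is finished.  The helper name n_workers() and its 'reserve' argument are hypothetical; detectCores(), makeCluster() and stopCluster() are from the parallel package.

```r
library(parallel)

# Hypothetical helper: number of workers to start, leaving `reserve` cores free
n_workers <- function(reserve = 1) {
  total <- detectCores()
  if (is.na(total)) total <- 1   # detectCores() can return NA on some systems
  max(total - reserve, 1)
}

workers <- n_workers(reserve = 1)
cl <- if (workers > 1) makeCluster(workers) else NULL

# ... fit your models here, passing cluster=cl to bam() ...

# release the worker processes once you are done with them
if (!is.null(cl)) stopCluster(cl)
```

Stopping the cluster matters mostly in long sessions, where forgotten worker processes keep holding on to memory.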

One should also note that, in R, large data sets and massive matrix inversions take up a significant amount of RAM.  When I came to Florida I had to convince our IT people that I needed at least 32 GB of RAM, specifically to run the models in the paper linked above.  Running the single model got me up to 8-10 GB, while doing multiple models in a single instance in R subsequently maxed me out at around 28 GB before I closed R and opened another instance.  This is a limitation that can be addressed to some extent with mgcv, but if you're not running every single pitch available in the F/X database, you probably won't have to worry about this. 

In case you do, mgcv also has a nice option that breaks the data up into chunks and has a much lower memory footprint.  It is called bam() and works just as the gam() function does, but allows analysis on larger data sets when you have more limited memory by breaking it into chunks.  The help file claims that it can work much faster on its own in addition to saving memory.  And--most relevant to this post--this is the function that includes the option to parallelize your analyses.  The code is exactly the same, with the extra 'cluster' argument set to our 'cl' defined above.  Note that I use the combined smooth and limit the degrees of freedom of the smooth to 50.  Those are, of course, choices of the modeler and dependent on the type of data you are analyzing.

###fit your model
strikeFit1 <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp", data=called, 
   family=binomial(link="logit"), cluster=cl)

Boom.  That's it.  You can also consider fitting smooths based on handedness.  You can do one for each type of batter by breaking up the data and the modeling, or you can do the following below:

###fit model while controlling for batter handedness
strikeFit2 <- bam(strike_call ~ s(px, pz, by=factor(batter_stand), k=51) + factor(batter_stand), 
   method="GCV.Cp", data=called, family=binomial(link="logit"), cluster=cl)

And of course you can add covariates to your model that you want to estimate parametrically, such as the impact of count or pitch type:

###fit model controlling for batter handedness, count, and pitch type
strikeFit3 <- bam(strike_call ~ s(px, pz, by=factor(batter_stand), k=51) + factor(batter_stand) + 
   factor(pitch_type) + factor(BScount), method="GCV.Cp", data=called, family=binomial(link="logit"), cluster=cl)

With the model, creating figures is as easy as using the predict() function and using code as I have shown here before.  And, thanks to Carson, much of the figure production is now automated in the pitchRx package.
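For readers without the earlier posts at hand, here is a minimal sketch of the predict() step.  It runs on simulated data rather than real Pitch F/X pitches, and the "true" zone centered at (0, 2.5), the grid limits, and the object names are all assumptions made up for illustration.

```r
library(mgcv)

set.seed(1)
# simulated stand-in for called pitches (NOT real Pitch F/X data)
n <- 5000
called <- data.frame(px = runif(n, -2, 2), pz = runif(n, 0.5, 4.5))
# assumed "true" zone: strike probability falls off away from (0, 2.5)
p_true <- plogis(3 - 2 * (called$px / 0.83)^2 - 2 * (called$pz - 2.5)^2)
called$strike_call <- rbinom(n, 1, p_true)

strikeFit <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp", data=called,
   family=binomial(link="logit"))

# predict the strike probability over a regular grid of locations
xs <- seq(-2, 2, length.out = 50)
zs <- seq(0.5, 4.5, length.out = 50)
grid <- expand.grid(px = xs, pz = zs)
grid$p_strike <- as.numeric(predict(strikeFit, newdata = grid, type = "response"))

# draw the 50% contour, one common way to outline the called zone
contour(xs, zs, matrix(grid$p_strike, nrow = length(xs)), levels = 0.5,
   xlab = "Horizontal location (ft)", ylab = "Vertical location (ft)")
```

Swapping the simulated data frame for your own called pitches, and adding cluster=cl, gets you back to the models fit above.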

Note that much of my reading about this package comes from an excellent book by its creator, Simon Wood, called Generalized Additive Models: An Introduction with R.  If these models are interesting to you, this is a must have resource.

Appendix: The reason I use the interaction term is that the UBRE score is significantly better by doing so, as suggested in the previously cited text.  The code to compare the two models is also included below.  Note that your variable names and data name may differ, so change accordingly:
###Model with separate (additive) smooths
fit.add <- bam(strike_call ~ s(px, k=51) + s(pz, k=51), method="GCV.Cp", data=called, 
   family=binomial(link="logit"), cluster=cl)
###Model with combined smooth
fit.int <- bam(strike_call ~ s(px, pz, k=51), method="GCV.Cp", 
   data=called, family=binomial(link="logit"), cluster=cl)
###combined smooth UBRE score is lower
fit.add$gcv.ubre
fit.int$gcv.ubre
###compare models with an analysis of deviance test
anova(fit.add, fit.int, test="Chisq")

Friday, July 5, 2013

What is Big Data, anyway?

My graduate school advisor, Rod Fort, posed the question in this post's title on Twitter today.  I gave answers and, as he usually does, he made me think more about my answers and their precision.  Technically, what I was trying to get across was that the use of Big Data, in most cases, is terribly imprecise.  I should have been able to explain the use of the term quickly, but it took a while and a number of "well, we've always done that" from Rod.  It is thrown around a lot, and in most cases not in any meaningful way.  I got a similar reaction to my mention of a prospective certificate in Complex Systems while at Michigan (which I did not pursue--mainly because my mathematical background wasn't strong enough and I had time constraints pursuing other things).

So, assuming we want to separate the use of "Big Data" from "Analytics", I think we can amply sum up the term with the following:

Big Data describes the relationship between the ability to collect data, and the ability to do something with it.  Data is BIG at the margin at which one more unit of data would leave us unable to analyze it all with the given technological capability.

This leaves Big Data flexible for the given tool.  Growth in the ability to collect and store large amounts of data has outpaced the ability to do anything meaningful with it.  This isn't anything new, in the same way that dynamic pricing isn't really a new idea, just a new implementation.  In the same way that analytics aren't new, just a clear recognition of the integration between statisticians, programmers, and managers in the use of the term today.  In the same way that Moneyball, the idea, isn't new.  All tend to improve over time, just as in any field.

When it comes to analysis of Big Data--not the term big data itself--the holy grail is to have the ability to push a button and have the answer delivered directly to the decision maker, what I called "streamlining" on Twitter.  But this isn't Big Data itself (and it's really a fantasy at its extreme).  Certainly we can get closer to this, but data changes, behavior changes, the world changes.  These will always have to be updated, and in many ways I don't know that Big Data and Analytics as terms are completely separable.  In this case, though, let's be specific:

Analytics is the pursuit of simplistic, streamlined statistical information in a context understandable to the decision maker.

Again, unless we believe the movie Paycheck, this won't be 100% possible.  But the fantasy idea is that the computer and its data will tell us the answer to everything.  I enjoy this quote from the Big Data article linked above:  

"May 2012 danah boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communications, and Society...(3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.”

However, we can do things to ease the use of large amounts of information to make decisions.  This requires the cooperation of statisticians, programmers, and managers.  Managers need to pose the problem in a way that is tractable and understandable.  Statisticians need to know the best methods for the distribution and variability in some given set of information, and be able to communicate this back to the manager.  The programmer needs to be able to collect the data accurately--possibly with the help of many other experts--and deliver the methodology in a way that can happen in real time or as close to it as possible.  In many ways, Dynamic Pricing is an outcome of these things--but not a new idea.  Big Data commentaries and discussions are referring to closing the gap between availability and implementation.

Tuesday, June 25, 2013

Dynamic Pricing and Sports Business Models

I just began to follow the Sports Analytics Blog on Twitter.  I click through every now and again.  Interestingly, they have a multi-part piece going up currently about dynamic pricing.  However, I have some qualms with the apparent misunderstanding of business models in certain sports.  Therefore, I figured I would go ahead and use this blog for something and write a short blurb about it.

First, let's talk about dynamic pricing.  The term itself, I guess, is rather new and comes along with technology and data management improvements.  However, the idea is not new, and much of the roots of its theory come from economics.  I am going to talk about dynamic pricing here without using the words "analytics" or "moneyball", because those words just get thrown around all over the place.

Dynamic pricing has its roots in economic theory on price discrimination and product differentiation.  Price discrimination discusses the ability of a firm to charge different prices to different people based on their willingness to pay.  However, this used to be a big problem for firms, as you can't just ask someone how much they want to pay for a product when they come up to the register (they'll tell you they value it at a penny, so the theory goes).  This is what car salespeople try to do when you buy a car: get you to signal to them how much you are willing to pay.  Once you do, you are toast.

But it is important to remember that price discrimination is not necessarily a bad thing (well, the European Union would like you to think so).  From a neutral standpoint--where we as good little economists do not favor the consumer or the firm--price discrimination is more efficient.  The firm no longer has to choose a single price to charge to maximize profits.  It can charge more to those willing to pay more, and less to those willing to pay less.  This probably isn't good for those who will pay more for the product, but now--at least down to marginal cost--those people that could not buy the product at the higher, single price can now purchase it.  Just as with redistribution, some are worse off and some are better off.  But if the cost of figuring out who is willing to pay what is low enough, this is much more efficient than taxation and redistribution.
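A toy calculation makes the efficiency point concrete.  The five willingness-to-pay values and the marginal cost below are invented numbers, and the second case assumes the firm can charge each buyer exactly what they are willing to pay, which is the textbook extreme rather than anything a real firm achieves.

```r
# hypothetical willingness to pay for five consumers, and marginal cost
wtp <- c(90, 60, 50, 30, 20)
mc  <- 10

# single price: everyone with willingness to pay at or above p buys
profit_single <- function(p) sum(wtp >= p) * (p - mc)
candidates <- sort(unique(wtp))
profits <- sapply(candidates, profit_single)
best_price     <- candidates[which.max(profits)]
buyers_single  <- sum(wtp >= best_price)
profit_at_best <- max(profits)

# perfect price discrimination: each consumer pays their own willingness
# to pay, so everyone valuing the product above marginal cost is served
buyers_discrim <- sum(wtp >= mc)
profit_discrim <- sum(pmax(wtp - mc, 0))

c(best_price = best_price, buyers_single = buyers_single,
  profit_single = profit_at_best, buyers_discrim = buyers_discrim,
  profit_discrim = profit_discrim)
```

In this made-up market the single-price firm leaves the two lowest-value consumers out entirely, while the discriminating firm sells to all five.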

Product differentiation is one way in which a firm can charge different prices to different types of consumers without asking them, but it technically differs in that it uses multiple products instead of different prices for a single product.  In this case, the firm varies the type of product it sells.  An easy story is IBM and its printers.  Simply, IBM had two types of printers, each for a different price.  One printed fast.  For the other, IBM actually put a special chip in it specifically to slow down the printing speed, despite it being the same exact printer.  They charged more for the faster printers, and allowed the consumers to sort themselves out by differentiating these products and letting the consumer decide how much printing they need.  Yes, the slow-down chip cost a few marginal pennies for IBM, and they still went out of their way to charge less for this printer (about half, from what I can gather).  If they had sold the better printer to everyone for cheaper (so more people could buy them), they would not have maximized profits.  Additionally, more people were happy, as they could buy the cheaper, lower level printer for their basic home needs.

Interestingly, there is built-in product differentiation when buying tickets to athletic events.  Do you think the Yankees front row box seats are the same product as the upper level bleachers?  What about a Yankees game against the Red Sox versus a game against the Royals?  Games on Saturday and games on Monday?  Teams have been doing this for a while.  Even season and single game tickets are different products.  And the dichotomy of these latter two products is the key to various business models in different leagues.

One of the important aspects of sport is the fixed supply: you can't make more seats if people want them, and you can't reduce costs by taking them away.  The stadium is built, the seats are there.  If you are not going to sell out a game, you may as well give out your tickets for free at the last minute.  If you are selling out, you can increase prices to suck up some surplus (or increase concession and merchandise prices inside the stadium--remember this is a joint maximization problem for the team).  This way, you can get concession revenues and make the stadium full (which may be more fun for the fans anyway).  Additionally, you can also charge more for in-arena/stadium advertising because you have more eyes on the ads.  In fact, teams do this.  The Detroit Pistons were handing out tickets in droves (to local youth groups, etc.) a day before game time in order to continue their sellout streak.  Maybe they should have called it a "give out" streak.  But it doesn't matter how they got into the stadium to those putting billboards inside, and the team's marginal cost of having more people in the stadium is essentially zero (give or take a few extra staff, maybe).

So where does the term "dynamic pricing" come from?  Well, this simply comes from real-time price discrimination and product differentiation.  The key here is "real-time."  Without the real time inclusion, then we're back to boring old economics.

Hotels and airlines are notorious for real-time price changes.  They know that, depending on the time of day and amount of time before the day of your stay/flight that you make your purchase, they can figure out your likely willingness to pay for a given room or seat.  Of course, this is a rough estimate, but it turns out that they are very good at this.  They do this with data they collect on all of their past sales.

More recently, teams have been delving into the use of these real time pricing models to--as the theory goes--simultaneously increase profits and allow more people into the stadium.  Those that wouldn't spend $50 on a Yankees ticket now get to buy it for $40.  Those that really wanted to attend that game bought it way in advance, and perhaps paid $60 for the same seat.  The key to the MLB business model here is that its ticket sales depend heavily on single game tickets.  The fact that there are many tickets to sell for each game--and they are not all sold preseason--allows MLB to do this.  NBA and NHL also have the ability to do this to some extent, though probably not to that of MLB.  I also heard that recently some college football teams are doing this (South Florida), but not the ones that sell pretty much all of their tickets preseason (i.e. Michigan and Florida--note that they do price discriminate through donations and student tickets).

OK, so back to Sports Analytics Blog.  What irked me a bit was this article, which gives the impression that NFL is not currently using dynamic pricing and is therefore making a poor business decision.  They use the example of the team losing a high profile player, which is fine: the Patriots games are arguably a very different product without Tom Brady on the field.  So far so good.  But if Brady is injured midseason, the team cannot suddenly change its prices!  NFL is almost a completely season ticket league (or tickets for single games purchased preseason).  Therefore, the real-time changes within the season aren't tractable, unless you are the Jacksonville Jaguars.

But, Brian, they could just hang on to them and sell them throughout the season (or keep prices super high early on), you say!?!?!

Well you are correct.  But in a short season like NFL, with huge revenue dependencies on television contracts and selling out games, the uncertainty involved in doing that may outweigh any benefit they get from pricing.  The short season allows for only 8 chances to get things right.  They let fans take the risk on purchasing the tickets and possibly seeing a down season.  That's a reasonable business decision, given the broadcasting structure and short season with lots of uncertainty (many times, pretty good teams don't make the playoffs when you have a 16 game season--think about who would make the playoffs if MLB was only 16 games).  Fans can sell off the tickets on the secondary market later if they decide they won't make the game.

To be fair, the SAB article puts things in terms of losing a player to free agency, etc.  But if that is the case, we're really not talking about anything "dynamic."  We are back to pretty basic pricing decisions.  And let's also remember this ignores economic theory in general.  Economic theory on sports says that teams choose their talent level before the season begins (with short term adjustments), based on what they know they can charge to their fans for a team with X wins (i.e. their objective function, assumed to be profit maximization).  Obviously these short term adjustments are important--like losing a player to free agency--but this isn't really dynamic.  It's just pricing.*

Now with that said, I certainly agree that teams should be considering these short term changes in talent levels if they can do so at a minimum cost (this seems likely).  But the timing of these decisions and the timing of ticket purchases in NFL as a whole would result in a relatively low use of a full-on dynamic pricing model.  The blogger at SAB, who I am sure is a sharp business person, seems to have a slight misunderstanding of dynamic pricing as real-time pricing, and of the business structure of NFL.  That doesn't mean teams shouldn't fully consider their business decisions with the information at hand.  They absolutely should!  But it doesn't make them poor businessmen for not being as quick to adopt these techniques.**

Lots can be said about sports pricing, and there is plenty of research to be done.  For now, I'll leave you with some good reading on pricing (these are gated, sorry).  Note that this is an extremely limited list of papers, and there is plenty more out there to be read (including basic texts on pricing in fixed supply industries like sports, entertainment, hotels, and airlines).

Berri, D. & Krautmann, A. (2007).  Can we find it at the concessions?  Understanding price elasticity in professional sports.  Journal of Sports Economics, 8, 183-191.

Salaga, S. & Winfree, J. (2013).  Determinants of secondary market sales prices for National Football League personal seat licenses and season ticket rights.  Journal of Sports Economics, DOI: 10.1177/1527002513477662.

Fort, R. (2004).  Sports pricing.  Managerial and Decision Economics, 25, 87-94.

Soebbing, B. & Humphreys, B. (2012).  A test of monopoly price dispersion under demand uncertainty.  Economics Letters, 114, 304-307.

*In my brief experience working with a sports ticket sales department and their analytics, there is plenty of room for improvement here.  I still can't understand how cold calling people to purchase game tickets actually works.  Has anyone ever gotten a call from a representative at their local pro sports team and been persuaded to go ahead and buy those tickets for this weekend's game?

**Now I do ignore a few interesting things about sports pricing (or at least just gloss over them).  First, note that usually some sort of monopolistic firm is required for this tactic.  Otherwise, firms will bid each other down to cost.  At the very least, there would need to be differentiation of these competing products for price discrimination to happen.  Both of these conditions likely hold in pro sports.  Second, there is a secondary market for tickets.  This is an important consideration for teams, as fans selling tickets to other fans for more money could cut into some of the additional revenues that those fans would otherwise spend inside the stadium.  Thirdly, I ignore directly addressing the inelastic pricing of tickets across many sports.  Remember that there is a joint maximization problem (parking, merchandise, concessions, BEER), not just maximization of gate revenue--most likely dynamic pricing could be used in some sense for these other considerations in the NFL.  I also ignore the use of price dispersion, which could be an important tool for teams (especially in the data collection phase of willingness to pay in given situations).  Finally, there are interesting applications of luxury products and keeping prices high (i.e. Yankee front row box seats) and reference prices (prices that a consumer uses as a baseline for "high" or "low" price for a given product).

Tuesday, May 21, 2013

Graduate Student Research Assistant Position in Sports Economics

I am currently looking to fill an opening for a graduate student research assistant here at the University of Florida beginning in the Fall semester of 2014.  The student should have interest in topics relating to Sports Economics as well as a strong quantitative background.  The position will include tuition remission and a stipend renewable for up to four years.

The deadline for applying to this position is March 1, 2014.

For more information on this position and how to apply, please see this flyer.

Thursday, May 9, 2013

Revisiting Umpire Discrimination: New Paper at JSE

Two colleagues (Scott Tainsky and Jason Winfree) and I have a new paper just posted online at the Journal of Sports Economics.  We revisit the findings of Parsons et al. from 2011 (though, the working version of their paper caught press much earlier than this).  The paper was rather controversial and claimed important influences of umpires on game outcomes based on race.

Our paper uses a different data set and looks to replicate the findings from the original AER paper.  We were able to replicate the original findings from their provided data and code, but find that odd uses of fixed effects are at the root of some of the findings.  A large majority of the paper looks at the robustness of the results, and implements Pitch F/X data to empirically derive the edge of the strike zone.  At best, the results initially presented in AER are mixed based on our analysis and re-analysis.

One thing to note is that the main interest of the Parsons et al. paper was not baseball.  The point was that detecting discrimination could be influenced by others that impact the performance of those of a given race (i.e. umpires in this context).  This point is still well taken, and makes up the most important contribution.  In fact, this is why the paper was published in the prestigious journal American Economic Review.

The link directly to the paper and abstract is below.  Unfortunately it is gated.  However, I am going to double-check my rights for including a link on my personal page (usually OK, but journals can sometimes be a pain on this issue).  If you have access, feel free to send along questions or comments to my email address or leave them in the comments.  Please make these comments and/or criticism constructive.


Saturday, May 4, 2013

Times Change, Or Why Steroids Don't Ruin Baseball for Me

Just a list of links without commentary other than this: I honestly don't care about the Steroid Debate beyond making clear how stupid it is.

Mantle Corked His Bat (insert asterisk here, right...right?)

Athletes Have Gotten Better, Mostly Without Steroids (imagine that!)

The Hall of Fame is Biased (well, I never!)

Friday, April 12, 2013

Brawling Costs Teams Money

I honestly don't know where to begin with the stupidity involved in this:


This idiocy cost the Dodgers a whole lot of money.  If I were running the team, I would at the very least seek legal counsel in order to evaluate the chance of getting some of Greinke's contract dollars back.  Yes, MLB contracts are guaranteed.  But he was injured because he assaulted someone.  Unpaid suspensions happen for PED users, so there must be some way to reconcile this.  Has there been any precedent to this sort of thing?  I don't see this behavior as assumed risk on the part of the Dodgers, though I guess one could argue that their supervisors (i.e. Mattingly) could have prevented it.

Greinke claims he did not mean to hit him.  Sure.  The catcher was set up outside and Greinke is a Cy Young contender.  Spots don't get missed like that.

Let's not forget the impact on the individual 2-1 game itself.  San Diego tied it up in that inning.  While it ended up in favor of the Dodgers, that's not something I want my players flirting with after spending over $200 million on them.

Tuesday, March 5, 2013

Refs Complicit in Fighting?

To begin, I'm not much of a hockey fan.  I just don't get it.  That doesn't mean it's not entertaining to you, to the entire population of cold-weather living people, that these guys aren't incredible athletes, or that it has no value.  It just means I don't enjoy watching it.  I watched it plenty as a kid, going to 5-10 Capitals games a year for a number of years.  I think it is an extremely interesting league from the standpoint of my academic interests.  But I never really enjoyed watching.  People say the same to me about baseball.  That's fine, it's not for everyone.  So you can take my following comments with a grain of salt if you like, or as blatant ignorance of what goes on in the sport on the ice.

Despite the idea that hockey has attempted to get rid of fighting, it is obvious to me that this is pure theater by Bettman and the owners.  In fact, given the video below, I suspect there is an explicit instruction to referees to not actually break up the fights until someone hits the ice.  (Hat Tip to Charlie Brown for the video)


These guys took up their boxing positions with everyone watching, including fellow players and referees, and not a single person bothered to try and separate them or stand between them.  In fact, the referee takes the initiative to pick up the debris (stick, etc.) and get it out of the way for when they throw down.  There is little question in my mind about the referees' complicity in these events on the ice, and I would not be surprised if they were given explicit instructions to let these things play out for the entertainment of the fans.  There is not a real safety concern for the refs or the players in breaking these two up as they are standing 10 feet from each other in their boxing poses.  This one looks almost to the point that it was staged.

McGinn had a broken orbital bone, likely having to do with his face-plant into the ice.  That is not a minor injury.   Not even close.  I know it has been said before, but if this happened in the stands someone would be on their way to prison.  This on-ice fight is no more acceptable to me than the video below, though I imagine there is more outrage there than the hockey fight.  At least in the baseball game, everyone didn't stand around and look the other way for a full 30 seconds while the batter/runner punched the pitcher in the face (Hat Tip to Tangotiger for the video below).


Note that I have the same issue with throwing at batters.  For a long time, I loved Pedro Martinez as a player, but after his many escapades with throwing at batters (not just throwing inside, but his throwing AT them and then talking about it) I no longer had any interest.  I feel the same way about Cole Hamels after the Bryce Harper beaning.  I don't think Selig did enough.  Hamels should have been suspended for the season.

Congress chastises leagues for PED use (particularly baseball for whatever stupid reasons they may have).  But why don't authorities bother with these sorts of incidents, where the league (with questionable antitrust status) is complicit in injuring its employees?  Assumed risk does not include violent assaults in any profession (and I would argue that this even includes boxing and MMA).

Let's make a comparison.  Wikipedia reports that the rate of reported aggravated assaults yearly in Detroit, Michigan is about 0.19% (1,334 assaults per 713,000 or so people).  Detroit isn't exactly a peachy place to live, in terms of crime.  In fact, the shortage of police there is becoming a huge problem.  Some calls take hours before an officer arrives at the scene.  In Los Angeles, a safer city but also a place where violent gang crime has been a serious issue in the past, there are 230 aggravated assaults per 3.84 million people.  That's a rate of .006%.

From Hockey Fight Statistics, in the 2011-2012 season (the lowest fight penalty rate since 06-07), there were 546 fights.  Give or take 700 total NHL players in a given season, we have a rate of about 78%.  That is an aggravated assault rate of roughly 417 times the rate in the city of Detroit.  It is 13,000 times the rate in Los Angeles.

Incentives tend to work.  If you are caught breaking people's skulls in Detroit--even given the lack of police force there--you go to jail.  Same goes for LA.  The incentives against fighting in the NHL (and hitting batters in MLB) are laughable, at best.

**Note: Yes, there are probably differences in the severity of crimes that are reported in Detroit and LA, versus all "fight penalties" in hockey.  But even assuming that unreported aggravated assault in these cities is ten times what is reported, and assuming that only a quarter of NHL fights would be up to the standards of aggravated assault, the differences are still astonishing to me.
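
As a rough sketch, the comparison above can be reproduced in R from the figures quoted in the post.  The rates are left unrounded, so the multiples can differ slightly from ones computed with rounded percentages.  The adjustment at the end is one assumed reading of the note (city rates scaled up tenfold, only a quarter of NHL fights counted), not necessarily the exact calculation intended.

```r
# Reported figures quoted in the post
detroit_rate <- 1334 / 713000    # aggravated assaults per resident (Detroit)
la_rate      <- 230 / 3840000    # aggravated assaults per resident (Los Angeles)
nhl_rate     <- 546 / 700        # fights per player, 2011-2012 season

# NHL rate as a multiple of each city rate (unrounded inputs)
nhl_vs_detroit <- nhl_rate / detroit_rate
nhl_vs_la      <- nhl_rate / la_rate

# Assumed reading of the note: multiply the city rates by ten to allow for
# unreported assaults, and count only a quarter of NHL fights
adj_vs_detroit <- (0.25 * nhl_rate) / (10 * detroit_rate)
adj_vs_la      <- (0.25 * nhl_rate) / (10 * la_rate)

round(c(detroit = nhl_vs_detroit, la = nhl_vs_la,
        adj_detroit = adj_vs_detroit, adj_la = adj_vs_la), 1)
```

Even under those assumed adjustments, the NHL figure stays above ten times the Detroit rate.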

Tuesday, February 12, 2013

Employment Bias Toward Athleticism

Something I have always suspected happening in labor markets does, in fact, seem to be happening: hiring managers tend to give a premium to those signalling athletic ability or sport participation.  The paper reporting these results, by Dan-Olof Rooth, is linked below:


I think this is extremely interesting (and mirrors the studies that randomized "African American sounding names" on resumes, finding bias there).  Of course, there are different signals sent with athleticism vs. the sound of names.  A name isn't likely to signal much, maybe skin color, which we all know is not a valid way to exclude someone from a job.

However, athleticism could signal something else: motivation and time management.  My undergraduate thesis (unpublished) found that student athletes felt (self-reported) much better about their time management skills than non-athletes.  This could be a useful signal for someone hiring a prospective employee. 

Secondly, those who participate in sports tend to be more active and have more energy than those who do not.  These would also seem to be desirable traits for an employer.

Lastly, being athletic could signal motivation or initiative from the person applying.  This is similar to participating in a club or being president of the young business leaders organization at your university.  I don't know that athletics would give an advantage above and beyond something like this, but it would seem to be at least a useful signal about involvement and social skills.  Team sports are social, and can provide opportunity to grow just as other clubs do.

All of these things are difficult to observe in an interview, so using sport participation as an implicit signal can be useful both for the employee to relay this information, and for the employer to get a bit more information about the prospective hire.  Of course, there is also the possibility of overt bias toward playing football or some other sport at a large university that the employer is a fan of.  This would not be a valid way to make a hire, but I suspect it does happen.  There is always a "buddy network" influencing many areas.

This is why I tend to always put on my CV or Resume that I participated in college athletics and currently continue to play softball and golf.  While it says nothing about my skills as an academic researcher (leaving aside the fact that I research sport), I suspect that at worst it will do nothing for me and at best make the employer slightly more interested.

What say you?

Wednesday, January 23, 2013

Data Science as a Fad?

In many ways, I didn't want to give this Forbes article a link, since it derides the idea of using data in the same ways that seemed to create the (admittedly somewhat imaginary) "scouts-vs.-stats" divide.  There is, of course, vast relevance of data science to management.  I think the article is a bit unfair to the self-created discipline, so please keep that in mind.

However, I also think there are some important points to remember.  Data scientists already engulfed in the management and operations of a given industry are invaluable.  However, data scientists with little understanding of the problems and the practical solutions specific to that industry can be dangerous.  I think this is a nice passage:

"Davenport and Patil declare that “Data scientists’ most basic, universal skill is the ability to write code.” With this pronouncement, data science fails the smell test at the very outset. For how many legitimate scientific fields is coding the most fundamental skill? The most fundamental skill for any scientist is of course mastery of a canonical body of knowledge that includes laws, definitions, postulates, theorems, proofs, and descriptions of unsolved problems. Scientists are therefore characterized by mastery of a body of knowledge, not a collection of methods. What is this body of knowledge for data science? Davenport and Patil admit there is none.

The job of scientists is to conduct independent research, contribute to a body of knowledge, and improve professional practice, while adhering to a recognized standard of conduct. Coding is a tool that facilitates some of these objectives, but is a substitute for none of them."

This point rings true in many cases.  I find myself falling into a "methods trap" in my academic work sometimes (though I try to get out of it as quickly as possible).  I know how to use R, though I am not a programmer or database manager.  I know a number of methods from statistics and econometrics.  I can turn these tools into something pretty neat.  But, I sometimes make the mistake of thinking that this is enough for researching some phenomenon.

...Then I try and write the Intro and Discussion for my paper.  Ouch.  This is amazingly difficult without reaffirming that body of knowledge about the problem at hand in the first place.  Methods and very cool visuals communicate answers.  But they are the tool to do so, not the answers themselves.  A point well-taken from the article.

Thursday, January 10, 2013

Factor Analysis with the HOF Voting

A really fun post here, which I ran across at R-Bloggers.  This is a different take from a lot of the stuff I have seen on voting.  Enjoy!

Also, Max Marchi seems to be contributing (non-sports stuff) to R-Bloggers as well. I did not know this until today.

Tuesday, January 8, 2013

Ordering of Series

A new OnlineFirst paper has come out in the Journal of Sports Economics on the ordering of 3-game series by Alex Krumer.  Have not read fully through the paper, but I am interested in seeing what is found.  Figured it would be of some interest to those visiting this site that have access to JSE.

More Power Laws

Looks like someone took DeVany (2007) and extended it to management literature and things outside of baseball (although they don't cite him).  For a refutation of that initial paper on these power laws in sport, I'll have to link to my co-author, Jason Winfree, and his co-author, John DiNardo (http://www-personal.umich.edu/~jdinardo/lawsofgenius.pdf).

Now, before I go on, I want to be fair to the authors.  They plainly state that they are looking at performance, not innate ability.  It's observational.  I think the question is whether performance is actually what is of interest to researchers when doing management research, or if it is ability (moderated by effort and peers) that management is ultimately interested in.  I tend to think it is the latter, though there are reasons for understanding the former (namely, the link between the two).  So some of my criticisms come from interest in measurement of ability, rather than observed performance data.

I do acknowledge that this likely took plenty of time and effort to go through.  And they do seem to have consulted Wayne Winston on some of the work (noted in the acknowledgements).  Therefore, this post is not saying the authors are lazy, stupid, ignorant, or anything in between.

Let's begin (and note that I'm feeling all Birnbaum-y here).

First, NPR states that this is new research.  It really is not, despite the fact that most of their background citations are from before 1980.  This is an issue that has been discussed at length, but I'll let DiNardo and Winfree do the literature review. 


Issue #1 is that they use claims that everything is normal as the justification for their paper.  But this would seem to be a straw man.  Why would they expect count data (specifically, low counts) bounded at 0 to follow a normal distribution?  I'm not sure anyone would try to assume that individual academic publications (essentially Poisson with lambda = 2 as shown by their tables, perhaps somewhat overdispersed) would be normally distributed, would they?  But they use this to test for normality of performance (actually, this is the case for most of their measurements).

I think a lot of work discussing normal distributions that they seem to be interested in--and the strawman-ish rationale for this paper--probably conflates the Central Limit Theorem with normally distributed populations.  The CLT does not posit that everything is Gaussian, though some have probably said this in their past academic work, and this is often taught incorrectly in introductory statistics courses.  If the authors are using this as the basis for their article, then they seem to be wasting space in what looks to be a good journal (based on impact factor).

So what is the CLT?  Using the mean (average), for example, the CLT posits that the distribution of sample means from random samples of a population will be approximately normal as the sample size grows (assuming the population is not some weird distribution with infinite variance, etc.).  So I'm not sure why they chose to compare individual scores to a normal distribution, rather than the means of a bunch of samples of those individuals.

They should have taken their (admittedly, very awesome) data and done a quick random sampling using R or something.  Take the mean of each sample they take, and then build a distribution of those sample means.  THEN, they should test for normality.  That way, we test the applicability of the CLT to the given data, rather than testing the data to be from a distribution where the CLT won't apply.  I think this gives them a much stronger hypothesis to base their tests on.

But here is the most disappointing part: They don't even test any distributions besides Gaussian and Paretian on the raw individual data.  They should also be testing the Poisson and Negative Binomial (or any number of other distributions), not just Gaussian and Paretian, if the raw data is really what they're interested in.  I imagine that there is some other distribution that fits this data just as well as, or better than, the power law.  Or maybe not, but at least use a reasonable test.  A test only for normality on this type of data, in my opinion, is not a reasonable comparison.  Their test is the equivalent to saying, "Well, Barry Bonds's batting average is closer to .500 than .000, so we can conclude that he is a .500 career hitter."  That kind of logic doesn't fly in my book.
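
A minimal sketch of what testing against other count distributions could look like in R is below.  The data here are simulated stand-ins (the vector `pubs` is hypothetical, not the authors' data), and comparing AIC values from `fitdistr` in the MASS package is just one of several reasonable ways to compare fits.

```r
library(MASS)  # provides fitdistr(); MASS ships with standard R installs

set.seed(42)

# Hypothetical stand-in for raw counts (e.g., publications per researcher),
# simulated as overdispersed counts purely for illustration
pubs <- rnbinom(2000, size = 1.5, mu = 2)

# Fit two candidate count distributions by maximum likelihood
fit_pois <- fitdistr(pubs, "Poisson")
fit_nb   <- fitdistr(pubs, "negative binomial")

# Lower AIC indicates the better-fitting candidate
aic_pois <- AIC(fit_pois)
aic_nb   <- AIC(fit_nb)
c(poisson = aic_pois, negbin = aic_nb)
```

With real data, the same comparison could be extended to whatever other candidate distributions seem plausible for the measure at hand.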

I truly hope these authors don't think they are refuting the application of the CLT (I don't think they do, but the importance of infinite variance is that the CLT won't apply in that case).  If their implication is that "everything has infinite variance", then I guess the implication is that we can't run any statistical tests.  But they have not provided sufficient evidence for that here.  They did show that the raw data probably aren't normal, but any relatively informed person with an intro statistics course could have told you that, and this seems to be inappropriate for a good journal unless it is full of uninformed papers.

We can use R to show the CLT to be the case for the Poisson with the following (extremely simple) script.  All this does is take 10,000 random Poisson (lambda = 2) draws and calculate the mean, and it repeats this 5,000 times.  Note that we don't need 10,000 draws, nor do we need 5,000 samples to show this.  But we have the computing power so why not.  Then we plot it with a histogram and qqplot to see if it looks normal.  The Shapiro-Wilk test is simply a formal way to test the normality (not a test I like to use much, but it exists so why not).


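# Start with an empty vector to collect the sample means
distPOIS <- NULL

for(i in 1:5000) {
    # Draw 10,000 Poisson (lambda = 2) values and store their mean
    sampy <- rpois(10000, 2)
    distPOIS <- c(distPOIS, mean(sampy))
}

# Check the distribution of sample means for normality
hist(distPOIS)
qqnorm(distPOIS)
qqline(distPOIS)
shapiro.test(distPOIS)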

Of course, this assumes the data are Poisson.  Given the variance parameters they have for academic publication (in the tables), there seems to be some overdispersion in some areas and underdispersion in others.  However, they don't present an overall mean and variance for all publications, which by eyeballing look like they could be pretty close to Poisson (mean = variance).

In the overdispersion case, we could use the negative binomial (or perhaps geometric) and rework our variable.  Of course, it is difficult to operationalize the likelihood of getting into a journal (and this is not the same for each person), number of attempts, etc., so that's why I stuck to Poisson here.

BUT, since they have the raw data, they can just sample from that anyway so we have no reason to bother assuming a distribution.  We simply need to know if it conforms to our statistical tests that are based on the CLT.
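
Sampling from the raw data directly might look something like the sketch below.  The vector `raw_perf` is a simulated, hypothetical placeholder for the authors' raw performance data, and the sample size and number of samples are arbitrary choices for illustration.

```r
set.seed(1)

# Hypothetical placeholder for the raw individual performance data
raw_perf <- rnbinom(50000, size = 1.5, mu = 2)

n_samples   <- 2000   # number of random samples to draw
sample_size <- 500    # individuals per sample

# Mean of each random sample drawn (with replacement) from the raw data
sample_means <- replicate(n_samples,
                          mean(sample(raw_perf, sample_size, replace = TRUE)))

# Check whether the sample means look normal
hist(sample_means)
qqnorm(sample_means)
qqline(sample_means)
shapiro.test(sample_means)   # requires between 3 and 5000 values
```

No distribution is assumed for the raw data in the resampling step itself; the simulated `raw_perf` only exists so the sketch runs on its own.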

Issue #2: One thing that seems to be conflated here is the actual distribution of performance if all people were participating in a given profession with the observed performance of those actually in the profession.  This is a point of contention between many sabermetricians and the work of DeVany, if I remember correctly.

Anyway, it seems to me that this paper chose some additional biased samples to evaluate.  The distribution of talent itself in any given profession is not likely to be normally distributed, let alone the performance relative to those who selected in.  There is selection into that occupation based on ability, especially so in those highly compensated based on observable performance.  There is also a minimum wage, which keeps us from seeing the far, far left of the distribution in the U.S. even in the lowest skilled jobs.  Nonetheless, even if we could see this, experience has a way of morphing the distribution and job title tends to mean some jump to the next occupation.

We also have to remember there is a bare minimum in performance allowed before getting fired (related to the minimum wage).  If we have shirkers, or if there is little chance of promotion, economic theory would predict more employees to hang around doing just enough to get paid.  But they don't really choose these sorts of jobs (and explicitly state that they choose heavily performance-based pay jobs for this reason), so that's a minor quibble.

Issue #3 comes from the operationalization of their variables.  For example, using Academy Award Nominations has a number of problems.  This is similar to using the MVP to measure the distribution in talent in baseball (and these relate to Issue #1 directly).  These are rank-based.  Ranks are messy in this way.  We would have to expect some high random variation across acting performances for a "good" actor and "bad" actor to expect the former to be considered the "best" actor at any given point.  In other words, you could have a perfectly normal distribution of acting performances, and no error in individual performance (completely deterministic), and the same exact person will get every single Academy Award every year.  That seems like a strange way to test for normality.  The distribution of these awards is almost certainly not normal, and we don't need to resort to a power law test to know that.
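
The rank-based point can be illustrated with a small simulation.  Everything here is a hypothetical setup (1,000 actors, 50 award years, one award per year going to the single best performance), not a model of how the Academy actually votes.

```r
set.seed(7)

n_actors <- 1000
n_years  <- 50
skill    <- rnorm(n_actors)   # normally distributed, fixed ability

# Each year the award goes to the best performance, where
# performance = skill + year-to-year noise with standard deviation noise_sd
simulate_awards <- function(skill, n_years, noise_sd) {
  winners <- replicate(n_years,
                       which.max(skill + rnorm(length(skill), sd = noise_sd)))
  tabulate(winners, nbins = length(skill))   # award count per actor
}

awards_det   <- simulate_awards(skill, n_years, noise_sd = 0)    # deterministic
awards_noisy <- simulate_awards(skill, n_years, noise_sd = 0.5)  # some randomness

table(awards_det)     # one actor holds every award; everyone else has zero
table(awards_noisy)   # awards spread over a handful of top actors
```

In both cases the underlying skill is normal by construction, yet the award counts are nowhere near normal, which is a property of the ranking rule rather than of ability.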

Also, I'm willing to bet there is a momentum factor with Academy Award nominations, and winning an award puts that person in the eyes of the voters more often.  Therefore, all else equal, they are more likely to win the award again (my guess, though that's an empirical question).  In other words, each successive award is not independent of the other.  So this isn't a variable I would use to gauge performance in the first place.

Issue #4 is that they're using relative performance as a measure (touched on in #3).  This is an abstraction that, admittedly, could be off due to my limited expertise in the subject.  But it's not something we think about much, so I am open to comments on this.

In something like baseball, performance outcomes are invariably based on relative skill.  They are not piecemeal (but the Schmidt & Hunter (1983) paper they cite as part of their rationale actually does test piecemeal work!).  In this way, we can think of two variables.  The first is batter skill.  The second is pitcher skill.  These two skills are independent of one another.  The performance, however, is not independent of either of these skills.  We may be able to say that performance outcomes of batters are independent of other batters, so let's do that to simplify.

Even if we do, we cannot ignore the structure of the variable of measured baseball performance in MLB.  If we have two random variables of Batter Skill (X) and Pitcher Skill (Y) that we assume, innately, are normally distributed (and independent), then the observed outcome is not X, it is Z.

The problem with Z, if calculated as a ratio of two normal random variables for example (PLEASE SEE***), is that we don't know what the distribution might be (maybe Cauchy distributed, which has a tendency toward outliers just based on how we operationalized it?).  But this is in measured outcomes--based on sample selection bias to boot--not in ability.  Perhaps some strange structure of Z is driving some of the result, but I'm not sure this is all that useful.

***Keep in mind this is an over-simplification of the performance measure.  It is likely something more complicated than Z = X/Y, which means it might be some other distribution.  But beyond what I have stated here, I don't have the expertise to comment.  And my interpretation here could also be incorrect.  The point is simply that, depending on how you define your performance variable, you could be creating something unwieldy.  Perhaps that is an important lesson, but not the one they try to get at in the paper.
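
For the simplified Z = X/Y case, a quick simulation shows how heavy the tails can get.  This sketch assumes X and Y are independent standard normals (mean zero, equal variance), which is an assumption made here for illustration; in that special case the ratio follows a standard Cauchy distribution.  Real skill measures would not be centered at zero, so the exact shape would differ.

```r
set.seed(123)

n <- 100000
x <- rnorm(n)   # "batter skill", assumed standard normal
y <- rnorm(n)   # "pitcher skill", assumed standard normal and independent
z <- x / y      # simplified performance measure

# Compare simulated quantiles of Z with standard Cauchy quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(simulated = quantile(z, probs), cauchy = qcauchy(probs))

# The most extreme value of Z dwarfs the most extreme input value
max_abs_x <- max(abs(x))
max_abs_z <- max(abs(z))
c(max_abs_x = max_abs_x, max_abs_z = max_abs_z)
```

The outliers in Z come entirely from how the variable is constructed, since neither input has heavy tails.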

Issue #5 is that with actors and actresses, the independent skill level is, again, not measured.  In fact, performance itself is not independent here.  Better actors/actresses are more likely to be paired with better writers and better directors.  When they are judged on their performance, there is an additive or multiplicative effect.  A great actor in a poorly written movie with a terrible director is much less likely to receive acclaim than a great actor in a masterfully written and well-directed movie.  So, you get this power distribution stemming from measuring this outcome, not from measuring ability.  These high skill people tend to cluster together to make outcomes different from the skill distribution.  Lest we forget that there are lots of starving actor wannabes that are probably terrible, when most of us decided long ago we wouldn't bother being an actor because we suck at it (again, selection bias here).  That is not to say that someone out there who is not an actor couldn't act better than the starving actor moonlighting as a bartender.  We just don't observe their acting performance, and they don't team up with other talented people in the biz.

Issue #6: They use EPL Yellow Cards as a measure for negative performance.  Those who are fans of soccer know that yellow cards can occur from strategic behavior.

They also use MLB career errors (by individual player) without accounting for play time as far as I can tell.  This is a big time "huh?" moment in my mind.  Even if the outcomes of this strange variable follow this distribution, it doesn't mean they're unexpectedly worse than everyone.  It means that they're awfully good at something else to keep them around to make those errors (i.e. an excellent hitter).  It is likely that many players could fill the "error void" in the distribution had they only been better hitters.  I haven't read in too much detail about all of the measures here, but this stuck out to me.

Issue #7: The authors explicitly note their "ambitious goal" to refute the idea that performance is normal (assuming that claim is still up in the air to begin with).  But they proceed as if having so much data makes the ambitious goal reachable on its own.  This is a fallacy many people make.  More data is generally very good to have.  But if you're not running the most useful tests on that data, then it may as well be small data.

I am sure there is more here, but I've used up enough time.  Seems to me that this is another attempt at a "sexy" paper, rather than one that actually tests the distribution of the data.  If they had done all this and at least tested against other possible distributions of the data, then I would probably say "interesting".  But the leap from "not normal" to "power law" is a tough one to swallow when there is nothing about the in-between. Certainly, z-scores (apparently their use in performance data) can be useful for non-normal distributions without infinite variance.  So why not make this clear?

Friday, January 4, 2013

Pitch Recognition and Neuroscience

My wife is a behavioral neuroscientist/biopsychologist (yes she is way smarter than me) and she ran across this neat paper that she forwarded my way.  I thought it would be of interest to those who still visit this website.  I will try to give some thoughts later, though I don't know much about neuroscience having only an undergraduate degree in psychology.