Now, before I go on, I want to be fair to the authors. They plainly state that they are looking at performance, not innate ability. It's observational. I think the question is whether performance is actually what is of interest to researchers when doing management research, or if it is ability (moderated by effort and peers) that management is ultimately interested in. I tend to think it is the latter, though there are reasons for understanding the former (namely, the link between the two). So some of my criticisms come from interest in measurement of ability, rather than observed performance data.
I do acknowledge that this likely took plenty of time and effort to go through. And they do seem to have consulted Wayne Winston on some of the work (noted in the acknowledgements). Therefore, this post is not saying the authors are lazy, stupid, ignorant, or anything in between.
Let's begin (and note that I'm feeling all Birnbaum-y here).
First, NPR states that this is new research. It really is not, despite the fact that most of their background citations are from before 1980. This is an issue that has been discussed at length, but I'll let DiNardo and Winfree do the literature review.
THE MAIN ISSUE
Issue #1 is that they use claims that everything is normal as the justification for their paper. But this would seem to be a straw man. Why would they expect count data (specifically, low counts) bounded at 0 to follow a normal distribution? I'm not sure anyone would assume that individual academic publication counts (essentially Poisson with lambda = 2 as shown by their tables, perhaps somewhat overdispersed) would be normally distributed, would they? But this is the benchmark they use to test for normality of performance (actually, this is the case for most of their measurements).
I think a lot of work discussing normal distributions that they seem to be interested in--and the strawman-ish rationale for this paper--probably conflates the Central Limit Theorem with normally distributed populations. The CLT does not posit that everything is Gaussian, though some have probably said this in their past academic work, and this is often taught incorrectly in introductory statistics courses. If the authors are using this as the basis for their article, then they seem to be wasting space in what looks to be a good journal (based on impact factor).
So what is the CLT? Using the mean (average), for example, the CLT posits that the distribution of sample means from repeated random samples of a population will be approximately normal once the samples are reasonably large (assuming the population is not some weird distribution with infinite variance, etc.). So I'm not sure why they chose to compare individual scores to a normal distribution, rather than the means of a bunch of samples of those individuals.
They should have taken their (admittedly, very awesome) data and done a quick random sampling using R or something. Take the mean of each sample they take, and then build a distribution of those sample means. THEN, they should test for normality. That way, we test the applicability of the CLT to the given data, rather than testing the data to be from a distribution where the CLT won't apply. I think this gives them a much stronger hypothesis to base their tests on.
But here is the most disappointing part: They don't even test any distributions besides Gaussian and Paretian on the raw individual data. If the raw data are really what they're interested in, they should also be testing the Poisson and Negative Binomial (or any number of other distributions). I imagine that there is some other distribution that fits this data just as well as, or better than, the power law. Or maybe not, but at least use a reasonable test. A test only for normality on this type of data, in my opinion, is not a reasonable comparison. Their test is the equivalent of saying, "Well, Barry Bonds's batting average is closer to .500 than .000, so we can conclude that he is a .500 career hitter." That kind of logic doesn't fly in my book.
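To make the "test other distributions" point concrete, here is a minimal Python sketch (standard library only) of comparing two candidate count distributions by maximized log-likelihood. Everything in it is my own illustration: the `toy_counts` data are simulated, not the paper's, and the geometric is used only as a simple overdispersed alternative to the Poisson. Both models have one fitted parameter, so the log-likelihoods can be compared directly.

```python
import math
import random
import statistics


def poisson_loglik(counts):
    """Log-likelihood of counts under a Poisson fitted by maximum likelihood."""
    lam = statistics.fmean(counts)  # the Poisson MLE is the sample mean
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def geometric_loglik(counts):
    """Log-likelihood under a geometric on 0, 1, 2, ... fitted by maximum likelihood."""
    p = 1.0 / (1.0 + statistics.fmean(counts))  # the geometric MLE
    return sum(math.log(p) + k * math.log(1.0 - p) for k in counts)


def geometric_draw(rng, p):
    """Inverse-CDF draw from a geometric distribution on 0, 1, 2, ..."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(1.0 - p))


# Simulated, hypothetical counts with mean near 2 and variance well above 2.
rng = random.Random(7)
toy_counts = [geometric_draw(rng, 1 / 3) for _ in range(2000)]

pois_ll = poisson_loglik(toy_counts)
geo_ll = geometric_loglik(toy_counts)
print("Poisson log-likelihood:  ", round(pois_ll, 1))
print("Geometric log-likelihood:", round(geo_ll, 1))
```

The higher log-likelihood marks the better-fitting candidate for that sample. With real data one would add the negative binomial and a power law to the same comparison.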
I truly hope these authors don't think they are refuting the application of the CLT (I don't think they do, but infinite variance matters precisely because the CLT won't apply in that case). If their implication is that "everything has infinite variance", then I guess the implication is that we can't run any statistical tests. But they have not provided sufficient evidence for that here. They did show that the raw data probably aren't normal, but any relatively informed person with an intro statistics course could have told you that, and this seems to be inappropriate for a good journal unless it is full of uninformed papers.
We can use R to show the CLT to be the case for the Poisson with the following (extremely simple) script. All this does is take 1 million random Poisson (lambda = 2) draws and calculate the mean 5,000 times. Note that we don't need 1 million draws, nor do we need 5,000 samples to show this. But we have the computing power so why not. Then we plot it with a histogram and qqplot to see if it looks normal. The Shapiro-Wilk test is simply a formal way to test the normality (not a test I like to use much, but it exists so why not).
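For anyone following along without R, here is a rough Python sketch of the same idea using only the standard library. It is a stand-in of my own, not the script described above: the sample sizes are smaller so it runs quickly, the Poisson sampler is hand-rolled (Knuth's method), and skewness is used as a crude numerical check in place of the histogram, qqplot, and Shapiro-Wilk test.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_means(rng, lam, n_samples, sample_size):
    """Mean of each of n_samples independent Poisson samples."""
    return [
        statistics.fmean(poisson_draw(rng, lam) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Population skewness: 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(42)
raw = [poisson_draw(rng, 2.0) for _ in range(20000)]               # individual counts
means = sample_means(rng, 2.0, n_samples=2000, sample_size=200)    # sample means

print("skewness of raw counts:  ", round(skewness(raw), 3))
print("skewness of sample means:", round(skewness(means), 3))
print("mean of sample means:    ", round(statistics.fmean(means), 3))
```

The raw counts should come out clearly right-skewed, while the sample means should be close to symmetric around 2, which is the CLT at work.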
Of course, this assumes the data are Poisson. Given the variance parameters they have for academic publication (in the tables), there seems to be some overdispersion in some areas and underdispersion in others. However, they don't present an overall mean and variance for all publications, which by eyeballing looks like it could be pretty close to Poisson (mean = variance).
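A quick way to check the mean = variance property is the dispersion index (variance divided by mean). This is a small Python sketch of my own; the function name is just illustrative and the input list is a made-up example, not numbers from the paper's tables.

```python
import statistics


def dispersion_index(counts):
    """Sample variance over sample mean: about 1 for Poisson-like counts,
    above 1 for overdispersion, below 1 for underdispersion."""
    return statistics.variance(counts) / statistics.fmean(counts)


# Made-up example: sample variance 2.5, mean 2.0.
print(dispersion_index([0, 1, 2, 3, 4]))
```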
In the overdispersion case, we could use the negative binomial (or perhaps geometric) and rework our variable. Of course, it is difficult to operationalize the likelihood of getting into a journal (and this is not the same for each person), number of attempts, etc., so that's why I stuck to Poisson here.
BUT, since they have the raw data, they can just sample from that anyway so we have no reason to bother assuming a distribution. We simply need to know if it conforms to our statistical tests that are based on the CLT.
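Resampling straight from the raw data might look something like the Python sketch below. The `toy_counts` list is a hypothetical placeholder I made up to be heavily right-skewed; it is not the paper's data, and the real analysis would swap in the actual observations.

```python
import random
import statistics


def resampled_means(data, n_samples, sample_size, seed=0):
    """Draw n_samples random samples (with replacement) from data and
    return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(data, k=sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical placeholder counts (NOT the paper's data): 100 people,
# most with few publications and a handful with many.
toy_counts = (
    [0] * 30 + [1] * 25 + [2] * 18 + [3] * 10 + [4] * 6
    + [5] * 4 + [8] * 3 + [12] * 2 + [20] + [35]
)

means = resampled_means(toy_counts, n_samples=2000, sample_size=100, seed=1)
print("mean of raw data:     ", statistics.fmean(toy_counts))
print("mean of sample means: ", round(statistics.fmean(means), 3))
print("stdev of sample means:", round(statistics.stdev(means), 3))
```

The list of sample means is what would then go into a histogram, qqplot, or formal normality test.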
Issue #2: One thing that seems to be conflated here is the distribution of performance we would see if all people were participating in a given profession with the observed performance of those actually in the profession. This is a contention with many sabermetricians and the work of DeVany, if I remember correctly.
Anyway, it seems to me that this paper chose some additional biased samples to evaluate. The distribution of talent itself in any given profession is not likely to be normally distributed, let alone the performance relative to those who selected in. There is selection into that occupation based on ability, especially so in those highly compensated based on observable performance. There is also a minimum wage, which keeps us from seeing the far, far left of the distribution in the U.S. even in the lowest skilled jobs. Nonetheless, even if we could see this, experience has a way of morphing the distribution and job title tends to mean some jump to the next occupation.
We also have to remember there is a bare minimum in performance allowed before getting fired (related to the minimum wage). If we have shirkers, or if there is little chance of promotion, economic theory would predict more employees to hang around doing just enough to get paid. But they don't really choose these sorts of jobs (and explicitly state that they choose heavily performance-based pay jobs for this reason), so that's a minor quibble.
Issue #3 comes from the operationalization of their variables. For example, using Academy Award Nominations has a number of problems. This is similar to using the MVP to measure the distribution in talent in baseball (and these relate to Issue #1 directly). These are rank-based. Ranks are messy in this way. We would have to expect some high random variation across acting performances for a "good" actor and "bad" actor to expect the former to be considered the "best" actor at any given point. In other words, you could have a perfectly normal distribution of acting performances, and no error in individual performance (completely deterministic), and the same exact person will get every single Academy Award every year. That seems like a strange way to test for normality. The distribution of these awards is almost certainly not normal, and we don't need to resort to a power law test to know that.
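The rank problem is easy to see in a toy simulation. The Python sketch below is entirely my own construction, with made-up numbers of actors and years: skill is drawn from a normal distribution, one "award" per year goes to the top performer, and `noise_sd` controls how much individual performances vary from year to year.

```python
import random
from collections import Counter


def simulate_awards(n_actors, n_years, noise_sd, seed=0):
    """Give one award per year to the best performance, where
    performance = fixed normal skill + normal year-to-year noise."""
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, 1.0) for _ in range(n_actors)]
    wins = Counter()
    for _ in range(n_years):
        perf = [s + rng.gauss(0.0, noise_sd) for s in skill]
        winner = max(range(n_actors), key=perf.__getitem__)
        wins[winner] += 1
    return skill, wins


# Deterministic performance: the single most skilled actor wins every year.
skill, wins_fixed = simulate_awards(500, 50, noise_sd=0.0, seed=2)
print("distinct winners, no noise:  ", len(wins_fixed))

# Noisy performance: awards spread out, but most actors still have zero.
_, wins_noisy = simulate_awards(500, 50, noise_sd=1.0, seed=2)
print("distinct winners, with noise:", len(wins_noisy))
print("most wins by one actor:      ", max(wins_noisy.values()))
```

In both runs the underlying skill is perfectly normal, yet the award counts are piled up at zero with a few people holding everything.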
Also, I'm willing to bet there is a momentum factor with Academy Award nominations, and winning an award puts that person in the eyes of the voters more often. Therefore, all else equal, they are more likely to win the award again (my guess, though that's an empirical question). In other words, each successive award is not independent of the other. So this isn't a variable I would use to gauge performance in the first place.
Issue #4 is that they're using relative performance as a measure (touched on in #3). This is an abstraction that, admittedly, could be off due to my limited expertise in the subject. But it's not something we think about much, so I am open to comments on this.
In something like baseball, performance outcomes are invariably based on relative skill. They are not piecemeal (but the Schmidt & Hunter (1983) paper they cite as part of their rationale actually does test piecemeal work!). In this way, we can think of two variables. The first is batter skill. The second is pitcher skill. These two skills are independent of one another. The performance, however, is not independent of either of these skills. We may be able to say that performance outcomes of batters are independent of other batters, so let's do that to simplify.
Even if we do, we cannot ignore the structure of the variable of measured baseball performance in MLB. If we have two random variables of Batter Skill (X) and Pitcher Skill (Y) that we assume, innately, are normally distributed (and independent), then the observed outcome is not X, it is Z.
The problem with Z, if calculated as a ratio of two normal random variables for example (PLEASE SEE***), is that it need not be normal at all. In the textbook case of two independent zero-mean normals, the ratio is Cauchy distributed, which has a tendency toward extreme outliers purely because of how we operationalized the variable. But this is in measured outcomes--based on sample selection bias to boot--not in ability. Perhaps some strange structure of Z is driving some of the result, but I'm not sure this is all that useful.
***Keep in mind this is an over-simplification of the performance measure. It is likely something more complicated than Z = X/Y, which means it might be some other distribution. But beyond what I have stated here, I don't have the expertise to comment. And my interpretation here could also be incorrect. The point is simply that, depending on how you define your performance variable, you could be creating something unwieldy. Perhaps that is an important lesson, but not the one they try to get at in the paper.
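As a sanity check on the ratio idea, here is a small Python sketch under the simplest possible assumption: X and Y are independent standard normals (zero mean, unit variance), in which case Z = X/Y is standard Cauchy. That assumption is mine and is certainly not a realistic model of batters and pitchers; the point is only how heavy the tails get.

```python
import math
import random


def ratio_of_normals(n, seed=0):
    """Draw n values of Z = X / Y with X, Y independent standard normals."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if y != 0.0:  # guard against an exact zero denominator
            out.append(x / y)
    return out


def tail_fraction(values, cutoff):
    """Share of values whose absolute value exceeds cutoff."""
    return sum(abs(v) > cutoff for v in values) / len(values)


z = ratio_of_normals(20000, seed=3)
rng = random.Random(4)
plain_normal = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Standard Cauchy: P(|Z| > c) = 1 - (2 / pi) * atan(c)
print("theoretical Cauchy share beyond 10:", 1 - (2 / math.pi) * math.atan(10))
print("simulated ratio share beyond 10:   ", tail_fraction(z, 10))
print("plain normal share beyond 10:      ", tail_fraction(plain_normal, 10))
```

Two perfectly well-behaved inputs produce an outcome with far more extreme values than a normal would ever give.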
Issue #5 is that with actors and actresses, the independent skill level is, again, not measured. In fact, performance itself is not independent here. Better actors/actresses are more likely to be paired with better writers and better directors. When they are judged on their performance, there is an additive or multiplicative effect. A great actor in a poorly written movie with a terrible director is much less likely to receive acclaim than a great actor in a masterfully written and well-directed movie. So, you get this power law distribution from measuring this outcome, not from measuring ability. These high-skill people tend to cluster together, which makes the outcome distribution different from the skill distribution. Let's not forget that there are lots of starving actor wannabes who are probably terrible, while most of us decided long ago we wouldn't bother being an actor because we suck at it (again, selection bias here). That is not to say that someone out there who is not an actor couldn't act better than the starving actor moonlighting as a bartender. We just don't observe their acting performance, and they don't team up with other talented people in the biz.
Issue #6: They use EPL Yellow Cards as a measure of negative performance. Those who are fans of soccer know that yellow cards can result from strategic behavior.
They also use MLB career errors (by individual player) without accounting for play time as far as I can tell. This is a big time "huh?" moment in my mind. Even if the outcomes of this strange variable follow this distribution, it doesn't mean they're unexpectedly worse than everyone. It means that they're awfully good at something else to keep them around to make those errors (i.e. an excellent hitter). It is likely that many players could fill the "error void" in the distribution had they only been better hitters. I haven't read in too much detail about all of the measures here, but this stuck out to me.
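To see why play time matters, consider a toy Python simulation in which every player has exactly the same fielding ability and only career length differs. All of the numbers here (a 3% error rate per chance, career lengths drawn from an exponential distribution, 300 players) are assumptions I made up for illustration, not estimates from MLB data.

```python
import random
import statistics


def simulate_careers(n_players, error_rate, seed=0):
    """Return (chances, errors) per player. Every player has the SAME
    per-chance error rate; only the number of fielding chances varies."""
    rng = random.Random(seed)
    careers = []
    for _ in range(n_players):
        chances = 200 + int(rng.expovariate(1 / 2000))  # hypothetical career length
        errors = sum(rng.random() < error_rate for _ in range(chances))
        careers.append((chances, errors))
    return careers


def coef_var(values):
    """Coefficient of variation: spread relative to the mean."""
    return statistics.pstdev(values) / statistics.fmean(values)


careers = simulate_careers(300, error_rate=0.03, seed=5)
totals = [errors for _, errors in careers]
rates = [errors / chances for chances, errors in careers]

print("relative spread of career error totals:", round(coef_var(totals), 2))
print("relative spread of errors per chance:  ", round(coef_var(rates), 2))
```

Career totals vary wildly and the leaders are simply the players who stuck around longest, even though nobody here is a worse fielder than anyone else.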
Issue #7: The authors explicitly note their "ambitious goal" to refute the idea that performance is normal (assuming that claim is still up in the air to begin with). They then proceed as if having so much data makes the ambitious goal reachable by itself. This is a fallacy many people make. More data is generally very good to have, but if you're not running the most useful tests on that data, then it may as well be small data.
I am sure there is more here, but I've used up enough time. Seems to me that this is another attempt at a "sexy" paper, rather than one that actually tests the distribution of the data. If they had done all this and at least tested against other possible distributions of the data, then I would probably say "interesting". But the leap from "not normal" to "power law" is a tough one to swallow when there is nothing about the in-between. Certainly, z-scores (apparently their use in performance data) can be useful for non-normal distributions without infinite variance. So why not make this clear?
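On that last point, Chebyshev's inequality is the reason z-scores stay meaningful for any distribution with finite variance: no more than 1/k^2 of observations can sit k or more standard deviations from the mean. A short Python sketch with simulated exponential data (my choice of a clearly non-normal example, not anything from the paper):

```python
import random
import statistics


def z_scores(values):
    """Standardize values using their own mean and (population) standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(v - m) / s for v in values]


def share_beyond(zs, k):
    """Share of z-scores at least k standard deviations from the mean."""
    return sum(abs(z) >= k for z in zs) / len(zs)


rng = random.Random(11)
skewed = [rng.expovariate(1.0) for _ in range(10000)]  # right-skewed, finite variance
zs = z_scores(skewed)

for k in (2, 3, 4):
    print(f"k={k}: observed share {share_beyond(zs, k):.4f}, Chebyshev bound {1 / k**2:.4f}")
```

The bound is loose, but it holds no matter what shape the data take, which is exactly what breaks down when the variance is infinite.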