... no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he [sic] rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. --- R.A. Fisher (1959, p.42) as quoted in P Values are not Error Probabilities (Hubbard and Bayarri 2003)
ClinicalTrials.gov is probably the largest public repository of scientific research in the world. It contains 49,722 p-values.
I wanted to answer a simple question: are they clustered around 0.05?
My interest was sparked by a blog post showcasing a histogram of p-values from a handful of psychology journals.
Source: Larry Wasserman, Normal Deviate
Here's a similar histogram of p-values from the ClinicalTrials.gov database:
The two histograms look very different. The ClinicalTrials.gov p-values appear to be grouped much closer to 0, and in my opinion there's no visually apparent cluster at significance like the one in the psychology-journal p-values.
If I reduce the histogram's bin width from 0.005 to 0.001, clusters around round values like 0.01, 0.02, 0.03, 0.04, 0.05, ... become apparent.
There do appear to be more values around 0.05 than the counts at the other rounded p-values would suggest. Here are just the values rounded to the hundredths digit (the spikes at 0.01, 0.02, 0.03, etc.).
Visually there are about 100 more p-values at 0.05 than I'd expect given the steady decline from 0.01 through 0.10. Overall, the p-values from ClinicalTrials.gov showed much less clustering than I originally expected. I have to admit I expected to see visually obvious clustering, so these p-values aren't looking bad, especially compared to the psychology journals.
For context, here's a view including all numerically-reported p-values,
The data for this analysis come from the Aggregate Analysis of ClinicalTrials.gov (AACT) database of the Clinical Trials Transformation Initiative (CTTI). You can download the data used in this study from there: http://www.ctti-clinicaltrials.org/what-we-do/analysis-dissemination/state-clinical-trials/aact-database.
I have also hosted the latest unzipped files (those under 100 MB) here: https://github.com/statwonk/aact. I'm hoping to receive access to GitHub's Large File Storage support so I can add the rest of the files.
By no means do I present this analysis as final. I strongly believe that learning works best as a collaborative effort, so if you agree and are interested, I invite Pull Requests and Issues with feedback on GitHub.
A p-value is a basic expression of "rareness" of data given a Hypothesis \(H\). Notice the lack of subscript, like \(H_0\) or \(H_1\). This is intentional and crucially important.
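As a toy illustration of "rareness under a single hypothesis \(H\)" (my own sketch, not part of the original analysis), here is an exact one-sided p-value for coin flips, where \(H\) is simply "the coin is fair" with no \(H_0\)/\(H_1\) framing:

```python
from math import comb

def fair_coin_p_value(heads, flips):
    """One-sided p-value: the probability of observing `heads` or more
    heads in `flips` tosses, assuming the single hypothesis H that the
    coin is fair. Small values mean the data are rare under H."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

# 9 heads in 10 tosses is rare under a fair coin:
print(fair_coin_p_value(9, 10))  # 11/1024 ≈ 0.0107
```

Under Fisher's reading, 0.0107 is itself the evidence of rareness; no \(\alpha\) threshold is involved.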
The p-value was popularized by Ronald Aylmer "R.A." Fisher, and as Hubbard and Bayarri (2003) note, to Fisher its calculated value represented "a measure of evidence for the 'objective' disbelief in the null hypothesis; it had no long-run frequentist characteristics." To Fisher, a p-value is a measure of the rareness of a result, and given a small p-value we are left with two possible conclusions:
"... the hypothesis is not true," (Fisher 1960, p.8)
"... an exceptionally rare outcome has occurred," (Fisher 1960, p.8)
In sharp contrast to Fisher's p-values are Jerzy Neyman and Egon Pearson's \(\alpha\) values. The two are (shockingly, to me) unrelated (unless one takes the extra and uninformative step of calculating the p-value of a given test statistic).
\(\alpha\) values, as we are taught in statistics school, are long-run probabilities of Type I errors. A Type I error is rejecting the null hypothesis when it is true --- analogous to convicting an innocent person (given innocence is presumed). Therefore, the particular value of any p-value is meaningless in the framework of Neyman-Pearson (\(H_0\) / \(H_1\)-style) testing. With the usual hypothesis testing, the only consideration is whether the p-value is less than or greater than a chosen \(\alpha\). Indeed, as Hubbard and Bayarri found (p. 10),
Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, p. 291)
It's clear: Neyman and Pearson had no concern for the individual p-value, only whether, in repeated sampling, p-values fell below their \(\alpha\)s.
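A minimal simulation of that long-run behavior (my own sketch, not from the original analysis): under a true null, the rule "reject whenever \(p < \alpha\)" is wrong about \(\alpha\) of the time, no matter what the individual p-values look like.

```python
import math
import random

random.seed(1)

def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0 with known variance 1,
    using the standard-normal tail via erfc."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05
trials = 20_000
rejections = sum(
    z_test_p_value([random.gauss(0, 1) for _ in range(30)]) < ALPHA
    for _ in range(trials)
)
print(rejections / trials)  # long-run Type I error rate, close to ALPHA
```

The sample size (30) and number of trials are arbitrary; the rejection rate hovers near 0.05 regardless.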
This was Earth-shattering to me. For example, the canonical graduate statistics textbook, Casella and Berger, has it wrong:
A p-value \(p(\bf{X})\) is a test statistic satisfying \(0\leq p(\bf{x})\leq1\) for every sample point \(\bf{x}\). Small values of \(p(\bf{X})\) give evidence that \(H_1\) is true.
... and to leave no doubt, Casella and Berger (2002) go on to say,
Furthermore, the smaller the p-value, the stronger the evidence for rejecting \(H_0\). Hence, a p-value reports the results of a test on a more continuous scale, rather than just the dichotomous decision "Accept \(H_0\)", or "Reject \(H_0\)".
So even Casella and Berger appear to be mixing Fisher's and Neyman-Pearson's methods, despite clear, repeated, and strong objections from both parties.
I'd like to stop and give all of the credit for the above insights to Raymond Hubbard and M.J. Bayarri. (Check out the Hacker News discussion of their paper.)
Many researchers will [no] doubt be surprised by the statisticians' confusion over the correct meaning and interpretation of p values and \(\alpha\) levels. After all, one might anticipate the properties of these widely used statistical measures would be completely understood. But this is not the case. (Hubbard and Bayarri 2003)
Originally I had thought that reported p-values like "< 0.05" might simply be censored or truncated data; 21% of all the p-values are reported like this. For example,
## 
##   < .001    < .01    < .05 < 0.0001  < 0.001   < 0.01  < 0.025   < 0.05 
##       11        3        1      179       89        3        1       21 
##   < 0.55  <.00001 
##        1        5 
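To treat these entries as censored observations, one first has to split the comparator off the threshold. A sketch of that parsing step (my own Python illustration; the example strings mimic the AACT entries above, and the function name is hypothetical):

```python
import re

def parse_p_value(text):
    """Split a reported p-value into (value, censored), where censored=True
    means the entry only bounds the p-value from above (e.g. "< 0.05")."""
    match = re.match(r"^\s*<\s*([0-9.]+)\s*$", text)
    if match:
        return float(match.group(1)), True
    return float(text), False

# Hypothetical examples in the styles seen in the table above:
for entry in ["< .001", "< 0.05", "<.00001", "0.0123"]:
    print(parse_p_value(entry))
```

Censored values can then be handed to a censored-data fitting routine rather than discarded.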
So I have a bit of a conundrum. It's easy to visualize numeric-coded p-values like 0.0123, but at the same time I highly doubt the authors of 10,461 of the 39,261 p-values are using Fisher's single-\(H\) method (where the actual values matter). The devil's advocate might argue that the values are recorded simply to enable a second researcher to set their own \(\alpha\) --- I highly doubt it. So then should I really be focused on the acceptances and rejections?
The original and more sophisticated approach I was hoping to take was fitting a parametric model, a Beta distribution.
Larry Wasserman offers a mixture model,
$$h(p) = (1 - \pi)u(p) + \pi g(p)$$
where \(\pi\) is the share of tests in which the alternative hypothesis is true. Under this model, when the null hypothesis is true the p-values follow the uniform density \(u(p)\). This made intuitive sense to me.
Indeed Casella and Berger (2002) reminded me, that we can transform any continuous random variable into a uniform random variable by way of the Probability Integral Transformation (Fisher's?). Casella and Berger state,
Hence, by the Probability Integral Transformation ..., the distribution of \(p_\theta(\bf{X})\) is stochastically greater than or equal to a uniform(0, 1) distribution. That is, for every \(0\leq \alpha \leq 1\), \(P_\theta(p_\theta(\bf{X})\leq\alpha) \leq \alpha\). Because \(p(x) = \sup_{\theta'\in\Theta_0}p_{\theta'}(x) \geq p_\theta(x)\) for every \(x\),

$$P_{\theta}(p(\bf{X})\leq\alpha)\leq P_{\theta}(p_{\theta}(\bf{X})\leq\alpha)\leq \alpha$$
Holy cow. That's mighty terse, and I have to be honest: I don't yet exactly understand why this is true. I feel like I could re-read it a million times and still only be inching closer to the reason it implies p-values are uniformly distributed under a true null. As a note, there is an interesting discussion of the question on StackExchange, along with another proof. What's interesting, of course, is that uniform \(p\)s are defined in terms of \(\alpha\)s (Neyman and Pearson's method).
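The core of the argument is easier for me to see in simulation than in the sup notation (my own sketch): if a test statistic \(T\) has continuous CDF \(F\) under the null, then \(p = 1 - F(T)\) is Uniform(0, 1), so \(P(p \leq \alpha) = \alpha\) for every \(\alpha\).

```python
import math
import random

random.seed(7)

def p_value(t):
    """p = 1 - F(t) for a one-sided test whose statistic T is
    standard normal under the null (F = Phi)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# Draw T under the null, transform to p-values, and check uniformity:
ps = [p_value(random.gauss(0, 1)) for _ in range(100_000)]
for alpha in (0.1, 0.3, 0.5, 0.9):
    print(alpha, sum(p <= alpha for p in ps) / len(ps))  # roughly alpha
```

The standard normal is just a convenient stand-in; any continuous null distribution gives the same uniformity, which is exactly the probability integral transform at work.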
Okay, so we have \(u(p)\) in the case that the null is true, at least in terms of \(\alpha\)s. But what about when it's false?
The data shown in the Normal Deviate blog post come from the article A peculiar prevalence of p values just below 0.05. In the article the authors fit an exponential distribution to the data, but don't offer much justification for that choice.
Dave Giles offers a very interesting post on the matter, showing that we can at least say that the distribution of p-values under the alternative is not uniform.
It makes intuitive sense to me that there would be no particular theoretical distribution of p-values under the assumption that the alternative is true. As researchers learn their craft, are they not more likely to run better experiments, more likely to yield smaller p-values (or at least p-values smaller than the usual \(\alpha\)s)?
## Fitting of the distribution ' exp ' on censored data by maximum likelihood 
## Parameters:
##      estimate
## rate 3.078592
It's clear that the exponential distribution does not fit this data well.
## Fitting of the distribution ' beta ' on censored data by maximum likelihood 
## Parameters:
##         estimate
## shape1 0.4258880
## shape2 0.8907368
Not bad!
A closer inspection shows that the Beta distribution fails to produce a great fit after all, especially around the region of interest near 0.05.
For the curious, simply throwing away the values encoded like "< 0.05" and focusing on non-censored (truncated?) observations doesn't change the fit much.
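The fits above use maximum likelihood on censored data (the R output suggests fitdistrplus). As a rough cross-check on complete data, Beta parameters can also be matched to the sample mean and variance by the method of moments; this is my own sketch, not the method used above.

```python
import random

random.seed(3)

def beta_method_of_moments(values):
    """Match Beta(a, b) to the sample mean m and variance v:
    a = m * (m * (1 - m) / v - 1), b = (1 - m) * (m * (1 - m) / v - 1)."""
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Simulate data from roughly the reported ML fit and recover the shapes:
sample = [random.betavariate(0.426, 0.891) for _ in range(50_000)]
print(beta_method_of_moments(sample))  # roughly (0.426, 0.891)
```

Unlike the ML fit, this estimator has no natural way to handle the "< 0.05"-style censored entries, which is why the censored-likelihood approach is preferable for this data.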
Here's a Q-Q (quantile) plot for an even clearer view (with a vertical line at the probability that represents the value 0.05),
Though it's not an unpopular approach, I seriously doubt the utility of fitting a parametric distribution and then looking at residuals. Masicampo and Lalande found that an exponential fit well; a Beta is a much better fit here; perhaps someone else will find that a Gamma or some other distribution fits well. Without a theoretical underpinning for results under a true alternative hypothesis, how can we justify a parametric fit?
Larry Wasserman uses an interesting nonparametric approach based on order statistics. I played with this method a bit, but don't yet feel confident enough with it to offer its use here. If you'd like to contribute this method, it'd be appreciated at https://github.com/statwonk/clinicaltrials.
I had a lot of fun analyzing this data! There are some interesting questions that I would love to discuss further:
Is it wrong to focus on particular p-values when using an alternative hypothesis?
What share of researchers are using Fisher's single hypothesis / p-value method vs. Neyman-Pearson's \(\alpha\) method?
Will banning p-values increase the quality of scientific research?
Is there a list of common statistical textbooks that make the distinction vs. those that do not?
How can statisticians best help scientists understand the difference?
Should we remain "shoe clerks" [sic], educate, or all of the above?