Primer

The replication crisis in systematic investing

Three published meta-studies that anchor the entire reason Nullberg exists. Exact numbers, no paraphrasing, with links to every source.

Before Nullberg replicates a single paper, we owe the reader the answer to one question. Why does this publication exist at all? The answer is in three independent meta-studies, each published in a top-tier journal, each examining a large body of claimed alpha, and each arriving at the same conclusion from a different angle. Most published alpha does not survive contact with independent data.

The numbers below are taken verbatim from the published abstracts of each paper. Every link points to the paper itself. Nothing on this page is paraphrased beyond what is in the source, and every quantitative claim is marked with the citation that backs it.

Harvey, Liu, Zhu 2016: the multiple-testing problem

In 2016, Campbell Harvey, Yan Liu, and Heqing Zhu published a survey of the asset-pricing factor literature in the Review of Financial Studies. They collected 316 factors that had been proposed as predictors of the cross-section of expected returns and asked a single methodological question. If you are going to test hundreds of candidate factors on the same historical data, what significance threshold should a new factor clear before you believe the result?

Their answer was that the conventional threshold, a t-statistic above roughly 2.0, is far too lenient. Under a multiple-testing framework that accounts for how many factors have already been tried on the same data, a newly proposed factor should clear a t-statistic greater than 3.0 before the community treats it as real. The authors state directly that most claimed research findings in financial economics are likely false (Harvey, Liu, Zhu 2016).

That is the first piece of evidence. A t-statistic around 2, the level that gets a paper published, is not the level that justifies belief after the community has searched over hundreds of candidates on the same history.
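
To see why the hurdle has to move, a back-of-the-envelope calculation is enough. The sketch below assumes independent tests and a normal approximation, which is a simplification; Harvey, Liu, and Zhu work with multiple-testing corrections such as Bonferroni, Holm, and false-discovery-rate methods on correlated tests, and their t > 3.0 recommendation comes from that machinery, not from this toy. But the toy makes the direction of the problem obvious.

    from statistics import NormalDist

    n_factors = 316      # candidate factors surveyed by Harvey, Liu, and Zhu
    alpha = 0.05         # conventional two-sided level, roughly |t| > 1.96

    # If all 316 factors were pure noise and the tests were independent,
    # how many would clear the conventional bar by luck alone?
    expected_false_positives = n_factors * alpha        # about 15.8
    p_at_least_one = 1 - (1 - alpha) ** n_factors       # effectively 1.0

    # A naive Bonferroni correction spreads the 5% error budget across all
    # 316 tests, which pushes the per-test hurdle well above t = 2.
    bonferroni_hurdle = NormalDist().inv_cdf(1 - (alpha / n_factors) / 2)   # about 3.78

    print(f"Expected spurious 'discoveries': {expected_false_positives:.1f}")
    print(f"P(at least one spurious 'discovery'): {p_at_least_one:.6f}")
    print(f"Bonferroni-implied hurdle: |t| > {bonferroni_hurdle:.2f}")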

Hou, Xue, Zhang 2020: the direct replication

In 2020, Kewei Hou, Chen Xue, and Lu Zhang did the harder version of the same exercise. They did not propose a statistical hurdle. They ran the replications themselves.

Their paper “Replicating Anomalies” in the Review of Financial Studies compiled a library of 452 anomalies drawn from the published literature and attempted to reproduce each one on a common data footing. The key adjustments they made were to use NYSE breakpoints for sorting stocks into portfolios, and to report value-weighted rather than equal-weighted returns. These adjustments matter because equal-weighted returns and non-NYSE breakpoints allow microcap stocks to drive the result, and microcap effects do not survive real-world transaction costs.
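
To make the two adjustments concrete, here is a minimal sketch of a single decile sort under both regimes. The column names and the single-signal setup are illustrative assumptions, not Hou, Xue, and Zhang's actual code; their replication library covers 452 anomalies with far more care than this.

    import pandas as pd

    def decile_returns(panel: pd.DataFrame, nyse_breakpoints: bool, value_weight: bool) -> pd.Series:
        """Average monthly return per signal decile.

        Assumed columns: date, signal (sorting variable at formation),
        ret (next-month return), mktcap (market cap at formation), exchange.
        """
        rows = []
        for date, month in panel.groupby("date"):
            # Breakpoints from NYSE names only, or from the full cross-section.
            universe = month[month["exchange"] == "NYSE"] if nyse_breakpoints else month
            edges = universe["signal"].quantile([i / 10 for i in range(11)]).to_numpy()
            edges[0], edges[-1] = -float("inf"), float("inf")   # every stock gets a bucket

            month = month.assign(
                decile=pd.cut(month["signal"], bins=edges, labels=False, duplicates="drop")
            )
            for decile, bucket in month.groupby("decile"):
                if value_weight:
                    weights = bucket["mktcap"] / bucket["mktcap"].sum()
                    ret = (weights * bucket["ret"]).sum()        # value-weighted
                else:
                    ret = bucket["ret"].mean()                   # equal-weighted
                rows.append({"date": date, "decile": decile, "ret": ret})

        return pd.DataFrame(rows).groupby("decile")["ret"].mean()

The anomaly return the literature reports is typically the spread between the top and bottom deciles. Running that spread equal-weighted, with breakpoints taken from the full cross-section, is exactly the configuration that lets microcaps drive the result, which is what the two adjustments above are meant to prevent.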

With those adjustments in place, 65% of the 452 anomalies failed to clear the single-test hurdle of |t| ≥ 1.96. In the trading frictions category specifically, 96% failed. When they imposed the more demanding multiple-test hurdle of |t| ≥ 2.78 at the 5% significance level, 82% of the anomalies failed. And even among the ones that replicated, the authors note that the economic magnitudes are much smaller than originally reported (Hou, Xue, Zhang 2020).

That is the second piece of evidence. When someone with clean code and clean data actually runs 452 published anomalies through the same rubric, most of them fail at the level the original papers said they were significant.

McLean, Pontiff 2016: the post-publication decay

The third piece of evidence is different in shape. David McLean and Jeffrey Pontiff published “Does Academic Research Destroy Stock Return Predictability?” in the Journal of Finance the same year as Harvey, Liu, and Zhu, and asked a specific forward-looking question. Suppose a predictor did find a real edge in the original paper’s sample. What happens after the paper is published?

They took 97 variables that had been shown in prior published work to predict cross-sectional stock returns, and measured the portfolio returns in two regimes. First, in the out-of-sample period after the original paper’s sample ended but before publication. Second, in the post-publication period after the paper became widely known.

The out-of-sample returns were 26% lower than the in-sample returns originally reported. The post-publication returns were 58% lower than the in-sample returns. The 26% figure is an upper bound on pure data-mining bias, because it measures decay over a window when the paper had not yet been published, so no other investors could have been trading on the result. The gap between those two numbers, 32 percentage points, is their estimate of how much of the decay is attributable to publication-informed trading by other market participants (McLean, Pontiff 2016).
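
The decomposition is plain arithmetic once the two decay figures are in hand. The sketch below only restates the published percentages, with the original in-sample return normalized to 1.

    # McLean and Pontiff's published decay figures, in-sample return normalized to 1.0.
    in_sample = 1.00
    out_of_sample = 1.00 - 0.26      # 26% decay before publication  -> 0.74
    post_publication = 1.00 - 0.58   # 58% decay after publication   -> 0.42

    data_mining_bound = in_sample - out_of_sample           # 0.26, upper bound on data mining
    publication_effect = out_of_sample - post_publication   # 0.32, publication-informed trading

    print(f"Data-mining bound:            {data_mining_bound:.0%} of the in-sample return")
    print(f"Publication-informed trading: {publication_effect:.0%} of the in-sample return")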

That is the third piece of evidence. Even predictors that survive the statistical hurdle and the replication attempt decay sharply as soon as the market knows about them.

What this means for every new claim

The three studies attack the problem from three directions. Harvey, Liu, and Zhu attack it theoretically through the multiple-testing framework. Hou, Xue, and Zhang attack it empirically through direct replication. McLean and Pontiff attack it dynamically through post-publication decay. All three arrive at versions of the same conclusion.

  • A published t-statistic around 2 is not evidence of a real effect if the literature has searched over hundreds of candidates.
  • When an independent team actually re-runs the code with a consistent methodology, most of those results fail.
  • Even the ones that replicate typically decay once the market is trading against them.

A rational reader of any new alpha paper should therefore assume, as a prior, that the claim will not replicate, will not survive rigorous methodology, and will not persist after publication. The burden of proof is on the claim, not on the skeptic.

This is why Nullberg exists. Every paper we cover is coded from its published methodology, run on independently sourced historical data, and graded on a transparent four-level rubric before we write a single conclusion. The replications are public. The code is public. The verdicts are tracked, and when the evidence changes, the verdict changes with it.

The archive begins empty on purpose. The first verdict is being written now, and it will be published when the code, the data, and the writing all pass our own audit gate. Not before.

Bibliography

  1. Harvey, Campbell R., Yan Liu, and Heqing Zhu. “… and the Cross-Section of Expected Returns.” Review of Financial Studies 29(1), 2016, pp. 5–68. Paper
  2. Hou, Kewei, Chen Xue, and Lu Zhang. “Replicating Anomalies.” Review of Financial Studies 33(5), 2020, pp. 2019–2133. Paper
  3. McLean, R. David, and Jeffrey Pontiff. “Does Academic Research Destroy Stock Return Predictability?” Journal of Finance 71(1), 2016, pp. 5–32. Paper