Four verdicts, four shapes: reading the first Nullberg data runs

The first primer argued that most published alpha does not replicate, and cited three meta-studies that justify that prior. It is a strong prior, but it is still a prior. The archive now has four specific data runs on four specific papers, and those runs produce four qualitatively different outcomes. This primer reads the patterns in them.

The rule I am going to follow on this page is strict. I will label every claim as either evidence (something the four verdicts actually demonstrate with numbers) or hypothesis (a plausible mechanism that is consistent with the evidence but not established by it). Claims that are only hypotheses will not be presented as conclusions.

The four runs, in one table

Every verdict was run on the same 2016 to 2026 US equity universe, drawn from the operator’s 10-year price cache. The two accounting-based factors share one point-in-time fundamentals pipeline. All four verdicts use the same pre-registered rubric family (REPLICATED if mean > paper-specific floor AND t > 2, FAILED if mean ≤ 0 OR t ≤ 2) and the same four-specification scaffold: a value-weighted NYSE-breakpoint primary plus three sensitivities, at least one of them equal-weighted.

Paper	Factor class	Primary mean monthly return	Primary i.i.d. t-statistic	Verdict	Short description
Bali Cakici Whitelaw 2011	Lottery (1-month MAX)	-1.80%	-2.57	Failed	Sign inverted at significance
Jegadeesh Titman 1993	Trend (12-2 momentum)	-0.13%	-0.16	Failed	Decayed to a statistical null
Novy-Marx 2013	Quality (gross profits-to-assets)	+0.50%	+0.93	Failed	Right direction at above-canonical magnitude, but underpowered
Fama French 1992	Value (book-to-market)	+1.70%	+2.82	Replicated (regime-driven)	Full-sample rubric passes, but driven by 2022 and fails under Newey-West
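
The rubric is simple enough to write down as code. The sketch below is a minimal illustration, not the archive's actual script. The names PrimaryResult and verdict are hypothetical, returns are expressed as decimals, and the floors are the ones disclosed later on this page (0.007 for MAX and momentum, 0.002 for profitability, 0.003 for value). The two stated clauses leave one case open, a mean between zero and the floor with t > 2, and the sketch labels that case UNCLASSIFIED rather than guessing how the rubric treats it.

```python
from dataclasses import dataclass

T_THRESHOLD = 2.0  # i.i.d. t-statistic cutoff stated in the rubric


@dataclass(frozen=True)
class PrimaryResult:
    """One primary specification (hypothetical container, for illustration)."""
    paper: str
    mean_monthly: float  # decimal monthly return, e.g. 0.017 for +1.70%
    t_iid: float
    floor: float         # paper-specific floor, same units as mean_monthly


def verdict(r: PrimaryResult) -> str:
    """Apply the two stated rubric clauses to one primary specification."""
    if r.mean_monthly <= 0 or r.t_iid <= T_THRESHOLD:
        return "FAILED"
    if r.mean_monthly > r.floor:
        return "REPLICATED"
    # 0 < mean <= floor with t > 2: neither stated clause applies.
    return "UNCLASSIFIED"


# Primary rows from the table above, with the floors disclosed on this page.
RUNS = [
    PrimaryResult("Bali Cakici Whitelaw 2011", -0.0180, -2.57, 0.007),
    PrimaryResult("Jegadeesh Titman 1993", -0.0013, -0.16, 0.007),
    PrimaryResult("Novy-Marx 2013", 0.0050, 0.93, 0.002),
    PrimaryResult("Fama French 1992", 0.0170, 2.82, 0.003),
]

for run in RUNS:
    print(f"{run.paper}: {verdict(run)}")
```

Note that profitability fails on the t clause alone: its mean is above its floor, which is what "right direction, but underpowered" means in rubric terms.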

What the four runs show (evidence)

These are statements the runs actually produce. Each one is a direct report of what the pre-registered specifications produced after the data was prepared and the rubric was applied.

The only factor in our four that clears the rubric is the one that was designed from accounting fundamentals held against a time-varying market denominator. Book-to-market is the only primary specification whose i.i.d. t-statistic exceeds 2 in absolute value on the paper’s predicted side. The Bali-Cakici-Whitelaw MAX factor produces a statistically significant result in the opposite direction from the paper’s prediction. Jegadeesh-Titman momentum produces an effect indistinguishable from zero. Novy-Marx profitability produces the right direction at 1.6x the canonical magnitude but with a t-statistic of only +0.93.
Three of the four factors behave differently on equal-weighted small-to-mid-cap universes than on value-weighted large-cap portfolios. Momentum is the exception: it is zero under both weightings. MAX is negatively significant under both, with larger magnitude under EW. Profitability flips sign from positive under VW to negative under the filtered EW variant. Value drops from +1.70% under VW to +0.16% under EW. The asymmetry does not run in one direction: for profitability and value the signal is concentrated in larger stocks, while the MAX inversion is stronger in the equal-weighted universe.
The value verdict’s replication is regime-driven, not uniform through time. Split at the sample midpoint: 2016-01 to 2021-01 shows mean +0.585% with t = +0.73, and 2021-02 to 2026-02 shows mean +2.810% with t = +3.20. 2022 alone averages +9.24% per month. If a researcher had run the same script in January 2021, they would have returned a FAILED verdict. The script would have been identical; what changed was the sample. The macro environment behind that change is the subject of Hypothesis 4.
Newey-West HAC adjustment matters for the HML 2×3 construction. The value primary’s i.i.d. t of +2.82 becomes +1.94 after a 12-lag Bartlett-window Newey-West correction, because monthly HML returns are mildly autocorrelated. The simple D10-D1 sensitivity is less affected and clears |t| > 2 under both i.i.d. and Newey-West. The 2×3 construction’s tighter cross-section makes it more sensitive to the autocorrelation correction.
No factor in our four replicated without caveats. Even value, which technically clears the pre-registered rubric, does so only because of the 2022 value rebound, only on value-weighted large-cap portfolios, and only under the i.i.d. t-statistic. The MAX inversion is the single result in the archive with a clean significance-crossing finding in either direction.
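
For readers who want to see what the Newey-West correction does mechanically, here is a minimal sketch of a t-statistic for a mean return using a Bartlett-kernel long-run variance. It is an illustration under stated assumptions, not the archive's code: the function names t_iid and t_newey_west are hypothetical, the input is a plain array of monthly long-short returns, and the autocovariances are divided by n, which is one common convention among several.

```python
import numpy as np


def t_iid(returns):
    """Mean divided by the usual standard error, assuming independent months."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / (r.std(ddof=1) / np.sqrt(r.size))


def t_newey_west(returns, lags=12):
    """t-statistic of the mean with a Bartlett-kernel Newey-West variance."""
    r = np.asarray(returns, dtype=float)
    n = r.size
    if n <= lags:
        raise ValueError("need more observations than lags")
    e = r - r.mean()
    # Lag-0 autocovariance (divided by n, an assumed convention).
    long_run_var = e @ e / n
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)  # Bartlett weights decline linearly
        gamma_k = e[k:] @ e[:-k] / n     # lag-k autocovariance
        long_run_var += 2.0 * weight * gamma_k
    return r.mean() / np.sqrt(long_run_var / n)
```

When the weighted autocovariances sum to a positive number, the long-run variance exceeds the i.i.d. variance and the t-statistic shrinks. The value primary's move from +2.82 to +1.94 corresponds to a standard error roughly 1.45 times larger after the correction (2.82 / 1.94).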

What might explain the patterns (hypotheses, labeled as such)

The four observations above are empirical. What follows is not. The mechanisms below are plausible interpretations that are consistent with the data, but the data does not by itself select among them. I am flagging each as a hypothesis on purpose.

Hypothesis 1: Retail flow absorbed the lottery anomaly. The MAX factor’s sign inversion coincides with a roughly decade-long rise in commission-free retail trading, fractional shares, and option-buying on lottery-like names. If lottery preference was the original behavioral driver of the MAX anomaly, and retail flow scaled large enough to absorb and extract the forward return that sophisticated investors used to earn on the short side, the sign would mechanically flip. This is consistent with the archive’s MAX run but the archive does not contain any direct measurement of retail flow, so this is hypothesis, not evidence.

Hypothesis 2: Crowding eroded momentum. Momentum’s decay to a statistical null is consistent with the well-documented phenomenon of factor crowding, whereby a published factor attracts enough capital that the forward-expected return converges toward zero as the marginal trader sets the price. Our run is directionally consistent with that but does not measure capital flows into momentum strategies, so this too is hypothesis, not evidence.

Hypothesis 3: Quality and value have become large-cap-only phenomena. The profitability and value verdicts both show the same shape: the effect exists under value-weighted large-cap constructions and disappears or flips sign under equal-weighted mid-cap constructions. The most parsimonious explanation is that the mechanisms capturing both factors (earnings durability and valuation dispersion) are more cleanly priced on large-cap stocks, where analyst coverage is dense and institutional ownership is higher, than on smaller stocks where noise dominates. The size pattern itself is directly visible in the data, but the causal story (why larger stocks price these characteristics more cleanly) is still hypothesis.

Hypothesis 4: The post-2021 regime is a macro-driven conditional mean-reversion, not a return to a stable risk premium. The value factor’s 2022 rebound is the dominant contributor to its full-sample replication. 2022 was the year the market priced in higher-for-longer rates, high-duration growth stocks sold off, and value rose in relative terms. If that dynamic reverses in a future regime, the verdict would reverse with it. The living-archive model is designed to catch exactly this, but at the moment of first publication the regime conditionality is a feature of the run, not of the factor.
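
How dominant 2022 is can be read off the numbers already reported. The arithmetic below is a back-of-envelope check, not a new result: it uses the rounded full-sample mean (+1.70%) and the rounded 2022 mean (+9.24%), and it assumes 122 monthly observations, which is what the two 61-month halves (2016-01 to 2021-01 and 2021-02 to 2026-02) imply.

```python
N_MONTHS = 122      # 2016-01 through 2026-02, inclusive (assumed from the split dates)
FULL_MEAN = 1.70    # percent per month, value primary, as reported
MEAN_2022 = 9.24    # percent per month, calendar 2022, as reported
MONTHS_2022 = 12

total = N_MONTHS * FULL_MEAN         # sum of all monthly returns, in percent
from_2022 = MONTHS_2022 * MEAN_2022  # the part contributed by 2022
share_2022 = from_2022 / total
mean_ex_2022 = (total - from_2022) / (N_MONTHS - MONTHS_2022)

print(f"share of summed monthly returns from 2022: {share_2022:.0%}")
print(f"mean excluding 2022: {mean_ex_2022:.2f}% per month")
```

On these rounded inputs, 2022 accounts for about 53% of the summed monthly returns, and the mean over the other 110 months is about +0.88% per month. The calculation says nothing about the t-statistic excluding 2022, which would need the monthly series itself.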

What the patterns rule out

It is as important to flag what these four runs do not show as what they do.

They do not show that no published factor replicates. Value did. One data point.
They do not show that all factor failures look the same. MAX inverted, momentum decayed, profitability survived directionally, value replicated conditionally. Four qualitatively different outcomes in four runs.
They do not show that the rubric mechanically produces FAILED. It landed on REPLICATED for value, which a rubric built to fail everything would not have done.
They do not resolve the question of why any given factor worked when it did work. Nullberg reports whether a factor’s published result survives in a new sample. The mechanism is a separate question that a replication study cannot answer by construction.

What this changes in how Nullberg will run

Three decisions come out of the first four verdicts, in order of how load-bearing they are for the future of the archive.

1. Rubric additions. Starting with the next verdict, Newey-West HAC standard errors will be reported alongside the i.i.d. t-statistic on every specification. The pre-registered rubric will remain on the i.i.d. t for cross-verdict consistency, but the Newey-West number will be shown in the same row of every summary table so readers can apply the stricter standard if they prefer. Sub-sample stability splits will also be reported on every primary. The value verdict demonstrated why both of these matter.

2. Paper-specific calibration is non-negotiable. The first three verdicts used a 0.007 floor (MAX, momentum) and a 0.002 floor (profitability). The value verdict used 0.003. Each floor was calibrated to the canonical magnitude of the specific paper. The alternative, a one-size-fits-all floor, would have produced mechanically mismatched verdicts and would not have survived a review. Future verdicts will continue to calibrate floors from secondary sources that report the paper’s own canonical magnitude, and the calibration will be disclosed on the verdict page before the results are reported.

3. The archive needs more factor classes before the picture is reliable. Four verdicts is not a sample size on which to draw conclusions about the shape of the replication problem. The next round will add at least a second value variant (earnings-to-price, sales-to-price, or cash-flow-to-price), a second momentum variant (time-series momentum, residual momentum, or short-term reversal), and at least one alternative-data factor whose original claim is specifically about post-2015 data. The goal is to get to a dozen verdicts spanning at least six factor classes before the synthesis page is worth extending into a formal meta-analysis.
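
The sub-sample stability split in decision 1 can be sketched in a few lines. This is a minimal illustration assuming a pandas Series of monthly long-short returns indexed by month; the function names mean_and_t and stability_report are hypothetical, and splitting at the observation midpoint is an assumption that matches the 61/61 split reported for value.

```python
import numpy as np
import pandas as pd


def mean_and_t(returns: pd.Series):
    """Mean and i.i.d. t-statistic of one return series."""
    r = returns.dropna().to_numpy(dtype=float)
    mean = r.mean()
    t = mean / (r.std(ddof=1) / np.sqrt(r.size))
    return mean, t


def stability_report(returns: pd.Series) -> pd.DataFrame:
    """Full-sample and half-sample statistics for one monthly series."""
    r = returns.dropna().sort_index()
    mid = len(r) // 2  # with an odd count, the later half gets the extra month
    parts = {
        "full": r,
        "first_half": r.iloc[:mid],
        "second_half": r.iloc[mid:],
    }
    rows = []
    for name, part in parts.items():
        mean, t = mean_and_t(part)
        rows.append({
            "sample": name,
            "start": part.index[0],
            "end": part.index[-1],
            "n": len(part),
            "mean": mean,
            "t_iid": t,
        })
    return pd.DataFrame(rows).set_index("sample")
```

A Newey-West column could be added to the same rows so that both standards sit side by side, which is the layout decision 1 describes.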

What comes next in the archive

The living archive model means no verdict is final. Three scheduled re-examinations:

MAX time-varying VW. The original MAX verdict used snapshot market cap for value-weighting. A time-varying monthly VW was promised in that verdict’s “what we track from here” section. Novy-Marx and value both now use time-varying VW, so the infrastructure exists. MAX will be re-run on the new pipeline.
Momentum Newey-West and sub-sample. The momentum verdict did not report Newey-West or sub-sample stability at publication. It will be updated with both, following the same pattern as the value verdict.
Profitability six-portfolio sort. The profitability primary used a simple D10-D1 sort, which matches Novy-Marx’s own paper: value-weighted decile sorts with NYSE breakpoints and no industry controls. Recent factor research, however, uses a 2×3 size × profitability double sort analogous to HML, and that variant is worth running on our data for cross-verdict comparability.
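
For reference, a 2×3 double sort of the HML kind can be sketched for a single month as follows. This is an illustration under stated assumptions, not the pipeline the archive will run: the column names mktcap, signal, ret and is_nyse are hypothetical, the breakpoints follow the conventional choice (NYSE median for size, NYSE 30th and 70th percentiles for the characteristic), and the factor is the average of the two high-signal portfolios minus the average of the two low-signal portfolios, each value-weighted.

```python
import numpy as np
import pandas as pd


def double_sort_factor(month: pd.DataFrame) -> float:
    """One month's 2x3 factor return from a cross-section of stocks.

    Expects columns (hypothetical names): mktcap, signal, ret, is_nyse.
    Raises KeyError if any of the four corner portfolios is empty.
    """
    nyse = month[month["is_nyse"]]
    size_median = nyse["mktcap"].median()
    low_cut, high_cut = nyse["signal"].quantile([0.3, 0.7])

    df = month.copy()
    df["size_bucket"] = np.where(df["mktcap"] <= size_median, "S", "B")
    df["signal_bucket"] = np.select(
        [df["signal"] <= low_cut, df["signal"] > high_cut],
        ["L", "H"],
        default="M",
    )

    # Value-weighted return per portfolio: sum(w * r) / sum(w).
    df["weighted_ret"] = df["ret"] * df["mktcap"]
    sums = df.groupby(["size_bucket", "signal_bucket"])[
        ["weighted_ret", "mktcap"]
    ].sum()
    port = sums["weighted_ret"] / sums["mktcap"]

    high = (port.loc[("S", "H")] + port.loc[("B", "H")]) / 2
    low = (port.loc[("S", "L")] + port.loc[("B", "L")]) / 2
    return float(high - low)
```

Averaging the small and big legs before differencing is what keeps the factor roughly size-neutral, and it is the main structural difference from a D10-D1 sort, where the extreme deciles can differ in average size.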

Each of those will produce a dated update on its corresponding verdict page when it runs, and the archive change log will show the before and after. The model is not that a verdict is a final judgment. The model is that a verdict is the current state of the evidence, and the evidence is allowed to move.

The rule, restated

Every quantitative claim on this page traces to a specific number in one of the four verdicts. Every mechanism proposed above is labeled as a hypothesis. The reader who wants to disagree with the conclusions can click through to the verdict pages, look at the code, run it, and argue with the specific number that supports or refutes the claim. That is the entire operating model of the archive.

Bibliography

Hou, Kewei, Chen Xue, and Lu Zhang. “Replicating Anomalies.” Review of Financial Studies 33(5), 2020, pp. 2019-2133.
Bali, Turan G., Nusret Cakici, and Robert F. Whitelaw. “Maxing Out: Stocks as Lotteries and the Cross-Section of Expected Returns.” Journal of Financial Economics 99(2), 2011, pp. 427-446.
Jegadeesh, Narasimhan, and Sheridan Titman. “Returns to Buying Winners and Selling Losers.” Journal of Finance 48(1), 1993, pp. 65-91.
Novy-Marx, Robert. “The Other Side of Value: The Gross Profitability Premium.” Journal of Financial Economics 108(1), 2013, pp. 1-28.
Fama, Eugene F., and Kenneth R. French. “The Cross-Section of Expected Stock Returns.” Journal of Finance 47(2), 1992, pp. 427-465.