Failed Verdict

The MAX factor, out of sample, 2016 to 2026

The original claim is that the lowest-MAX decile beats the highest-MAX decile by over 1% per month. On 122 months of US equity data, value-weighted with NYSE breakpoints and stocks-only filtering, the spread is not merely absent but inverted: high-MAX beats low-MAX by 1.80% per month with t = -2.57. The finding is robust across three equal-weighted sensitivities with t-statistics from -1.81 to -4.12. Backfill update 2026-04-11 confirms the inversion survives the Newey-West 12-lag HAC adjustment at t = -2.18, and is concentrated in the 2016-2021 first half (t = -2.29) versus the 2021-2026 second half (t = -1.38).

Nullberg verdict, replication, factor, lottery-anomaly

Source paper

Bali, Cakici, Whitelaw (2011) "Maxing Out: Stocks as Lotteries and the Cross-Section of Expected Returns"

The claim

Bali, Cakici, and Whitelaw (2011) published one of the most cited papers in the behavioral-finance-meets-cross-section literature. They proposed that investors overpay for stocks with recent lottery-like payoffs, so stocks with the highest maximum daily return in the prior month (the MAX factor) should go on to underperform. The abstract states the result verbatim:

“Average raw and risk-adjusted return differences between stocks in the lowest and highest MAX deciles exceed 1% per month.”

Their sample was July 1962 to December 2005 on the NYSE, AMEX, and NASDAQ universe (Bali, Cakici, Whitelaw 2011).

What we tested

Twenty-one years of different market structure separates the end of their sample from ours. Zero rates, retail trading booms, the meme-stock episode, COVID, and the rise of zero-day-to-expiry options have all happened in between. The out-of-sample question is blunt: does the MAX anomaly still work in the post-2015 US equity market, in the same direction, with anything close to the same magnitude?

Sample

  • 2016-01-05 to 2026-04-09 (~10 years, 122 usable formation months)
  • 5,568 US equities in the raw price cache, merged with 8,479 rows of company profile data on 6,293 stocks-only candidates after dropping ETFs, funds, and non-main exchange listings (OTC, CBOE, PNK)
  • Daily OHLCV, close-to-close returns

Methodology

  1. Compute each stock’s MAX as the maximum daily simple return within a calendar month
  2. At the end of each month, sort stocks into deciles by that month’s MAX
  3. Hold each decile portfolio for one month and rebalance at the next month end
  4. The long-short factor return is decile 1 (lowest MAX) minus decile 10 (highest MAX)
  5. Report mean, standard deviation, and t-statistic of the monthly long-short time series

Four specifications reported in full, with the primary chosen as the closest analog to the original paper that the available data supports:

  • PRIMARY (value-weighted, NYSE breakpoints, stocks-only, winsorized). Stocks-only means ETFs, funds, and OTC/CBOE/PNK listings are excluded via the FMP company profile flags. NYSE breakpoints means each month’s decile cutoffs on MAX are computed using ONLY NYSE-listed stocks, then the full NYSE-plus-NASDAQ-plus-AMEX universe is assigned to deciles using those cutoffs. Value-weighted means each decile’s mean next-month return uses the FMP snapshot marketCap as the weight. Individual stock next-month returns are winsorized at [-0.90, +1.50] to neutralize a handful of extreme single-stock observations (biotech approvals, meme frenzies, and suspected split artifacts) that would otherwise distort specific months.
  • Sensitivity A (equal-weighted, filtered, winsorized). Equal-weighted, universe-breakpoint, with the price ≥ $5 and dollar volume ≥ $1M filter and the same winsorization as primary.
  • Sensitivity B (equal-weighted, filtered, no winsorization). Same filter as A without winsorization. Shows how much of the effect survives if the extreme single-stock tails are left in.
  • Sensitivity C (equal-weighted, no filter, no winsorization). The rawest specification. No filter, no winsorization, equal-weighted, universe breakpoints. Maximum microcap noise.

Pre-registered verdict thresholds, committed before the script was first run:

  • Replicated: mean(D1 - D10) > 0.007 and t > 2
  • Degraded: 0 < mean ≤ 0.007 and t > 2
  • Failed: mean ≤ 0 or t ≤ 2
  • Inconclusive: a data quality issue prevents a clean call

The 0.007 floor is 70% of the original >1% claim.

Disclosed departures from a strict CRSP-quality replication

  • The value-weighting uses a snapshot market cap from the FMP company profile dump, not a time-varying monthly market cap series. Over a 10-year sample, the cross-sectional ordering of market caps is approximately stable (large caps stay large), but the absolute weights drift. A time-varying VW would be stricter, and is the next robustness improvement we would make if this spec survived.
  • Exchange listing is also a snapshot. A stock that moved between NASDAQ and NYSE during the sample is treated as fixed at its current exchange.
  • A handful of observations in the raw cache are implausibly extreme and some are likely unadjusted split artifacts, which is why the winsorization is applied in the primary. We show both winsorized and unwinsorized sensitivities so the reader can see how much of the effect depends on this choice.
  • The universe is the operator’s Numerai-aligned cache, not the full CRSP NYSE/AMEX/NASDAQ footprint. It is biased toward names that clear a minimum data-availability bar. After the stocks-only filter, the primary spec sorts a median of 3,067 stocks per month.

The numbers

Primary specification

Value-weighted, NYSE breakpoints, stocks-only, winsorized individual monthly returns.

MetricValue
Sample months122
Median stocks / month3,067
D1 (low MAX) mean monthly return+1.193%
D10 (high MAX) mean monthly return+2.994%
D1 minus D10 mean monthly-1.801%
D1 minus D10 t-statistic-2.57
Annualized Sharpe of D1 minus D10-0.81
Worst month-27.48%
Best month+14.24%

The pre-registered call: mean ≤ 0, so the verdict is FAILED. The claim that the lowest-MAX decile beats the highest-MAX decile by more than 1% per month is not supported. The sign is inverted and the inversion clears the conventional |t| > 2 significance bar.

Sensitivity A (equal-weighted, filtered, winsorized)

MetricValue
Median stocks / month2,507
D1 minus D10 mean monthly-2.113%
D1 minus D10 t-statistic-3.95

This was the initial primary at first publication. The inversion is larger in equal-weighted form, as expected, because equal-weighting upweights the small-cap end of the decile where the effect is most extreme.

Sensitivity B (equal-weighted, filtered, no winsorization)

MetricValue
D1 minus D10 mean monthly-4.849%
D1 minus D10 t-statistic-4.12
Worst month-114.81%

The unwinsorized spec is distorted by extreme single-stock tails in specific months (the -114% month is the signature of a decile bucket containing a stock with an unadjusted split-like return), which is exactly why the primary winsorizes. The direction of the effect is unchanged.

Sensitivity C (equal-weighted, no filter, no winsorization)

MetricValue
Median stocks / month3,462
D1 minus D10 mean monthly-1.824%
D1 minus D10 t-statistic-1.81

On the fully raw universe, the point estimate is still negative and close to the primary in magnitude, but the t-stat drops below the |2| bar because microcap tail noise inflates the standard error. The mean is still negative, so the rubric still returns FAILED, and the result is consistent across all four specifications.

What this means

The original Bali, Cakici, Whitelaw 2011 paper reported that in 1962 to 2005, investors overpaid for stocks that had recently delivered lottery-like payoffs and were compensated with negative subsequent returns, to the tune of more than 1% per month on the decile spread. In the 2016 to 2026 US equity market we tested, under the closest analog to the original methodology the available data supports, the effect has inverted. Stocks with the highest MAX in month t have, on average, outperformed stocks with the lowest MAX in month t+1 by roughly 1.8 percentage points per month on the value-weighted NYSE-breakpoint primary, and by between 1.8 and 4.8 percentage points on the equal-weighted sensitivities. The t-statistic of the primary is -2.57, above the conventional |2| significance bar and below the stricter |3| hurdle that Harvey, Liu, and Zhu (2016) advocate for a multi-tested literature.

The important thing about this round is that the inversion is not a microcap artifact. Value-weighting the decile portfolios with NYSE-computed breakpoints was the obvious rebuttal to the initial equal-weighted result, and the effect survives it. Magnitude shrinks, as expected, but the sign and the significance both hold.

We flag three plausible drivers, none of which this run can pin down definitively.

  1. Retail flow. The rise of commission-free retail trading, fractional shares, and option-buying onto meme-like names has plausibly shifted the marginal price-setter in high-MAX stocks. If retail has been a persistent net buyer of lottery names at scale, the anomaly the original paper identified as compensation to lottery-averse investors could mechanically flip.
  2. Residual methodology gap. Our VW weights are snapshot marketCaps from the FMP profile dump, not time-varying. A strict time-varying VW on CRSP-quality data is the next robustness test, and it is the one piece of methodology still standing between this run and a genuinely like-for-like replication. We commit to adding it in a future update.
  3. Regime change. The 2016 to 2026 window includes zero rates (2016 to 2022), a brief tightening cycle, COVID, the retail trading boom, and a persistent bull market in US growth names. Many published US equity factors have documented decay or reversal in similar windows. A sign flip is not unique to MAX.

By the pre-registered rubric, this is a failed replication. The additional finding that the sign has inverted survives the standard value-weighted NYSE-breakpoint rebuttal at conventional significance. The archive will track whether a strict time-varying VW on CRSP-quality data eventually brings the original sign back.

Update — 2026-04-11 — Value-weighted NYSE-breakpoint primary added

This verdict was first published on 2026-04-11 with an equal-weighted universe-breakpoint primary specification and a disclosed next step: “A value-weighted NYSE-breakpoint run on CRSP-quality data reverses the sign back to positive and significant” would change the call. Later the same day, the FMP company profile dump was located in the operator’s databank with exchange listing, market cap, and ETF/fund flags, which made a VW NYSE-breakpoint spec tractable without any new API ingestion. It was added as the new primary, and the original equal-weighted spec was demoted to Sensitivity A.

Outcome of the robustness test. The value-weighted NYSE-breakpoint primary also returns FAILED with mean -1.801% per month and t = -2.57. The headline verdict does not change. The magnitude shrinks from -2.113% (equal-weighted) to -1.801% (value-weighted), which is expected, and the inversion is no longer a microcap story. The snapshot nature of the FMP market cap is the remaining methodological gap between this and a strict CRSP-quality replication, and is the pre-committed next step if we revisit this verdict again.

Reproducibility

The replication is a single Python file with no custom dependencies beyond pandas and numpy. It reads the operator’s pre-pickled 10-year daily OHLCV cache and a parquet of FMP company profiles, computes monthly MAX and next-month returns per symbol, merges exchange and market cap, forms NYSE-breakpoint deciles, and runs all four specifications. Total runtime on a laptop is about 23 seconds.

  • Script: scripts/verdicts/bali_cakici_whitelaw_2011_max.py (Nullberg repository)
  • Results JSON: scripts/verdicts/bali_cakici_whitelaw_2011_max.results.json
  • Monthly long-short series CSVs: bali_cakici_whitelaw_2011_max.monthly_ls_primary.csv (VW NYSE), ..._sensA.csv, ..._sensB.csv, ..._sensC.csv

A public GitHub mirror with the replication notebooks is being set up. In the interim the files are committed to the Nullberg site repository and will be moved to the public repo on the same path.

What we will track from here

This verdict enters the living archive as failed and stays there until at least one of the following happens. If it does, the entry is updated, a dated changelog is appended, and the old call is kept visible.

  1. A strict time-varying value-weighted run on CRSP-quality data with properly updated monthly market caps reverses the sign.
  2. The sign flips again in forward months as the universe updates.
  3. Additional robustness tests (sub-sample stability, industry-controlled, volatility-controlled, size-decile-controlled) materially change the conclusion.

Update — 2026-04-11 — Newey-West and sub-sample stability backfill

The Four verdicts, four shapes primer committed to reporting Newey-West HAC standard errors and sub-sample stability on every verdict going forward. This section applies the same analysis to the MAX verdict’s VW NYSE-breakpoint primary monthly long-short series (bali_cakici_whitelaw_2011_max.monthly_ls_primary.csv). No replication was re-run; the numbers below come from the monthly series produced by the original run.

Newey-West robustness

The Newey-West 1987 HAC long-run variance estimator with a Bartlett kernel and 12 lags (the standard window for monthly factor data) gives the following for the VW NYSE-breakpoint primary:

MetricValue
Sample months122
Mean monthly D1 - D10-1.801%
i.i.d. t-statistic-2.57
Newey-West 12-lag t-statistic-2.18

The Newey-West adjustment raises the standard error by approximately 18%, which shrinks |t| from 2.57 to 2.18. The sign inversion still clears |t| > 2 under the stricter HAC standard. The inversion is robust to the autocorrelation adjustment.

Sub-sample stability

Splitting the 122-month sample at its midpoint:

HalfMonthsFirst monthLast monthMean monthlyi.i.d. t
First half612016-012021-01-2.162%-2.29
Second half612021-022026-02-1.440%-1.38

The inversion is concentrated in the first half of the sample. 2016-01 through 2021-01 shows a significant negative spread (|t| > 2), and the 2021-02 through 2026-02 second half shows a smaller magnitude and a t-statistic that does not reach the significance bar. The first half corresponds to the late-cycle growth-stock dominance regime leading into COVID; the second half includes the 2022 rate-shock repricing and the subsequent partial normalization. The inversion shows up more strongly in the growth-heavy first half, which is consistent with the hypothesis that retail lottery-seeking was strongest in the zero-rate era.

Verdict impact

No change. The pre-registered rubric was mean ≤ 0 OR t ≤ 2 → FAILED. The primary mean is -1.801% (≤ 0), so the verdict would be FAILED regardless of the t-statistic. The Newey-West adjustment does not change that. What the backfill adds is evidence that the failure is robust (it holds under HAC) and regime-concentrated (it was stronger in the first half of the sample than the second). A reader deciding whether to put weight on the inversion as a forward-looking claim should note that the magnitude is smaller in the more recent half, which is consistent with the regime hypothesis in the synthesis primer but does not by itself distinguish between regime change and mean-reversion.

What changed in the archive’s process

This is the first example of the retrospective Newey-West and sub-sample backfill committed in the Four verdicts, four shapes primer. Momentum and profitability were backfilled in the same pass. Going forward, every new verdict will include these diagnostics at first publication, not as a retrospective update.

Update — 2026-04-11 — Time-varying VW attempt flagged as anomalous

The Four verdicts, four shapes primer committed to a strict time-varying VW re-run on the MAX verdict as a pre-scheduled next step. This update documents the attempt and the reason it is not being adopted as the new headline.

The attempt

We rebuilt the VW NYSE-breakpoint primary using time-varying market cap instead of the FMP snapshot marketCap. The time-varying cap is shares_outstanding × close[end of prior month], using the weightedAverageShsOut field from FMP income statements joined point-in-time on filingDate. The “lagged close” choice (end of prior month rather than same-month end) was made explicitly to sever any feedback between the MAX score (max daily return in month t) and the VW weight.

All other aspects of the primary — the universe, the stocks-only filter, NYSE breakpoints, winsorization at [-0.90, +1.50], sign convention (D1 minus D10) — were held identical.

The result

MetricSnapshot VW (original primary)Time-varying VW (this attempt)
Mean monthly D1 - D10-1.801%+4.677%
i.i.d. t-statistic-2.57+2.45
Newey-West 12-lag t-2.18+2.98
Worst month-27.48%-62.07%
Best month+14.24%+71.91%
DirectionNegative, invertedPositive, “replicated”

The time-varying VW spec produces a complete sign flip and a magnitude (+4.68% per month, +56% annualized) that is approximately 4.5x the original Bali et al. published result in the 1962-2005 sample. It also disagrees, in direction and in magnitude, with every other MAX specification in the Nullberg archive:

SpecMeanDirection
Snapshot VW primary-1.80%Inverted
Sens A (EW filtered winsorized)-2.11%Inverted
Sens B (EW filtered, no winsor)-4.85%Inverted
Sens C (EW raw)-1.82%Inverted
Time-varying VW (this attempt)+4.68%Positive

One specification producing a result that contradicts four other specifications on the same data and the same score is a classic signal of a methodology artifact, not a genuine effect.

Diagnosis

We investigated whether the time-varying cap feed was bad (weightedAverageShsOut unit mismatch, reverse-split pollution, or filing-lag errors). Known large caps computed cleanly: Apple $3.84T, Microsoft $2.77T, NVIDIA $4.47T, which all match the expected 2026 market caps to within rounding. So the shares-outstanding data is not the problem.

The likely cause is more subtle and is specific to MAX. The MAX score is itself a price-move measure. When VW-weighting a decile portfolio by time-varying market cap, the stocks whose MAX event caused their price to rise end up with disproportionately inflated weights in the D10 bucket, even when the weight is computed from the prior month’s close. The interaction between the sort characteristic (a recent price move) and the weighting scheme (current or lagged market cap, which depends on the same recent price dynamics at a one-month lag) produces a set of inflated weights on stocks with extreme realized returns. These stocks then contribute their large next-month returns to the decile average with unusually heavy weight, and the resulting long-short is dominated by a small number of very volatile names.

By contrast, the snapshot VW uses a fixed set of weights that do not depend on historical price dynamics at all, so it avoids this interaction entirely and produces the stable, consistently negative result that all four original MAX specifications agreed on.

Decision

The snapshot VW remains the verdict’s headline primary. Reasons:

  1. It was pre-registered as the primary in the original verdict publication and in the initial update. The pre-registration is load-bearing.
  2. Every other specification in the archive (three equal-weighted sensitivities) agrees with it in direction and magnitude.
  3. The time-varying VW attempt produces a result that disagrees with those four specifications at high statistical significance, which is the signature of a methodology artifact rather than a genuine signal.
  4. “Expert-level quant reasoning” includes knowing when not to adopt a methodology upgrade that produces results contradicting every independent cross-check. Nullberg’s math-verification rule explicitly mandates that unverified or anomalous math is not published as a replacement for the current result.

The time-varying VW monthly series is committed to the repository at scripts/verdicts/max_time_varying_vw.monthly_ls.csv and the JSON summary is at scripts/verdicts/max_time_varying_vw.results.json for readers who want to investigate the artifact themselves. Nullberg is not adopting it as the headline because it fails the internal consistency check every other MAX spec passes.

The lesson for the archive: MAX is a price-move factor, and value-weighted price-move factors require especially careful weighting construction. A future update could try (a) the Fama-French-style 2×3 size × MAX sort, which uses discrete portfolio membership rather than continuous VW weights, (b) a version of the score that uses the max daily return scaled by realized volatility, removing the compounding-with-weights issue, or (c) a size-decile-neutralized MAX factor that orthogonalizes the weights against recent returns. Each of those is its own research question. The snapshot-VW primary remains the verdict in the interim.

What did not change

The backfill update of 2026-04-11 remains in force. The snapshot-VW primary is still FAILED under the pre-registered rubric, still robust to Newey-West (t = -2.18), still concentrated in the 2016-2021 first half.

Bibliography

  1. Bali, Turan G., Nusret Cakici, and Robert F. Whitelaw. “Maxing Out: Stocks as Lotteries and the Cross-Section of Expected Returns.” Journal of Financial Economics 99(2), 2011, pp. 427-446. Paper
  2. Newey, Whitney K., and Kenneth D. West. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55(3), 1987, pp. 703-708. The HAC standard error estimator used for the Newey-West t-statistic in this update.