Gross profitability, out of sample, 2016 to 2026

The claim

Robert Novy-Marx’s 2013 paper in the Journal of Financial Economics is one of the most influential factor papers of the last decade. It proposes that firms with high gross profits relative to assets earn significantly higher subsequent returns than firms with low gross profits relative to assets, and that the effect has “roughly the same power as book-to-market” in predicting the cross section. The first two sentences of the abstract, verbatim:

“Profitability, measured by gross profits-to-assets, has roughly the same power as book-to-market predicting the cross section of average returns. Profitable firms generate significantly higher returns than unprofitable firms, despite having significantly higher valuation ratios.”

The score is gross profits-to-assets = (REVT - COGS) / AT, in Compustat variable names, or in the FMP data we use: grossProfit / totalAssets. Original sample: July 1963 to December 2010, NYSE/AMEX/NASDAQ with NYSE breakpoints, value-weighted decile portfolios, annual rebalancing at the end of June (Novy-Marx 2013).

Secondary sources report the canonical long-short monthly spread near 0.31% per month for the value-weighted NYSE-breakpoint specification in the original sample.

What we tested

This is the third Nullberg verdict. The first, on the MAX factor, found a statistically significant inversion. The second, on 12-2 momentum, found decay to zero. Profitability is a fundamentals-based factor from a different literature altogether. If the replication crisis is real, different factor classes should fail for different reasons. This verdict addresses whether a published anomaly grounded in accounting rather than prices survives the same post-2015 out-of-sample test.

Sample and data

2016-01-05 to 2026-04-09 daily closes, 123 usable formation months
5,568 US equities in the 10-year price cache, merged with 8,479 rows of FMP company profile data, filtered to stocks-only on main exchanges
Fundamentals: raw FMP quarterly income statements and balance sheets at databank/fmp/income_statements.parquet (496,643 rows) and databank/fmp/balance_sheets.parquet (485,364 rows)
Primary score: trailing twelve months gross profit divided by most recent quarterly total assets, computed per symbol from raw quarterly filings with point-in-time filingDate availability
No look-ahead bias: for each formation month t, the score uses only fundamentals filed on or before the last day of month t, joined via pandas.merge_asof on filingDate with direction="backward"
Median 3,097 stocks per month in the primary spec
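
A minimal sketch of the point-in-time score construction described above, assuming pandas and a toy one-symbol frame. The column names grossProfit, totalAssets and filingDate come from the text; the symbol AAA, its values, and the month_end column are hypothetical. The rolling sum also assumes that the four most recent rows are consecutive quarters, which real filings do not always guarantee.

```python
import pandas as pd

# Hypothetical quarterly fundamentals for one symbol (illustrative values only).
fund = pd.DataFrame({
    "symbol": ["AAA"] * 5,
    "filingDate": pd.to_datetime(
        ["2019-05-01", "2019-08-01", "2019-11-01", "2020-02-15", "2020-05-01"]),
    "grossProfit": [10.0, 11.0, 12.0, 13.0, 14.0],
    "totalAssets": [100.0, 100.0, 110.0, 110.0, 120.0],
}).sort_values(["symbol", "filingDate"])

# TTM gross profit: rolling sum over the last four reported quarters per symbol.
fund["ttm_gross_profit"] = (
    fund.groupby("symbol")["grossProfit"]
        .transform(lambda s: s.rolling(4, min_periods=4).sum())
)
fund["gp_a_ttm"] = fund["ttm_gross_profit"] / fund["totalAssets"]
fund = fund.dropna(subset=["gp_a_ttm"])

# One row per (symbol, formation month end).
formation = pd.DataFrame({
    "symbol": ["AAA"] * 3,
    "month_end": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-05-31"]),
})

# Backward as-of join: each month end picks the latest filing with
# filingDate <= month_end. Both frames must be sorted on their join keys.
pit = pd.merge_asof(
    formation.sort_values("month_end"),
    fund.sort_values("filingDate"),
    left_on="month_end",
    right_on="filingDate",
    by="symbol",
    direction="backward",
)
print(pit[["symbol", "month_end", "filingDate", "gp_a_ttm"]])
```

In this toy frame the January 2020 formation month has no score, because the fourth quarter needed for a full TTM sum is not filed until 2020-02-15. That is the behaviour the backward join is meant to enforce.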

Methodology

For each quarterly filing per symbol, compute TTM gross profit as the 4-quarter rolling sum of reported quarterly grossProfit
Pair with the same-quarter totalAssets from the balance sheet filing
Compute gp_a_ttm = ttm_gross_profit / totalAssets
As of the last day of each formation month t, for each symbol, pick the most recent quarterly observation with filingDate <= month_end(t)
Sort the resulting cross section into deciles on gp_a_ttm using NYSE-listed breakpoints
Form the long-short: D10 (high gp_a, profitable) minus D1 (low gp_a, unprofitable)
Hold for one month, rebalance monthly
Report mean, standard deviation, t-statistic, annualized Sharpe of the monthly long-short series
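
The sort and long-short steps above can be sketched for a single formation month as follows. This is an illustration under stated assumptions, not the verdict script: the column names exchange, marketCap and ret_next and the helper names are hypothetical, and the tie rule (a score equal to a breakpoint goes to the higher decile) is a choice made here.

```python
import numpy as np
import pandas as pd

def assign_deciles_nyse(cross: pd.DataFrame) -> pd.Series:
    """Assign deciles 1..10 to every stock using breakpoints from NYSE names only."""
    nyse_scores = cross.loc[cross["exchange"] == "NYSE", "gp_a_ttm"]
    # Nine interior breakpoints: the 10th, 20th, ..., 90th NYSE percentiles.
    cuts = nyse_scores.quantile(np.linspace(0.1, 0.9, 9)).to_numpy()
    # Count the breakpoints at or below each score, then add 1 for the label.
    labels = np.searchsorted(cuts, cross["gp_a_ttm"].to_numpy(), side="right") + 1
    return pd.Series(labels, index=cross.index, name="decile")

def long_short(cross: pd.DataFrame) -> float:
    """Value-weighted D10 minus D1 next-month return for one cross section."""
    decile = assign_deciles_nyse(cross)

    def vw(k: int) -> float:
        g = cross[decile == k]
        return float(np.average(g["ret_next"], weights=g["marketCap"]))

    return vw(10) - vw(1)

# Toy cross section: 20 NYSE names plus two NASDAQ names (all values hypothetical).
scores = [i / 100 for i in range(1, 21)] + [0.005, 0.50]
cross = pd.DataFrame({
    "exchange": ["NYSE"] * 20 + ["NASDAQ"] * 2,
    "gp_a_ttm": scores,
    "marketCap": [1.0] * 22,
    "ret_next": [0.1 * s for s in scores],
})
spread = long_short(cross)
print(f"D10 - D1 = {spread:.4f}")
```

Because the breakpoints come from NYSE names only, the two NASDAQ names are placed into deciles 1 and 10 without shifting the cut points, so the extreme deciles can hold more names than the middle ones.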

Pre-registered verdict thresholds, calibrated specifically for Novy-Marx’s smaller canonical magnitude of ~0.31% per month (the MAX and momentum verdicts used a 0.007 floor appropriate for their ~1% canonicals):

Replicated: mean(D10 - D1) > 0.002 AND t > 2
Degraded: 0 < mean(D10 - D1) ≤ 0.002 AND t > 2
Failed: mean ≤ 0 OR t ≤ 2
Inconclusive: a data quality issue prevents a clean call

The 0.002 floor is approximately 65% of the canonical 0.31% per month. Paper-specific calibration is disclosed in the results JSON and on this page.
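
The three mechanical branches of the rubric can be written as a small function. The function name and argument names are hypothetical; the thresholds are the ones stated above. The Inconclusive outcome is a data-quality judgment and is deliberately left out of this sketch.

```python
def classify(mean_ls: float, t_stat: float,
             floor: float = 0.002, t_min: float = 2.0) -> str:
    """Apply the pre-registered thresholds to a monthly long-short series.

    mean_ls is the mean monthly D10 - D1 return as a decimal (0.002 = 0.2%).
    """
    if mean_ls <= 0 or t_stat <= t_min:
        return "FAILED"
    if mean_ls > floor:
        return "REPLICATED"
    return "DEGRADED"

# A mean of 0.503% per month with t = 0.93 fails on the t-statistic alone.
print(classify(0.00503, 0.93))
```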

Four specifications reported in full:

Primary (VW, NYSE breakpoints, stocks-only, TTM GP/A, winsorized). Closest analog to the original methodology the available data supports. Value-weighted with FMP snapshot marketCap. Stock-level next-month returns winsorized at [-0.90, +1.50].
Sensitivity A (EW, filtered, TTM GP/A, winsorized). Equal-weighted universe, price ≥ $5 and dollar volume ≥ $1M filter, same TTM score.
Sensitivity B (EW, filtered, quarterly single-Q GP/A, winsorized). Same filter, but the score uses only the most recent quarterly grossProfit rather than the full TTM rolling sum.
Sensitivity C (EW, no filter, TTM GP/A, no winsorization). The rawest specification. No liquidity filter, no winsorization.
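
To make the difference between the specifications concrete, here is a minimal sketch of one portfolio's monthly return under the weighting and winsorization choices. It assumes that "winsorized at [-0.90, +1.50]" means clipping stock-level returns at those fixed bounds; the function name and the three-stock example are hypothetical.

```python
import numpy as np

def portfolio_return(returns, market_caps=None, winsorize=True):
    """Equal-weighted (market_caps=None) or value-weighted portfolio return."""
    r = np.asarray(returns, dtype=float)
    if winsorize:
        # Clip stock-level returns at the fixed bounds quoted in the text.
        r = np.clip(r, -0.90, 1.50)
    if market_caps is None:
        return float(r.mean())
    return float(np.average(r, weights=np.asarray(market_caps, dtype=float)))

# Hypothetical month: two ordinary names and one microcap that rises 600%.
rets = [0.02, -0.05, 6.00]
caps = [900.0, 90.0, 10.0]

ew_raw = portfolio_return(rets, winsorize=False)   # Sensitivity C style
ew_win = portfolio_return(rets)                    # Sensitivity A/B style
vw_win = portfolio_return(rets, caps)              # primary style
print(ew_raw, ew_win, vw_win)
```

In this toy month the microcap dominates the equal-weighted, unwinsorized return, is dampened by the clip, and nearly vanishes under value weights.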

The numbers

Primary specification

Value-weighted, NYSE breakpoints, stocks-only, TTM GP/A, winsorized.

Metric	Value
Sample months	123
Median stocks / month	3,097
D1 (unprofitable) mean monthly	+1.525%
D10 (profitable) mean monthly	+2.028%
D10 minus D1 mean monthly	+0.503%
D10 minus D1 t-statistic	+0.93
Annualized Sharpe of D10 minus D1	+0.29
Worst month	-15.96%
Best month	+25.36%

The primary points in the predicted direction at roughly 1.6 times the canonical magnitude: the point estimate of 0.503% per month exceeds Novy-Marx’s reported ~0.31%. It still fails the rubric because of the t-statistic, +0.93, well below the t > 2 threshold. The long-short portfolio is too volatile to resolve a 0.5% per month signal in a 123-month window.

Pre-registered call: t ≤ 2, so the verdict is FAILED.
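
For reference, the reported statistics follow from a monthly long-short series by standard formulas. The sketch below assumes a sample standard deviation (ddof=1) and an annualization factor of sqrt(12); the function name and the four-month example series are hypothetical.

```python
import numpy as np

def summarize(monthly_ls):
    """Mean, std, i.i.d. t-statistic and annualized Sharpe of a monthly series."""
    x = np.asarray(monthly_ls, dtype=float)
    n = x.size
    mean = x.mean()
    std = x.std(ddof=1)                 # sample standard deviation
    t_stat = mean / (std / np.sqrt(n))  # i.i.d. standard error of the mean
    sharpe = mean / std * np.sqrt(12)   # annualized from monthly data
    return {"n": n, "mean": mean, "std": std, "t_stat": t_stat, "sharpe": sharpe}

stats = summarize([0.01, -0.01, 0.03, 0.01])
print(stats)
```

The two ratios are tied together by Sharpe = t × sqrt(12 / n). Applied to the table above, 0.93 × sqrt(12 / 123) ≈ 0.29, which matches the reported annualized Sharpe.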

Sensitivity A (EW, filtered, TTM)

Metric	Value
Median stocks / month	2,437
D1 mean monthly	+1.731%
D10 mean monthly	+1.336%
D10 minus D1 mean monthly	-0.396%
t-statistic	-1.15

Equal-weighting flips the sign. Under EW with a liquidity filter, unprofitable stocks slightly outperform profitable stocks. This is an important asymmetry: the profitability effect shows up in the value-weighted, large-cap-dominated primary but not in the equal-weighted, liquidity-filtered portfolio, consistent with post-2013 literature that has increasingly characterized quality and profitability factors as a large-cap phenomenon. On equal-weighted mid-caps and small-caps, the direction reverses. Neither direction is statistically significant.

Sensitivity B (EW, filtered, quarterly GP/A)

Metric	Value
D10 minus D1 mean monthly	-0.438%
t-statistic	-1.28

Using single-quarter grossProfit rather than TTM gives essentially the same picture as Sensitivity A: slight inversion under EW, not significant.

Sensitivity C (EW, no filter, TTM)

Metric	Value
Median stocks / month	3,327
D10 minus D1 mean monthly	+0.565%
t-statistic	+0.82
Worst month	-11.78%
Best month	+66.30%

Removing the liquidity filter pulls the direction back to positive and the magnitude up to roughly the primary, but the standard error inflates on microcap tail risk (the +66.3% best month is a signature microcap event). Not significant.

What this means

Profitability is the closest of Nullberg’s first three verdicts to surviving. The VW NYSE-breakpoint primary goes in the direction the paper predicted, at roughly 1.6 times the paper’s own magnitude, on 123 months of independently sourced out-of-sample data with a clean point-in-time fundamentals join. What it lacks is statistical power. Because t scales with the square root of the sample length, clearing t > 2 at this point estimate would take roughly (2 / 0.93)² ≈ 4.6 times the sample period, or a meaningful reduction in the long-short portfolio’s monthly standard deviation of ~6%.

Three additional observations the reader deserves:

The direction matters. Unlike MAX (sign inverted) and momentum (decayed to zero), profitability in VW form still points the correct way, at or above the canonical magnitude. A researcher with a longer sample would plausibly replicate the original claim. This matters for the archive’s credibility: Nullberg’s rubric is not mechanically stamping FAILED on every paper, and a factor that deserves the benefit of the doubt gets it.
The EW flip is economically meaningful. That the effect shows up only in VW and reverses in EW filtered mid-caps is consistent with the Asness, Frazzini, and Pedersen 2019 “Quality Minus Junk” literature which argues quality is a large-cap, high-liquidity phenomenon. Our data reinforces that: the profitability premium does not generalize down the market-cap distribution in 2016-2026.
Sample size is the bottleneck. With the monthly std we observed (~6%), the minimum sample for t = 2 under i.i.d. assumptions is about 570 months at our observed mean of ~0.50% per month, and about 1,500 months at the canonical ~0.31% per month. Novy-Marx’s original sample (July 1963 to December 2010, 570 months) is roughly the length the first figure calls for. Ten years is not.
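
The sample-size arithmetic behind this point follows from t = mean / (std / sqrt(n)), which rearranges to n = (t × std / mean)². A minimal sketch under the same i.i.d. assumption; the function name is hypothetical and the inputs are the rounded figures quoted in the text.

```python
import math

def months_for_t(mean: float, std: float, t_target: float = 2.0) -> int:
    """Smallest number of months at which mean / (std / sqrt(n)) reaches t_target."""
    return math.ceil((t_target * std / mean) ** 2)

observed = months_for_t(mean=0.00503, std=0.06)   # our primary point estimate
canonical = months_for_t(mean=0.0031, std=0.06)   # canonical magnitude, same std
print(observed, canonical)
```

With a 6% monthly std, the observed 0.503% mean needs roughly 570 months and the canonical 0.31% mean roughly 1,500 months. The required length is very sensitive to the std input, so these are order-of-magnitude figures.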

By the pre-registered rubric, this is a failed replication across all four specifications. The additional observation that the primary is directionally right at supra-canonical magnitude is a qualitatively different kind of failure from the MAX inversion or the momentum decay, and is documented in full above so the reader can draw their own conclusions.

Comparative picture across three verdicts

Nullberg’s first three verdicts are now:

Paper	Headline result	Primary mean	Primary t	Kind of failure
MAX, Bali Cakici Whitelaw 2011	Sign inverted at significance	-1.80%	-2.57	Inversion
Momentum, Jegadeesh Titman 1993	Decayed to zero	-0.13%	-0.16	Decay
Profitability, Novy-Marx 2013 (this)	Right direction, underpowered	+0.50%	+0.93	Underpowered survivor

Three different factor classes (lottery, trend, fundamentals), three different failure modes, all three flagged FAILED by the same pre-registered rubric. The qualitative spread is the point: the replication crisis is not “everything stopped working the same way”. Different factors are failing in characteristic ways that map to different theories of why they worked in the first place.

Reproducibility

The replication is a single Python file reading the operator’s 10-year daily OHLCV cache plus raw FMP quarterly income statements and balance sheets. The fundamentals join is a point-in-time pandas.merge_asof on filingDate, which rules out look-ahead bias in the fundamentals. Total runtime on a laptop is about 29 seconds.

Script: scripts/verdicts/novy_marx_2013_gross_profitability.py
Results JSON: scripts/verdicts/novy_marx_2013_gross_profitability.results.json
Monthly long-short series CSVs: ..._primary.csv, ..._sensA.csv, ..._sensB.csv, ..._sensC.csv

What we will track from here

This verdict enters the archive as failed and is reviewed when at least one of the following happens:

The sample extends enough forward months that the primary’s t-statistic crosses 2. At the current t-statistic of +0.93 over 123 months, roughly 450 additional months at the same mean and std would be needed under i.i.d. assumptions. This is the most likely future change.
A cleaner industry-neutral or sector-controlled specification materially changes the picture.
An out-of-sample run on non-US developed markets, where Novy-Marx 2013 also reported the effect, produces a different verdict direction.
A strict time-varying value-weighted run with CRSP-quality shares-outstanding data materially changes the primary spec magnitude.

Update — 2026-04-11 — Newey-West and sub-sample stability backfill

This update applies the backfill analysis committed to in the Four verdicts, four shapes primer to the primary (VW NYSE-breakpoint, TTM GP/A, winsorized) monthly long-short series.

Newey-West robustness

Metric	Value
Sample months	123
Mean monthly D10 - D1	+0.503%
i.i.d. t-statistic	+0.93
Newey-West 12-lag t-statistic	+1.05

The Newey-West adjusted t is higher than the i.i.d. t, which is the opposite of the usual direction. This happens when the weighted sum of lag autocovariances is mildly negative on net, implying that the long-run variance is smaller than the i.i.d. variance estimate. Economically, this is consistent with weak monthly mean-reversion in the profitability long-short return series. Either way, both statistics are below the |t| > 2 significance threshold. The “underpowered survivor” characterization is reinforced: the direction is right, the magnitude is at or above the canonical level, but the power is not there under either standard error construction. No specification clears significance in either direction.
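
The mechanism can be illustrated with a small hand-rolled Newey-West t-statistic for a sample mean, using Bartlett weights 1 - k / (lags + 1). This is a sketch, not the backfill script: the function names are hypothetical, the i.i.d. comparison uses a population variance (ddof=0) so the two statistics differ only in the autocovariance terms, and the alternating series is an artificial example of strong mean-reversion.

```python
import numpy as np

def iid_tstat(x):
    """t-statistic of the mean, ignoring autocorrelation."""
    x = np.asarray(x, dtype=float)
    return x.mean() / (x.std(ddof=0) / np.sqrt(x.size))

def newey_west_tstat(x, lags=12):
    """t-statistic of the mean with a Bartlett-kernel long-run variance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    e = x - x.mean()
    long_run_var = e @ e / n                 # lag-0 autocovariance
    for k in range(1, lags + 1):
        gamma_k = e[k:] @ e[:-k] / n         # lag-k autocovariance
        weight = 1.0 - k / (lags + 1.0)      # Bartlett weight
        long_run_var += 2.0 * weight * gamma_k
    return x.mean() / np.sqrt(long_run_var / n)

# Artificial mean-reverting series: returns alternate around a 0.5% mean.
alternating = 0.005 + 0.06 * np.array([1.0, -1.0] * 60)
print(iid_tstat(alternating), newey_west_tstat(alternating))
```

With negative autocovariances at odd lags, the weighted sum lowers the long-run variance and the Newey-West t comes out above the i.i.d. t, which is the same direction as in the table, only far more extreme.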

Sub-sample stability

Half	Months	First month	Last month	Mean monthly	i.i.d. t
First half	61	2016-01	2021-01	+0.421%	+0.54
Second half	62	2021-02	2026-03	+0.584%	+0.77

Both halves are consistently positive. The first half averages +0.42% per month (t = +0.54), the second half +0.58% per month (t = +0.77). Neither half individually clears significance, but neither half flips sign. This is qualitatively different from the value verdict, where the full-sample replication was driven almost entirely by the 2022 rebound.

Profitability’s “underpowered survivor” characterization is strengthened by this split: the factor is showing up at roughly its canonical magnitude in both halves of the sample, consistently and without dramatic regime-dependence. The reason it fails the rubric is the monthly volatility of the long-short series, not regime-fragility. If the forward sample extends enough months to shrink the standard error without changing the mean, the verdict would flip to REPLICATED or DEGRADED.
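
A minimal sketch of the half-sample split, plus a consistency check on the reported table. The function name is hypothetical and the split rule (first half gets the floor of n / 2 months) is an assumption that happens to reproduce the 61 / 62 division above.

```python
import numpy as np

def half_split(monthly_ls):
    """Return (months, mean, i.i.d. t) for the first and second half of a series."""
    x = np.asarray(monthly_ls, dtype=float)
    mid = x.size // 2
    rows = []
    for part in (x[:mid], x[mid:]):
        se = part.std(ddof=1) / np.sqrt(part.size)
        rows.append((part.size, part.mean(), part.mean() / se))
    return rows

# A 123-month placeholder series splits into 61 and 62 months.
sizes = [row[0] for row in half_split(np.linspace(-0.05, 0.06, 123))]

# Month-weighted average of the reported half means, in percent per month.
recombined = (61 * 0.421 + 62 * 0.584) / 123
print(sizes, round(recombined, 3))
```

The recombined mean comes out at about 0.503% per month, in line with the full-sample figure, so the two halves and the headline number are internally consistent.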

What this sharpens

The original verdict said “underpowered survivor”. The backfill shows the survival is:

Consistent across sub-samples (both halves positive at similar magnitudes)
Robust to Newey-West (NW t slightly higher than i.i.d. t rather than lower)
Not regime-driven in the way the value verdict is

This is the most favorable picture any of the three failed verdicts produces, and it puts profitability at the top of the “candidate for future REPLICATED with more data” list.

Verdict impact

No change. The pre-registered rubric is mean ≤ 0 OR t ≤ 2 → FAILED. Mean is +0.503% and both t-statistics are below 2. Verdict remains FAILED. But the backfill sharpens the interpretation toward “consistently underpowered” rather than “underpowered-but-fragile”.

Update — 2026-04-11 — 2×3 size × profitability (PMU) construction

The Four verdicts, four shapes primer committed to running a Fama-French-style 2×3 size × profitability double sort on the gross profitability score, as a cross-verdict-comparable construction analogous to how the value verdict uses HML 2×3. This update delivers that. It does not replace the simple D10-D1 primary; it sits alongside it as an additional robustness spec.

Construction

Following the Asness-Frazzini-Pedersen “QMJ-style” factor construction, we build a 2×3 independent sort on (size, profitability):

Size breakpoint: NYSE median snapshot marketCap (same as the simple-decile primary)
Profitability breakpoints: NYSE 30th and 70th percentiles of the gp_a_ttm score
Six portfolios: size in {S, B} crossed with profitability in {U, N, P}, value-weighted within each
PMU (Profitable Minus Unprofitable): 0.5 × (SP + BP) − 0.5 × (SU + BU), the direct analog of HML’s (SV + BV)/2 − (SG + BG)/2
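
The construction above can be sketched for one formation month as follows. This is an illustration under stated assumptions rather than the update script: the column names exchange, marketCap and ret_next, the function name, the tie rules (a cap at the median counts as big, a score at the 30th percentile counts as unprofitable) and the eight-stock cross section are all hypothetical.

```python
import numpy as np
import pandas as pd

def pmu_one_month(cross: pd.DataFrame) -> float:
    """2x3 independent sort with NYSE breakpoints; returns PMU for one month."""
    nyse = cross[cross["exchange"] == "NYSE"]
    size_cut = nyse["marketCap"].median()
    p30, p70 = nyse["gp_a_ttm"].quantile([0.3, 0.7]).to_numpy()

    # Independent sorts: size and profitability buckets are assigned separately.
    size = np.where(cross["marketCap"] >= size_cut, "B", "S")
    prof = np.where(cross["gp_a_ttm"] <= p30, "U",
                    np.where(cross["gp_a_ttm"] > p70, "P", "N"))
    bucket = pd.Series([s + p for s, p in zip(size, prof)], index=cross.index)

    def vw(name: str) -> float:
        g = cross[bucket == name]
        return float(np.average(g["ret_next"], weights=g["marketCap"]))

    return 0.5 * (vw("SP") + vw("BP")) - 0.5 * (vw("SU") + vw("BU"))

# Toy cross section of eight NYSE names (all values hypothetical).
cross = pd.DataFrame({
    "exchange": ["NYSE"] * 8,
    "marketCap": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "gp_a_ttm": [0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45],
    "ret_next": [0.00, 0.02, 0.015, 0.03, 0.01, 0.015, 0.02, 0.04],
})
pmu = pmu_one_month(cross)
print(f"PMU = {pmu:.4f}")
```

A production version would also need to handle months in which one of the four corner portfolios is empty, which this sketch does not.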

Results

Metric	2×3 PMU	Simple D10-D1 (original primary)
Sample months	122	123
Median stocks / month	3,037	3,097
Mean monthly	+0.500%	+0.503%
i.i.d. t-statistic	+1.74	+0.93
Newey-West 12-lag t	+1.58	+1.05
Annualized Sharpe	+0.55	+0.29
Worst month	-8.12%	-15.96%
Best month	+10.61%	+25.36%

The mean is essentially identical (+0.500% vs +0.503%), but the t-statistic nearly doubles under the 2×3 construction (from +0.93 to +1.74). This is the expected benefit of the 2×3 methodology: the 30/40/30 breakpoints produce cleaner, lower-variance portfolios than the 10/10 extreme-decile spread, so the same point estimate has tighter confidence intervals.
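
The size of the variance reduction can be backed out of the reported figures, since the i.i.d. t-statistic is mean / (std / sqrt(n)). The sketch below is arithmetic on the table values only; the function name is hypothetical and the results inherit the rounding of the inputs.

```python
import math

def implied_std(mean_pct: float, t_stat: float, n_months: int) -> float:
    """Monthly std (in percent) implied by a reported mean, i.i.d. t and sample size."""
    return mean_pct * math.sqrt(n_months) / t_stat

std_decile = implied_std(0.503, 0.93, 123)   # simple D10 - D1
std_pmu = implied_std(0.500, 1.74, 122)      # 2x3 PMU
print(f"{std_decile:.2f}% vs {std_pmu:.2f}%")
```

On these inputs the implied monthly std drops from about 6.0% for the decile spread to about 3.2% for PMU, roughly half, which is what lets the same mean produce nearly twice the t-statistic.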

The 2×3 construction is still FAILED by the pre-registered rubric (t = +1.74 < 2), but the headline is much closer to the threshold than the simple decile version. A researcher looking only at the 2×3 construction would describe this as a “nearly-significant positive effect at the canonical magnitude” rather than the simple decile’s “not significant”.

Six-portfolio decomposition

Portfolio	Mean monthly
Small Unprofitable	+1.332%
Small Neutral	+1.057%
Small Profitable	+1.467%
Big Unprofitable	+1.402%
Big Neutral	+1.812%
Big Profitable	+2.268%

The effect lives on the big side. Big Profitable (+2.27%) cleanly dominates Big Unprofitable (+1.40%), for a big-side spread of +0.87% per month. Small Profitable (+1.47%) is barely above Small Unprofitable (+1.33%), for a small-side spread of only +0.14% per month.
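
The link between the six-portfolio table and the headline PMU figure is simple arithmetic: PMU equals the average of the two within-size spreads. The sketch below recomputes it from the reported means; the dictionary keys are shorthand introduced here.

```python
# Reported mean monthly returns, in percent.
means = {"SU": 1.332, "SN": 1.057, "SP": 1.467,
         "BU": 1.402, "BN": 1.812, "BP": 2.268}

small_spread = means["SP"] - means["SU"]   # profitable minus unprofitable, small caps
big_spread = means["BP"] - means["BU"]     # profitable minus unprofitable, big caps

# PMU = 0.5 * (SP + BP) - 0.5 * (SU + BU), i.e. the average of the two spreads.
pmu = 0.5 * (means["SP"] + means["BP"]) - 0.5 * (means["SU"] + means["BU"])

print(f"small {small_spread:.3f}, big {big_spread:.3f}, PMU {pmu:.4f}")
```

The recomputed PMU of about 0.50% agrees with the results table, and the big-side spread accounts for most of it.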

This confirms the “large-cap-only” interpretation we flagged in the original verdict section. Profitability is a big-cap earnings-quality effect in the 2016-2026 regime. The small-cap end is essentially flat.

Sub-sample stability (regime split)

Half	Mean monthly	i.i.d. t
First half (2016 to early 2021)	+0.840%	+2.02
Second half (early 2021 to 2026)	+0.159%	+0.41

The first half of the 2×3 sample REPLICATES at t = +2.02, clearing the significance bar. The second half is flat. The pattern is the opposite of the value verdict’s regime split, where the first half was flat and the second half was strong (driven by the 2022 value rebound). That is a genuinely new archive finding: profitability and value show opposing regime cycles in the same 2016-2026 window. Profitability worked in the growth-dominated first half and faded during the 2021+ value rotation; value failed in the growth-dominated first half and replicated during the 2021+ rotation.

A plausible hypothesis (labeled as hypothesis, per the synthesis primer’s rule): during the 2016-2021 growth/tech boom, quality was priced as an attribute of large profitable mega-caps that drove the market. When the 2022 rate shock forced a rotation into beaten-down value names, those names were not the most-profitable ones, and the profitability premium faded into the value premium.

Verdict impact

No change. The pre-registered rubric is applied to the simple D10-D1 primary (simple verdict shapes are cross-comparable with all other Nullberg verdicts), and that primary is still FAILED. The 2×3 PMU update sharpens the interpretation:

Point estimate unchanged (+0.50%)
Statistical power improved (t from +0.93 to +1.74)
Still below the |2| significance bar on the full sample
First half alone clears significance (t = +2.02)
Second half is flat, the opposite pattern from value
Effect lives on the big-cap side, same as before

The archive’s interpretation of profitability sharpens to: a big-cap earnings-quality effect that worked in the 2016-2021 growth era and faded during the 2021+ value rotation, visible in the 2×3 construction but too underpowered on the full 10-year sample to clear formal significance.

Reproducibility

Script: scripts/verdicts/profitability_2x3_update.py
Results JSON: scripts/verdicts/profitability_2x3.results.json
Monthly PMU CSV: scripts/verdicts/profitability_2x3.monthly_ls.csv

Bibliography

Novy-Marx, Robert. “The other side of value: The gross profitability premium.” Journal of Financial Economics 108(1), 2013, pp. 1-28.
Bali, Turan G., Nusret Cakici, and Robert F. Whitelaw. “Maxing Out: Stocks as Lotteries and the Cross-Section of Expected Returns.” Journal of Financial Economics 99(2), 2011, pp. 427-446. Nullberg verdict: failed, inverted.
Jegadeesh, Narasimhan, and Sheridan Titman. “Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency.” Journal of Finance 48(1), 1993, pp. 65-91. Nullberg verdict: failed, decayed.
Newey, Whitney K., and Kenneth D. West. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55(3), 1987, pp. 703-708. HAC estimator used in the backfill update.
Fama, Eugene F., and Kenneth R. French. “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics 33(1), 1993, pp. 3-56. The HML 2×3 construction adapted here for PMU.