The hypothesis
The premise was that a 31-member GFS ensemble plus the NWS point forecast could out-predict retail traders on Kalshi weather contracts, and that a Gaussian probability model fed by that ensemble would find mispriced single-degree temperature brackets to bet NO on. The earlier research notes pushed this hard: a variable Kalshi fee model (ceil(0.07 × contracts × price × (1−price)) instead of a flat $0.05), extremized log-odds aggregation of ensemble + NWS + base rates, GFS run-timing awareness (data lands ~3.5h after 00/06/12/18Z), and explicit longshot-bias avoidance, all aimed at squeezing a 7-9% edge past the fee threshold.
The model itself was not the problem. Live testing confirmed the ensemble pipeline ran correctly (31 members, ~1.7°F spread). The problem was the market the model was pointed at. The post-2024 Kalshi regime change is unforgiving: after quarterly volume exploded from $30M to $820M, professional market makers entered and takers now lose on average. Any edge had to come from genuinely better information, and on single-degree brackets there was none to be had.
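The variable fee formula quoted above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the signature, the `rate` parameter, and the choice to round up to the next whole cent are assumptions made here for the example.

```python
import math

def kalshi_taker_fee(price: float, contracts: int = 1, rate: float = 0.07) -> float:
    """Taker fee in dollars: rate * contracts * price * (1 - price).

    Assumption for this sketch: the result is rounded UP to the next cent.
    """
    raw = rate * contracts * price * (1.0 - price)
    # Round away float noise first so an exact-cent fee is not bumped up a cent.
    cents = math.ceil(round(raw * 100.0, 9))
    return cents / 100.0

fee_mid = kalshi_taker_fee(0.50)    # one contract priced at $0.50
fee_tail = kalshi_taker_fee(0.10)   # one contract priced at $0.10
```

Under these assumptions one contract at $0.50 costs $0.02 in fees and one at $0.10 costs $0.01, so a flat $0.05 estimate overstates the fee in both cases.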
The evaluation harness
The repository ships the part worth keeping. The walk-forward backtester (empirical_analysis.py, standard library only) replays the committed trade dataset chronologically: for trade i it trains a Laplace-smoothed empirical P(hit) model on trades 0..i−1 only, compares it against the live Gaussian, and reports Brier, win rate, EV, and total P&L at four edge thresholds. The strict no-lookahead split is the whole point: it is what separates a real backtest from curve-fitting.
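```python
import math

# DEFAULT_MAE_FALLBACK is a module-level constant defined elsewhere in the module.

def gaussian_p_hit(nws, lo, hi, mae=DEFAULT_MAE_FALLBACK):
    # For Gaussian errors, MAE = sigma * sqrt(2 / pi), so sigma = MAE * sqrt(pi / 2).
    sigma = mae * math.sqrt(math.pi / 2.0)
    z_lo = (lo - nws) / sigma
    z_hi = (hi - nws) / sigma
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    return max(0.0, cdf(z_hi) - cdf(z_lo))

def brier(pred, outcome):
    # Squared error between the forecast probability and the 0/1 outcome.
    return (pred - outcome) ** 2
```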
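The no-lookahead discipline itself is easy to show in isolation. The sketch below is a deliberately simplified, single-bucket version of the idea (the real backtester presumably conditions on more than a global hit rate); the function names and the smoothing constant `alpha` are illustrative assumptions, not the repository's code.

```python
def laplace_p_hit(hits: int, n: int, alpha: float = 1.0) -> float:
    """Laplace-smoothed hit rate: (hits + alpha) / (n + 2 * alpha)."""
    return (hits + alpha) / (n + 2.0 * alpha)

def walk_forward(outcomes):
    """Predict trade i from trades 0..i-1 only, then score the predictions.

    `outcomes` is a chronological list of 0/1 bracket results.
    Returns (predictions, mean Brier score).
    """
    if not outcomes:
        raise ValueError("need at least one resolved trade")
    preds, hits = [], 0
    for i, outcome in enumerate(outcomes):
        preds.append(laplace_p_hit(hits, i))  # sees only the i earlier trades
        hits += outcome                       # trade i is revealed after predicting
    mean_brier = sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(outcomes)
    return preds, mean_brier

preds, score = walk_forward([1, 0, 1])
```

The key design point is the ordering inside the loop: the prediction is appended before the outcome is added to the running count, so no trade ever informs its own forecast.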
The decision-gate evaluator (c4_eval.py) runs unattended on cron with no Claude session: it pulls shadow predictions, backfills outcomes from the Kalshi API, applies a hard liquidity filter, and scores Brier plus post-fee EV against criteria written down in advance. The EV and verdict logic is verbatim:
```python
fee = main.kalshi_taker_fee(px)
won = (outcome == 1) if side_yes else (outcome == 0)
ev = ((1.0 - px) - fee) if won else (-(px + fee))
evs.append(ev)
...
cell_ok = bool(best_cell and best_cell[1] >= 10 and best_cell[3] > 0)
passed = (ev_mean > 0) and (brier < BRIER_GATE) and cell_ok
```
What the backtest showed
Two cuts settle the case. The first is the calibration table: empirical P(hit), binned by the distance of the NWS forecast from the bracket midpoint, against what the Gaussian predicted. The Gaussian is wrong in the same direction in every bin, underestimating the hit rate by 0.265 to 0.645:
| \|forecast−mid\| < | n | emp P(hit) | Gaussian P(hit) | gap |
|---|---|---|---|---|
| 1.5° | 4 | 0.750 | 0.105 | +0.645 |
| 3.5° | 60 | 0.350 | 0.085 | +0.265 |
| 5.0° | 28 | 0.429 | 0.062 | +0.367 |
| 8.0° | 12 | 0.667 | 0.029 | +0.638 |
The Gaussian Brier across all resolved trades is 0.3705, worse than the 0.25 that a blind coin flip (a constant 0.5 forecast) would score. The second cut is the walk-forward P&L at four edge thresholds. The Gaussian loses at every band, and the per-trade loss only deepens as you demand more edge, because the model's confidence is anti-correlated with reality:
| min_edge | trades | WR | P&L | EV/trade | Brier |
|---|---|---|---|---|---|
| 0.45 | 5 | 20.0% | −$8.05 | −$1.610 | 0.7345 |
| 0.35 | 20 | 35.0% | −$29.32 | −$1.466 | 0.5785 |
| 0.25 | 75 | 52.0% | −$104.15 | −$1.389 | 0.4259 |
| 0.15 | 97 | 56.7% | −$76.81 | −$0.792 | 0.3825 |
The walk-forward loss of −$104.15 at min_edge=0.25 roughly reproduces the bot's live −$94 in that band; the gap comes from data-join differences and unclamped-era accounting. The empirical model is better calibrated (Brier 0.2554 vs 0.3748 on all resolved trades), but it almost never finds a tradeable edge, which is itself the finding: there is nothing to trade.
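For readers less familiar with Brier scores, the 0.25 coin-flip baseline used above can be checked directly. The outcomes and the 5% forecast below are made-up illustrative values, not data from the trade set.

```python
def brier(pred, outcome):
    return (pred - outcome) ** 2

def mean_brier(preds, outcomes):
    return sum(brier(p, o) for p, o in zip(preds, outcomes)) / len(outcomes)

# Made-up outcomes: the event hits half the time.
outcomes = [1, 0, 0, 1, 1, 0, 1, 0]

# A constant 0.5 forecast scores (0.5)^2 = 0.25 on every trade, whatever happens.
coin_flip = mean_brier([0.5] * len(outcomes), outcomes)

# A model that insists on 5% while the event hits half the time scores far worse.
overconfident = mean_brier([0.05] * len(outcomes), outcomes)
```

A confident forecast is only rewarded when it is right; when it is confidently wrong on roughly half the trades, it lands well above the 0.25 baseline, which is the pattern the Gaussian shows.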
Why the edge died
An era-split of the P&L was the decisive cut. Nearly 100% of the lifetime loss happened before the 2026-04-21 overconfidence-clamp patch; afterwards the bot was essentially break-even, not profitable. The −$157 total and 48% drawdown on the cumulative chart were old damage, not fresh losses.
| Era | Trades | Avg ensemble prob | P&L | EV/trade |
|---|---|---|---|---|
| Pre-Apr-21 (unclamped Gaussian) | 97 | 0.01-0.14 | −$160.72 | −$1.66 |
| Post-Apr-21 (MAE-σ floor active) | 41 | ~0.10 | +$3.06 | +$0.07 |
The payout math explains the floor at break-even. The narrow brackets hit 62/138 = 44.9% of the time, near coin flips. The realized reward:risk was 0.51 (average win +$3.32, average loss −$6.46), which demands a break-even win rate of 1 / (1 + 0.51) ≈ 66.2%. The bot's actual win rate was 54.3%. You cannot make money betting NO on near-coin-flip events when the payout structure requires a 66% win rate. The Apr-21 MAE-σ floor crudely clamped the model's hit probability up to ~10%, which capped the catastrophic overconfidence but could never manufacture an edge that the market does not contain. Single-degree brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently.
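The break-even arithmetic above can be reproduced in a few lines. This is a sketch using the average win and loss quoted in the text; the function names are illustrative.

```python
def breakeven_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p at which p * avg_win == (1 - p) * avg_loss.

    Equivalent to 1 / (1 + reward_to_risk) with reward_to_risk = avg_win / avg_loss.
    """
    return avg_loss / (avg_win + avg_loss)

def ev_per_trade(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected dollars per trade at a given win rate."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss

required = breakeven_win_rate(3.32, 6.46)     # about 0.66
realized = ev_per_trade(0.543, 3.32, 6.46)    # negative at a 54.3% win rate
```

With the unrounded ratio 3.32 / 6.46 the break-even comes out at about 66.1%; the 66.2% in the text uses the rounded 0.51. Either way, a 54.3% win rate leaves an expected loss of roughly $1.15 per trade, which is in line with the lifetime loss spread over 138 trades.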
The fixes that did and didn't work
Several fixes were tried across the bot's life; the honest accounting is mixed, and one fix was deployed against a bug that never existed.
- Worked: the MAE-σ floor. It stopped the pre-Apr-21 catastrophic bleed by putting a floor under the Gaussian's hit probability. Do not remove it; removing it reproduces the −$160 era. But it produced break-even, not profit.
- Worked: the variable Kalshi fee formula. The flat $0.05 estimate was 2.5-5× too high on mid-priced contracts, which had been silently rejecting trades with 7-9% true edge. The fee correction is real, but it only matters if an edge exists to clear it.
- Worked, narrowly: the hybrid bracket probability max(Gaussian, raw_count±0.5°F), which caught converged-ensemble cases the Gaussian smeared out (Chicago: Gaussian said 31%, raw count showed 74%).
- Didn't work / rejected: a ±2°F NWS-distance guard blocked 7 winners and 0 losses for −$16.19 net; NWS distance is not a predictor of bracket failure. A METAR entry filter was dead code: trades are placed 12-30h before observations become informative.
- The fix against a non-problem: the 2026-04-27 audit blamed a dead OPENMETEO_PROXY node and fixed it. The ensemble pipeline was never dead. That memory entry is now flagged invalid.
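The hybrid bracket probability from the list above can be sketched as follows. The interpretation of "raw_count±0.5°F" as the fraction of ensemble members falling inside the bracket widened by 0.5°F on each side is an assumption, as are the function names and the example members, which are made up for illustration.

```python
import math

def gaussian_bracket_prob(center, sigma, lo, hi):
    """P(lo <= T <= hi) for T ~ Normal(center, sigma)."""
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - center) / (sigma * math.sqrt(2))))
    return max(0.0, cdf(hi) - cdf(lo))

def raw_count_prob(members, lo, hi, pad=0.5):
    """Fraction of ensemble members inside the bracket, padded by `pad` °F (assumed reading)."""
    inside = sum(1 for m in members if lo - pad <= m <= hi + pad)
    return inside / len(members)

def hybrid_bracket_prob(members, center, sigma, lo, hi):
    """Take whichever estimate is larger, so a tightly converged ensemble is not smeared out."""
    return max(gaussian_bracket_prob(center, sigma, lo, hi),
               raw_count_prob(members, lo, hi))

# Made-up converged ensemble: most members sit inside a 70-71°F bracket.
members = [70.1, 70.2, 70.3, 70.4, 69.8, 70.0, 70.6, 72.5, 68.0, 70.9]
gauss = gaussian_bracket_prob(70.2, 2.0, 70.0, 71.0)
hybrid = hybrid_bracket_prob(members, 70.2, 2.0, 70.0, 71.0)
```

With a 2°F sigma the Gaussian spreads its mass well outside a one-degree bracket, while the raw count reflects that eight of the ten example members agree; taking the maximum lets the converged case through.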
The root cause of the misdiagnosis loop was a logging gap. The INSERT INTO trades statement omitted three diagnostic columns (raw_ensemble_probability, model_count, models_used), so every row showed model_count = 1 and a NULL ensemble probability. Three separate audits, an earlier one and the first two passes of this one, read that and concluded the 31-member ensemble was dead. It was not; the columns were simply never written. The bug never cost a cent of P&L, but it cost three audit cycles and one deployed fix on a non-problem. The oscillation is preserved in the writeup rather than smoothed over.
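The failure mode is easy to reproduce. The sketch below uses an in-memory SQLite table with a hypothetical, cut-down schema; the column defaults (in particular `model_count DEFAULT 1`) and all inserted values are assumptions chosen to mirror the symptom described, not the bot's real schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trades (
        ticker TEXT,
        price REAL,
        raw_ensemble_probability REAL,
        model_count INTEGER DEFAULT 1,   -- assumed default, for illustration
        models_used TEXT
    )
""")

# Buggy write: the diagnostic columns are simply not listed,
# so they silently fall back to NULL / the column default.
conn.execute("INSERT INTO trades (ticker, price) VALUES (?, ?)", ("DEMO-1", 0.42))

# Fixed write: every diagnostic column is named and bound explicitly.
conn.execute(
    "INSERT INTO trades (ticker, price, raw_ensemble_probability, model_count, models_used) "
    "VALUES (?, ?, ?, ?, ?)",
    ("DEMO-2", 0.42, 0.31, 31, "gfs_ensemble,nws"),
)

rows = conn.execute(
    "SELECT ticker, raw_ensemble_probability, model_count, models_used "
    "FROM trades ORDER BY ticker"
).fetchall()
```

The first row reads back as a single-model trade with no ensemble probability even though nothing upstream failed, which is exactly the kind of record that makes a healthy pipeline look dead in an audit.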
What I'd do differently
The cheapest safeguard before shipping a bot is to build the evaluator first, in shadow mode, with the gate criteria written down before you look at the numbers, and only then build the strategy. The unbuilt pivot spec encodes that discipline. The first shadow scans immediately exposed why a naive restart would just repeat the bleed: most logged markets were priced at px = $0.01 with the model claiming 0.18-0.64 edge. These are deep-longshot, illiquid contracts where the huge edges are model-overconfidence artifacts, not alpha. Hence the hard liquidity filter (px ≥ $0.10, volume ≥ 20), applied before any EV is computed at all.
- Select the market first. Verify a payout structure can clear a defensible win rate before tuning any model. The pivot kills narrow brackets entirely (MIN_BRACKET_WIDTH = 5.0°F) and only trades threshold markets when |forecast − threshold| ≥ 1.5 × city_MAE, the zones where NWS genuinely beats retail.
- Gate restart on a no-capital evaluation. A pre-committed rule: ≥30 resolved liquid shadow predictions, Brier < 0.25, clearly positive post-fee EV, holding in the highest-volume city/market-type cell (not one lucky cluster). A structural kalshi_place_order() no-op under SHADOW_MODE makes risking capital impossible by construction, not by a single boolean.
- Treat a no-edge market as a stop signal. For a strategy with no edge, not trading is the correct play. There is zero historical data on the wider/threshold markets, so the pivot cannot be backtested; it requires a shadow data-collection window before any capital. With that appetite absent, the bot was retired. That is the right answer, and the framework is what made it defensible.
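The market-selection and liquidity rules described in this section can be written as small pure predicates. This is a sketch under stated assumptions: the thresholds come from the text, but the function names, argument shapes, and the `MAE_MULTIPLE`, `MIN_PRICE`, and `MIN_VOLUME` constant names are hypothetical.

```python
MIN_BRACKET_WIDTH = 5.0   # °F, from the pivot spec
MAE_MULTIPLE = 1.5        # threshold markets need |forecast - threshold| >= 1.5 x city MAE
MIN_PRICE = 0.10          # dollars
MIN_VOLUME = 20           # contracts

def is_liquid(px: float, volume: int) -> bool:
    """Hard liquidity filter, applied before any EV is computed."""
    return px >= MIN_PRICE and volume >= MIN_VOLUME

def bracket_allowed(lo: float, hi: float) -> bool:
    """Reject brackets narrower than the minimum width."""
    return (hi - lo) >= MIN_BRACKET_WIDTH

def threshold_allowed(forecast: float, threshold: float, city_mae: float) -> bool:
    """Only trade threshold markets when the forecast clears the line by a MAE margin."""
    return abs(forecast - threshold) >= MAE_MULTIPLE * city_mae
```

For example, a $0.01 contract fails `is_liquid` regardless of volume, a 70-71°F bracket fails `bracket_allowed`, and a 76°F forecast against a 75°F threshold in a city with a 2°F MAE fails `threshold_allowed`. Keeping these as separate predicates makes it straightforward to log which rule rejected a market during the shadow window.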