Prediction Market Bot Post-Mortem

The bot lost money. This repo is the audit that found out why, and why it took three passes to get there.

What it is

The bot itself (the live strategy, order routing, and API credentials) is not in this repository. What is here is the part that turned out to be the useful artifact: the 138-trade live dataset, the evaluation framework that runs walk-forward backtests against it, and the written investigation that diagnosed the real problem only after three incorrect passes.

The headline numbers: the bot traded Kalshi single-degree (2°F) temperature brackets, betting NO. It lost $160 in 97 trades before an April 2026 calibration patch. The next 41 trades came in at roughly +$3, essentially breakeven. The cumulative chart still showed a drawdown, because 100% of the loss was already locked in before the patch.

The live trade data is committed at data/sample_trades.psv: public Kalshi market tickers, no account IDs, no API tokens, no PII. Every number in the post-mortem is reproducible from that file.

The payout math that settled it

The decisive cut was not a model comparison. It was arithmetic:

Quantity                      Value
Actual bracket hit rate       62/138 = 44.9%
Average NO win                +$3.32
Average NO loss               −$6.46
Realized reward:risk          0.51
Break-even win rate at 0.51   66.2%
Bot's actual win rate         54.3%

You cannot make money betting NO on near-coin-flip events when the payout structure demands a 66% win rate and the bot wins 54% of the time. This is a market-selection problem, not a model problem. Single-degree brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently. No probability model (Gaussian, empirical, ensemble-fit, or otherwise) can produce edge in a market that doesn't have any.
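The break-even figure follows directly from the two averages in the table. A minimal sketch of that arithmetic, where the function names are illustrative and not taken from the repo:

```python
def break_even_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p at which p * avg_win - (1 - p) * avg_loss == 0.

    Both arguments are positive dollar magnitudes.
    """
    if avg_win <= 0 or avg_loss <= 0:
        raise ValueError("avg_win and avg_loss must be positive magnitudes")
    return avg_loss / (avg_win + avg_loss)


def expected_pnl_per_trade(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected dollars per trade at a given win rate."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


avg_win, avg_loss = 3.32, 6.46          # from the table above
reward_risk = avg_win / avg_loss        # about 0.51
p_star = break_even_win_rate(avg_win, avg_loss)
ev = expected_pnl_per_trade(0.543, avg_win, avg_loss)

print(f"reward:risk     = {reward_risk:.2f}")
print(f"break-even rate = {p_star:.1%}")
print(f"EV per trade    = {ev:+.2f} at a 54.3% win rate")
```

Using the unrounded averages gives a break-even rate of about 66.1%; the table's 66.2% comes from plugging in the rounded 0.51 ratio, i.e. 1 / (1 + 0.51). Either way the requirement sits roughly twelve points above the realized win rate, so expected value per trade is negative.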

eval/empirical_analysis.py: walk-forward backtest (empirical vs Gaussian, era-split)
def walk_forward(rows, min_edge):
    """Chronological. For trade i, train empirical model on rows 0..i-1 only.
    Simulates both Gaussian and empirical strategies: bet NO if p_hit <= mkt_p_hit - min_edge.
    """
    res = {"gauss": dict(n=0, wins=0, pnl=0.0, brier=[]),
           "emp":   dict(n=0, wins=0, pnl=0.0, brier=[])}
    usable = [r for r in rows
              if r["nws"] is not None and r["mkt_p_hit"] is not None
              and r["pnl"] is not None and r["cnt"]]
    for i, r in enumerate(usable):
        train = usable[:i]
        if len(train) < 15:
            continue  # warm-up period
        table, base = build_empirical(train)
        pe = emp_p_hit(table, base, r["nws"], r["mid"])
        pg = gaussian_p_hit(r["nws"], r["lo"], r["hi"])
        mp = r["mkt_p_hit"]
        per_contract_no_win = 1.0 - r["mkt_price"]
        for tag, p in (("gauss", pg), ("emp", pe)):
            d = res[tag]
            if p <= mp - min_edge:      # bet NO
                d["n"] += 1
                won = (r["hit"] == 0)
                d["pnl"] += (r["cnt"] * per_contract_no_win) if won else (-r["cost"])
                d["wins"] += 1 if won else 0
                d["brier"].append(brier(p, r["hit"]))
    return res, usable

The three misdiagnoses

A logging gap hid the truth from three consecutive audit passes. The INSERT INTO trades statement listed columns explicitly and omitted three diagnostic fields: raw_ensemble_probability, model_count, and models_used. Every trade row showed model_count = 1 and raw_ensemble_probability = NULL. Any inspector (human or automated) looking at that table concluded the 31-member ensemble pipeline was dead.

It wasn't. The ensemble ran correctly on every trade. The columns were simply never written.

Pass 1 (April 2026 audit): Blamed a dead OPENMETEO_PROXY node. A fix was deployed. For a non-problem.

Pass 2 (May 2026, first pass): Repeated the same 'ensemble pipeline died' error.

Pass 3 (May 2026, second pass): Over-corrected to 'the MAE sigma floor is destroying the signal and causing the bleed'. Also wrong. The era split disproves it.

The honest record of that oscillation is preserved in docs/INVESTIGATION.md. The fix (adding the three missing columns to the INSERT) is a ten-line change recommended regardless of any strategy decision, because it stops the misdiagnosis loop.
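To make the failure mode concrete, here is a small reproduction using an in-memory SQLite database. Everything except the three diagnostic column names is an assumption for illustration: the schema, the column default that produces model_count = 1, the function names, the ticker, and the model names are all hypothetical, and the real bot may use a different database entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the three diagnostic column names come from the write-up.
conn.execute("""
    CREATE TABLE trades (
        ticker TEXT NOT NULL,
        side TEXT NOT NULL,
        price REAL NOT NULL,
        raw_ensemble_probability REAL,
        model_count INTEGER DEFAULT 1,
        models_used TEXT
    )
""")


def log_trade_old(conn, ticker, side, price):
    """Explicit column list that omits the diagnostics, so defaults are stored."""
    conn.execute(
        "INSERT INTO trades (ticker, side, price) VALUES (?, ?, ?)",
        (ticker, side, price),
    )


def log_trade_fixed(conn, ticker, side, price, raw_p, models):
    """Same insert with the three diagnostic fields written explicitly."""
    conn.execute(
        "INSERT INTO trades (ticker, side, price, "
        "raw_ensemble_probability, model_count, models_used) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (ticker, side, price, raw_p, len(models), ",".join(models)),
    )


models = ["model_a", "model_b", "model_c"]  # placeholder names
log_trade_old(conn, "EXAMPLE-TICKER", "NO", 0.62)
log_trade_fixed(conn, "EXAMPLE-TICKER", "NO", 0.62, 0.41, models)

rows = conn.execute(
    "SELECT raw_ensemble_probability, model_count, models_used "
    "FROM trades ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Under these assumptions the first row looks exactly like a dead ensemble even though the ensemble values existed at call time; they were just never passed to the INSERT.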

The evaluation framework

empirical_analysis.py is stdlib-only and reads the committed PSV directly. It runs a chronological walk-forward comparing two probability models at four edge thresholds: Gaussian (NWS MAE-parameterized) and an empirical bracket-hit lookup built from prior trades only. Output includes Brier scores, win rates, per-model P&L, and the empirical P(hit) table binned by forecast distance from bracket midpoint.
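The helpers that walk_forward calls are not reproduced on this page. The sketch below shows one plausible stdlib-only shape for them, matching the call signatures used in the listing. It is not the repo's implementation: the default MAE, the 1°F bin width, the shrinkage toward the base rate, and the treatment of lo/hi as continuous bracket edges are all assumptions.

```python
import math
from collections import defaultdict

DEFAULT_MAE_F = 2.0  # hypothetical NWS forecast MAE in °F


def brier(p, hit):
    """Squared error of a probability forecast against a 0/1 outcome."""
    return (p - hit) ** 2


def sigma_from_mae(mae):
    """For a zero-mean Gaussian error, E|X| = sigma * sqrt(2/pi)."""
    return mae * math.sqrt(math.pi / 2.0)


def _norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_p_hit(nws, lo, hi, sigma=None):
    """P(lo <= temp <= hi) with temp ~ Normal(nws forecast, sigma)."""
    if sigma is None:
        sigma = sigma_from_mae(DEFAULT_MAE_F)
    return _norm_cdf(hi, nws, sigma) - _norm_cdf(lo, nws, sigma)


def _bin(nws, mid, bin_width):
    """Bin index by absolute forecast distance from the bracket midpoint."""
    return int(abs(nws - mid) // bin_width)


def build_empirical(train, bin_width=1.0):
    """Return ({bin: (hits, n)}, overall base hit rate) from prior trades."""
    counts = defaultdict(lambda: [0, 0])
    for r in train:
        c = counts[_bin(r["nws"], r["mid"], bin_width)]
        c[0] += r["hit"]
        c[1] += 1
    table = {b: (h, n) for b, (h, n) in counts.items()}
    total = sum(n for _, n in table.values())
    base = sum(h for h, _ in table.values()) / total if total else 0.5
    return table, base


def emp_p_hit(table, base, nws, mid, bin_width=1.0, prior_weight=5.0):
    """Bin hit rate shrunk toward the base rate; unseen bins return base."""
    h, n = table.get(_bin(nws, mid, bin_width), (0, 0))
    return (h + prior_weight * base) / (n + prior_weight)
```

The shrinkage term is a design choice for small samples: with only a few prior trades per bin, an unsmoothed lookup would emit probabilities of exactly 0 or 1 and dominate the Brier score.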

effective_exposure.py implements the 'effective exposure' discount logic: when a position's current market price indicates it is essentially already won or lost, the bot discounts it from the trading gate's exposure check so capital isn't blocked by quasi-decided bets.

c4_eval.py is the gate-before-restart harness: a pre-committed PASS/FAIL verdict based on shadow-mode Brier score and EV after fees, designed to enforce a no-capital evaluation window before any future strategy re-launch.
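The effective-exposure idea can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of effective_exposure.py: the Position fields, the 0.05/0.95 "decided" thresholds, the full (rather than partial) discount, and the gate_allows helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Position:
    cost: float        # dollars originally put at risk
    side_price: float  # current market price of the side held, in [0, 1]


def effective_exposure(positions, decided_lo=0.05, decided_hi=0.95):
    """Sum the cost of positions the market still treats as undecided.

    A held side priced at or above decided_hi is treated as essentially won,
    at or below decided_lo as essentially lost; both are excluded.
    """
    return sum(
        p.cost for p in positions
        if decided_lo < p.side_price < decided_hi
    )


def gate_allows(positions, new_cost, cap):
    """Exposure check for a new trade, using effective exposure."""
    return effective_exposure(positions) + new_cost <= cap


positions = [
    Position(cost=10.0, side_price=0.50),  # still live
    Position(cost=20.0, side_price=0.98),  # essentially won
    Position(cost=5.0, side_price=0.02),   # essentially lost
]
raw = sum(p.cost for p in positions)
print(raw, effective_exposure(positions), gate_allows(positions, 15.0, cap=30.0))
```

In this example raw exposure is $35, which would block a $15 trade under a $30 cap, while effective exposure is $10 and the trade passes.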

The meta lesson: build c4_eval.py first, in shadow mode, with the gate criteria written down before you look at the numbers. The cheapest audit is the one you run before any capital is at risk.
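A pre-committed gate can be as simple as a frozen set of thresholds plus a verdict function that takes only shadow-mode results. The sketch below is an assumption about shape, not c4_eval.py itself: the GateCriteria fields, the threshold values, and the (p_hit, hit, pnl_after_fees) record layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: criteria cannot be edited after they are written down
class GateCriteria:
    min_trades: int = 50             # hypothetical minimum shadow sample
    max_brier: float = 0.24          # always predicting 0.5 scores 0.25
    min_ev_after_fees: float = 0.0   # dollars per shadow trade


def verdict(shadow, criteria):
    """Return "PASS" or "FAIL" for shadow records of (p_hit, hit, pnl_after_fees)."""
    n = len(shadow)
    if n < criteria.min_trades:
        return "FAIL"  # not enough evidence counts as a failure, not a pass
    mean_brier = sum((p - h) ** 2 for p, h, _ in shadow) / n
    mean_ev = sum(pnl for _, _, pnl in shadow) / n
    ok = (mean_brier <= criteria.max_brier
          and mean_ev > criteria.min_ev_after_fees)
    return "PASS" if ok else "FAIL"


criteria = GateCriteria()
good = [(0.3, 0, 0.50)] * 60    # calibrated enough, positive EV
thin = [(0.3, 0, 0.50)] * 10    # too few trades
bleed = [(0.3, 0, -0.10)] * 60  # fine Brier, negative EV after fees
print(verdict(good, criteria), verdict(thin, criteria), verdict(bleed, criteria))
```

Treating an undersized sample as FAIL keeps the gate conservative: the strategy has to earn a restart rather than default into one.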