Kelly in the Wild: Against the Ensemble
You're watching a prediction market. An event is about to resolve. The market has it at 41%. Your model says 58%. Seventeen points of edge.
You have 828ms.
That's the full cycle: ingest the data, compute the Bayesian posterior, compare it to the market price, execute the order. 828ms average. 1776ms at the p99.
The question isn't just "am I right?" It's: how much do I bet, and do I have time?
Kelly gives you the first answer. But Kelly alone isn't a system. It requires seven ideas working together. Miss any one of them and the formula breaks in a different way. This is what those seven ideas look like, and why each one is load-bearing.
The Edge Equation
The entire game fits in one line:

edge = p̂ − p

Your true probability p̂ minus the market's price p. If the market says 41% and you say 58%, your edge is 17 cents per dollar of payout.
Kelly converts that edge into a stake size:

f* = (p̂ − p) / (1 − p)

For our example: f* = (0.58 − 0.41) / (1 − 0.41) = 0.17 / 0.59 ≈ 28.8% of bankroll.
Set your estimate and the market price. Kelly fraction updates instantly. Notice what happens when they're equal.
Two things jump out. First: the raw EV (p̂ − p) depends only on the gap, not the level. A 60% estimate against a 55% market and a 40% estimate against a 35% market both produce 5 cents of expected value per dollar of face value. But the Kelly fraction is not the same: f* = (p̂ − p)/(1 − p), and the denominator 1 − p shrinks as p rises, so the same EV gap produces a larger Kelly bet at higher probability levels. Level matters for sizing, even when EV is equal.
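Both observations are easy to verify in a few lines. A minimal sketch of the sizing rule (function name is illustrative, not from any real trading stack):

```python
def kelly_fraction(p_hat: float, p_market: float) -> float:
    """Kelly fraction for a binary contract bought at price p_market,
    given a calibrated belief p_hat. Zero or negative means: no bet
    (or take the other side)."""
    return (p_hat - p_market) / (1.0 - p_market)

# Same 5-cent EV gap, different levels -- different stake sizes:
low  = kelly_fraction(0.40, 0.35)   # ~0.077 of bankroll
high = kelly_fraction(0.60, 0.55)   # ~0.111 of bankroll

# The opening example: 17 points of edge at a 41% market.
f = kelly_fraction(0.58, 0.41)      # ~0.288 of bankroll

# No edge, no bet:
flat = kelly_fraction(0.41, 0.41)   # 0.0
```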
Second: when the Kelly fraction is exactly zero. No edge, no bet.
Set p̂ equal to p and Kelly drops to zero automatically. You don't have to decide when to pass on a trade. The math decides for you.
To even write this formula, you need to know p, the market's current price. Which raises the obvious question: how does the market actually set that price?
LMSR: The Market Is a Neural Network
Most prediction markets use LMSR (Logarithmic Market Scoring Rule) as their pricing engine. Once you see how it works, you can't look at a market price the same way again.
The cost function for n outcomes is:

C(q) = b · log( exp(q₁/b) + exp(q₂/b) + … + exp(qₙ/b) )

where q is the vector of outstanding shares for each outcome and b is the liquidity parameter. The instantaneous price of outcome i is the derivative:

pᵢ = ∂C/∂qᵢ = exp(qᵢ/b) / Σⱼ exp(qⱼ/b)
Look familiar? That's the softmax function. The exact same function neural network classifiers use to output probability distributions over classes.
This is not an analogy. The LMSR price function is definitionally softmax(q/b). The market is running the same computation as the output layer of a neural network classifier, with outstanding share quantities playing the role of logits.
The market is literally a softmax layer. It takes outstanding quantities as input and outputs probabilities as prices. When you buy YES shares you increase q_YES, which shifts the softmax output toward YES. Every trader's action feeds into the same aggregation mechanism a neural network uses to weight its outputs.
b controls liquidity depth: a larger b means a flatter curve, so each trade moves the price less. The tradeoff is the market maker's maximum possible loss, which grows as b · log(n). On a binary market, that puts the worst-case loss at b · log 2 ≈ 0.69b.
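The whole mechanism fits in a few lines. A sketch in pure Python (no particular market's API assumed):

```python
import math

def lmsr_prices(q, b):
    """Instantaneous LMSR prices: softmax(q / b).
    q = outstanding shares per outcome, b = liquidity parameter."""
    m = max(x / b for x in q)                      # subtract max for stability
    exps = [math.exp(x / b - m) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def lmsr_cost(q, b):
    """Cost function C(q) = b * log(sum_i exp(q_i / b))."""
    m = max(x / b for x in q)
    return b * (m + math.log(sum(math.exp(x / b - m) for x in q)))

b = 100.0
q = [0.0, 0.0]                       # fresh binary market: 50/50
print(lmsr_prices(q, b))             # [0.5, 0.5]

q_after = [50.0, 0.0]                # buy 50 YES shares
print(lmsr_prices(q_after, b))       # YES price rises above 0.5

# The trader pays the cost difference; the maker's worst-case loss
# on a binary market is b * log(2).
print(lmsr_cost(q_after, b) - lmsr_cost(q, b))
print(b * math.log(2))               # ~69.3
```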
Each trade shifts the softmax distribution. Watch the price curve: the dot is your current position on it.
There's a deeper connection here. LMSR prices are logarithmic. Kelly sizing is logarithmic. Both are doing information accounting. The Kelly growth rate equals the KL divergence between your belief and the market price: a precise measure of how much your information is worth. When the LMSR price equals your p̂, the divergence is zero and Kelly says bet nothing. Zero information, zero edge. The two were built for each other.
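That identity is easy to check numerically. Betting the Kelly fraction f* = (p̂ − p)/(1 − p) at price p multiplies wealth by p̂/p on a win and (1 − p̂)/(1 − p) on a loss, so the expected log growth collapses to the KL divergence. A sketch:

```python
import math

def kelly_growth(p_hat, p):
    """Expected log growth per bet at the Kelly fraction."""
    f = (p_hat - p) / (1 - p)
    win  = 1 + f * (1 - p) / p        # wealth multiplier on a win at price p
    lose = 1 - f                      # wealth multiplier on a loss
    return p_hat * math.log(win) + (1 - p_hat) * math.log(lose)

def kl(p_hat, p):
    """KL divergence between belief and market price (binary case)."""
    return (p_hat * math.log(p_hat / p)
            + (1 - p_hat) * math.log((1 - p_hat) / (1 - p)))

print(kelly_growth(0.58, 0.41))       # ~0.058 per bet
print(kl(0.58, 0.41))                 # the same number
```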
The connection is quantitative. The maximum profit you can extract from a mispriced market has a closed-form expression. For LMSR it is b times the KL divergence between the market's current prices and the closest set of prices that would be impossible to arbitrage:

max extractable profit = b · D_KL(p̃ ‖ p)

where p is the current LMSR price vector and p̃ is the nearest arbitrage-free price distribution (the point where no trade can produce guaranteed profit). The upper bound on what you can extract is fully determined by how far the current prices are from that point. "I have edge" and "my KL divergence from the market is positive" are the same claim.
So now you're competing against a machine that prices beliefs via softmax and updates continuously as every other trader acts. Your only advantage is a better p̂. Which makes the accuracy of your estimate the whole game.
But "accurate" doesn't mean "high confidence." It means calibrated.
Calibration
Think about the last ten times you said you were 90% sure about something. Were you right nine times out of ten? Most people aren't. They're right maybe six or seven. Not because they're stupid but because confidence feels internal and accuracy is external, and the two drift apart without a feedback loop.
Kelly doesn't know about this. Feed it overconfident estimates and it sizes too big and slowly drains your account. The damage is non-linear: on a small-edge bet, even a modest overestimate can flip you from a profitable strategy to a losing one entirely.
The fix is tracking. Every forecast gets a record. You build calibration curves over months. You learn your own bias and correct for it before it hits the formula. Superforecasters don't have better instincts. They have better feedback loops.
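The tracking loop itself is minimal. A sketch of the record-keeping (the record format and bin count are assumptions, not any standard tool):

```python
from collections import defaultdict

def calibration_curve(records, n_bins=10):
    """records: list of (forecast_probability, outcome) pairs, outcome in {0, 1}.
    Returns {bin_midpoint: (mean_forecast, hit_rate, count)}."""
    bins = defaultdict(list)
    for p, hit in records:
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append((p, hit))
    out = {}
    for b, items in sorted(bins.items()):
        ps = [p for p, _ in items]
        hits = [h for _, h in items]
        out[(b + 0.5) / n_bins] = (sum(ps) / len(ps),
                                   sum(hits) / len(hits),
                                   len(items))
    return out

# A forecaster who says "90%" but is right only 7 times in 10
# shows up immediately as a gap between the two columns.
records = [(0.9, 1), (0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0),
           (0.9, 1), (0.9, 1), (0.9, 0), (0.9, 1), (0.9, 1)]
print(calibration_curve(records))
```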
But there's a second way calibration fails. Your markets can individually sum to $1.00 while still being collectively wrong. If Republicans winning Pennsylvania by 5+ points is priced at 32% and Trump winning Pennsylvania is at 48%, the logic is broken: one implies the other. No single-market calibration loop catches this. A 2024 election analysis found 1,576 dependent market pairs where exactly this kind of inconsistency existed. Median mispricing: $0.60. Markets regularly 40% wrong, per-market calibration scores looking fine the whole time. You need consistency checking across your full portfolio of beliefs, not just accuracy tracking per market.
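The consistency check is mechanical once you declare the implications between markets. A sketch with hypothetical names and prices (the constraint: if outcome A implies outcome B, A can never be priced above B):

```python
# Hypothetical prices; ("a", "b") in implications means: outcome a implies b.
prices = {
    "win_by_5plus": 0.52,
    "win":          0.48,   # inconsistent: the implied event is priced lower
}
implications = [("win_by_5plus", "win")]

def coherence_violations(prices, implications):
    """Flag pairs where P(A) > P(B) despite A implying B.
    Returns ((a, b), gap) for each violation; the gap is a lower
    bound on the mispricing."""
    out = []
    for a, b in implications:
        gap = prices[a] - prices[b]
        if gap > 0:
            out.append(((a, b), gap))
    return out

print(coherence_violations(prices, implications))
# one violation: the pair is mispriced by at least $0.04
```

Per-market calibration scores never see this; only a check across the portfolio of beliefs does.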
Calibration also connects to why the Kelly criterion is formulated the way it is. The formula assumes p̂ is your true probability. If it isn't, all bets are off, literally.
Which raises the next question: what exactly does Kelly optimize? Knowing that clarifies why miscalibration is so expensive.
Log Utility
Say you're running $1,000. You size aggressively, lose 50% on a bad call. You're at $500. To recover you don't need a 50% gain. You need a 100% gain. Not the same thing.
This is the asymmetry that maximizing expected wealth misses. It treats a +50% and a −50% as symmetrical. They aren't. Log utility captures the difference precisely: log(1.5) ≈ +0.405 but log(0.5) ≈ −0.693. A 50% loss hurts more than a 50% gain helps, and the formula should reflect that.
Kelly sizing is exactly what you get when you maximize expected log wealth. Not a heuristic. The closed-form solution. Everything else about the criterion follows from this, including why calibration errors hurt so much: an error in p̂ distorts the log-wealth calculation at the root.
So the formula is clean. The problem is getting a good p̂ before the market moves.
Real-Time Bayesian Updating
All of the above assumes you already have p̂. In a fast market, getting p̂ in time is the actual problem.
Here's how the 828ms breaks down on average: 120ms of data ingestion, 15ms for the Bayesian update, 3ms for the comparison against the market price, and 690ms for order execution. Order execution dominates. That 690ms is where most of the window lives.
The cycle is tight. At p99 you're at 1776ms, and on fast markets the resolution window can close before that. The Bayesian update has to be fast and it has to be correct. You don't get to rerun it.
The update rule is sequential. For a stream of signals s₁, s₂, …:

log O_t = log O_{t−1} + log [ P(s_t | YES) / P(s_t | NO) ]

where O_t is the posterior odds after signal t, and the posterior probability is O_t / (1 + O_t).
Why log-space? Numerical stability at speed. Probabilities near 0 or 1 underflow in floating point at standard precision. Log-probabilities don't. At 15ms per update cycle, you cannot afford numerical edge cases in production.
Each signal multiplies the odds by its likelihood ratio. A bullish signal with likelihood ratio LR > 1 adds log LR to the log-odds. Signals accumulate additively in log-space rather than multiplicatively in probability space. Simple, fast, and stable.
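In code, the accumulation is a running sum. A sketch (the likelihood ratios are illustrative):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

log_odds = logit(0.50)                 # prior: no information
for lr in [2.0, 1.5, 0.8]:             # likelihood ratios of incoming signals
    log_odds += math.log(lr)           # each signal adds log(LR)

p_hat = sigmoid(log_odds)              # back to probability space
print(round(p_hat, 3))                 # prior odds 1 * 2.4 -> p = 2.4/3.4 ~ 0.706
```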
But there's a second thing to watch in the update stream. Signal reliability isn't constant. After a major information event (a large poll drop, a breaking news item), every subsequent signal is noisier: more actors reacting, more order flow, higher variance on what the likelihood ratio actually is. A fixed-LR update that worked in a quiet market will overcount evidence in a volatile one. The fix is adaptive: weight each signal proportionally to the inverse of the recent variance of the signal stream. When volatility spikes, trust each individual signal less. Volatility in financial returns clusters the same way: this is why models like GARCH exist, to track time-varying noise levels rather than assuming a constant signal quality. The same logic applies here, not to returns, but to the reliability of your incoming information. The p99 latency issue and this one push in the same direction: when information is arriving fast and the market is moving, your posterior is less certain than it looks. Size accordingly.
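One way to implement that down-weighting, as a sketch: shrink each signal's log-likelihood-ratio contribution by the inverse of the recent variance of the stream. The window size and the baseline variance are assumptions, not from any reference implementation:

```python
import math
from collections import deque
from statistics import pvariance

class AdaptiveUpdater:
    """Log-odds updater that trusts each signal less when the recent
    stream is noisy. Weight = baseline_var / recent_var, capped at 1."""

    def __init__(self, prior=0.5, window=20, baseline_var=0.05):
        self.log_odds = math.log(prior / (1 - prior))
        self.recent = deque(maxlen=window)       # recent log-LRs
        self.baseline_var = baseline_var         # assumed quiet-market variance

    def update(self, lr):
        x = math.log(lr)
        self.recent.append(x)
        if len(self.recent) >= 2:
            v = max(pvariance(self.recent), 1e-12)
            w = min(1.0, self.baseline_var / v)  # volatility spike -> small w
        else:
            w = 1.0
        self.log_odds += w * x                   # shrunken evidence
        return 1 / (1 + math.exp(-self.log_odds))
```

In a quiet stream the weight stays at 1 and this reduces to the plain update; after a burst of wildly varying likelihood ratios, the same signal moves the posterior much less.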
Each click applies a new signal. The curve tracks your log-posterior over time. Notice how the prior (50%) anchors you before any signals arrive.
The danger lives in the p99. At 1776ms you frequently miss your entry window. If you size at full Kelly and miss half your entries, the strategy's expected value is now wrong: you're sampling biased toward the trades you could execute, not the ones you calculated Kelly for.
You've now got the mechanics. But there's a deeper reason why all of this matters so much more than it looks on paper.
Ergodicity
There's something missing from the Kelly derivation you'll usually see. The standard proof goes: imagine 1000 agents betting simultaneously, Kelly maximizes expected log wealth across the ensemble. But you're not 1000 agents.
You're one person running one account through time.
That distinction is everything. And it's not obvious why until you look at a concrete example.
Suppose you flip a coin repeatedly, betting 40% of your wealth each time. Heads: gain 50%. Tails: lose 40%. The expected value per flip is positive: 0.5 × 1.5 + 0.5 × 0.6 = 1.05, a 5% expected gain. A statistician looking at the ensemble says: great, positive EV, keep playing.
But compound this long enough and your bankroll reliably trends toward zero. The geometric mean per flip is √(1.5 × 0.6) = √0.9 ≈ 0.949, which is less than 1. After 100 flips, the average bankroll across 1000 parallel players is up. But the median player has essentially nothing. The ensemble average is dragged up by a few lucky survivors on fat right tails. The typical path is ruin.
This is what non-ergodicity means. The time average and the ensemble average diverge. When they diverge, optimizing for the ensemble gets you killed on the actual path you walk.
It's not a toy problem. You can simulate it yourself: start 10,000 agents at $1,000, run 200 rounds of this coin flip, and look at the distribution. The mean bankroll is high. The median is near zero. A handful of agents are millionaires. The rest are broke. The ensemble metric says "great strategy." The median agent says "I'm ruined."
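That simulation is a few lines of stdlib Python, seeded so the numbers are reproducible:

```python
import random
from statistics import mean, median

rng = random.Random(42)
AGENTS, ROUNDS, START = 10_000, 200, 1_000.0

wealth = []
for _ in range(AGENTS):
    w = START
    for _ in range(ROUNDS):
        w *= 1.5 if rng.random() < 0.5 else 0.6   # +50% or -40% per flip
    wealth.append(w)

print(f"mean:   ${mean(wealth):,.2f}")      # dragged up by a few lucky paths
print(f"median: ${median(wealth):,.6f}")    # essentially zero
print(f"below start: {sum(w < START for w in wealth) / AGENTS:.0%}")
```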
There's a third level beyond this. When enough traders run the same model, their behavior synchronizes. In 2000, the Millennium Bridge in London opened. Pedestrians individually optimized their balance on the swaying structure. That individual optimization produced collective lateral oscillation that nearly brought the bridge down: each person's rational local adjustment reinforced everyone else's. Quantitative trading systems do the same thing. Enough funds running the same signal means they withdraw liquidity simultaneously. Prices gap in ways no individual model predicted. You can be running a perfectly ergodic Kelly strategy and still get hit by ensemble dynamics you aren't part of but which move the market you're trading in.
Kelly is optimal specifically for the time-average case. It maximizes growth on your one trajectory through time, not the average across hypothetical parallel ones. The reason Kelly uses logarithms isn't aesthetic. It's because log-wealth is the quantity that is ergodic: its time average and ensemble average converge. Maximizing expected log-wealth and maximizing the growth rate of a single compounding path are the same problem.
The Bayesian updating matters because of this. You're not choosing a fixed strategy at the start and running it 1000 times. You're adapting a single path as new information arrives. Each 828ms cycle is one update on your irreversible trajectory. Get it wrong and there's no ensemble to average over. The path forks once, and you're on it.
The implication for position sizing is stark: any bet size above Kelly is non-ergodic in the bad direction. Not "slightly suboptimal." Reliably ruinous over time, even when the bet has positive expected value.
Drawdown
Full Kelly maximizes long-run growth rate. It does not minimize drawdown.
On a binary bet with slightly wrong calibration, full Kelly can produce 40-50% drawdowns before recovering. Recovery math is brutal: a 50% drawdown requires a 100% gain. The log-utility asymmetry that Kelly was designed to capture now works against you when p̂ is off.
Half the bet size gives about 75% of the long-run growth rate and only 25% of the variance. For anyone with imperfect calibration (everyone), this tradeoff strongly favors half-Kelly.
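The tradeoff can be checked directly from the growth-rate curve. A sketch using the running example's numbers (p̂ = 0.58, p = 0.41):

```python
import math

def growth(f, p_hat, p):
    """Expected log growth when betting fraction f at price p."""
    return (p_hat * math.log(1 + f * (1 - p) / p)
            + (1 - p_hat) * math.log(1 - f))

p_hat, p = 0.58, 0.41
f_star = (p_hat - p) / (1 - p)           # full Kelly ~ 0.288

g_full = growth(f_star, p_hat, p)
g_half = growth(f_star / 2, p_hat, p)
print(g_half / g_full)                   # ~0.75 of the full-Kelly growth rate

# Overshooting is worse than undershooting: at 2x Kelly the growth
# rate is roughly zero (slightly negative here); beyond that, ruin.
print(growth(2 * f_star, p_hat, p))
```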
In fast markets the problem compounds. Uncertainty about p̂ is higher because information arrives faster than you can fully process it. The 828ms constraint means your posterior is always slightly stale. Kelly is sensitive to errors in p̂: overestimate your edge by 10 percentage points and full Kelly dramatically oversizes.
Drawdown from sizing is the one you control. But there's a second kind you don't: execution drawdown. Polymarket uses a central limit order book (CLOB): trades are matched one at a time from a queue of posted orders, not executed atomically as a bundle. If you're arbitraging a mispricing that spans two related markets, you have to send two separate orders. The first fills. The second market reprices against you before you can send it. The numbers are specific: in one large-scale arbitrage study, 48% of execution failures came from insufficient simultaneous liquidity across both legs, and 31% from price movement between the first and second fill. A perfectly sized, perfectly calibrated bet produces a loss. The math was right. The market moved in between.
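A toy model of the two-leg failure mode. Everything here is hypothetical: the prices, the repricing gap, and the assumption that the arb buys two complementary legs whose prices sum below $1 (guaranteed $1 payout if both fill at quote):

```python
def two_leg_arb(leg1_price, leg2_price, leg2_actual_fill, size):
    """Two-leg arbitrage on a CLOB: leg 1 fills at its quoted price,
    but leg 2 may reprice before the second order lands.
    Returns (expected P&L, realized P&L) per `size` shares."""
    expected = size * (1.0 - (leg1_price + leg2_price))
    realized = size * (1.0 - (leg1_price + leg2_actual_fill))
    return expected, realized

# The model sees $0.03 of free money per share...
expected, realized = two_leg_arb(0.41, 0.56, 0.61, size=1000)
print(expected)   # ~ +$30 if both legs fill at quote
print(realized)   # ~ -$20 after leg 2 repriced by 5 cents
```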
The warning above wasn't paranoia. It was arithmetic.
Seven ideas. All pointing the same direction.
Closing
Go back to the opening. The market is at 41%. Your model says 58%. You have 828ms.
Now you know what all of that means.
The 41% is a softmax output, aggregated across every trade that has hit the LMSR book. Your 58% is a Bayesian posterior, log-updated in 15ms on incoming signals. The 17-point gap is the edge: p̂ − p = 0.58 − 0.41 = 0.17, which Kelly converts to 28.8% of bankroll, capped at half-Kelly because your calibration is imperfect and your posterior is slightly stale. You have 120ms to ingest the data, 15ms to think, 3ms to compare against the market price, and 690ms to get the order in.
Of those 828ms, 18 are computation. The rest is plumbing.
And this is one path through time. There is no ensemble to bail you out if you oversize. There is no second run. Log utility, ergodicity, calibration, drawdown: they all point at the same thing. Respect the asymmetry. The upside and the downside are not mirror images of each other.
Kelly is the entry point. The 828ms is where it becomes real.