Methodology

How we compute consensus probabilities and the peer-reviewed research behind each model.

Overview

We compute six different consensus models on every multi-platform question. Each model takes a different theoretical approach to combining probabilities. Model B (Geometric Mean of Odds) is our default because it has the strongest empirical validation in the forecast aggregation literature, but we report all six so users and researchers can compare.

Each platform contributes one probability per question. Same-platform duplicates are merged via volume-weighted averaging before any model runs, so no platform can vote twice.

Model B — Geometric Mean of Odds DEFAULT

What it does: Converts each platform's probability to log-odds, takes the simple arithmetic mean of those log-odds, and converts back to a probability. Equal weight across all platforms.

Formula: consensus = sigmoid( mean( log(p / (1 - p)) ) )

Why it's the default: In the IARPA ACE forecasting tournament — one of the largest controlled comparisons of probability aggregation methods to date — logit pooling consistently outperformed linear averaging, especially in the tails (probabilities near 0 or 1). Linear averages over-moderate extreme forecasts; log-odds pooling preserves them. Satopää et al. (2014) formalized this and found logit pooling to be the most accurate aggregation rule across thousands of geopolitical questions.
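
The formula can be sketched in a few lines of Python (a minimal illustration; the function name and clipping constant are ours, not part of any platform API):

```python
import math

def geometric_mean_of_odds(probs, eps=1e-6):
    """Model B sketch: arithmetic mean in log-odds space, mapped back
    through the sigmoid. Equivalent to the geometric mean of the odds."""
    clipped = [min(max(p, eps), 1 - eps) for p in probs]  # guard against log(0)
    mean_logit = sum(math.log(p / (1 - p)) for p in clipped) / len(clipped)
    return 1 / (1 + math.exp(-mean_logit))
```

Note the tail behavior: for inputs 0.01 and 0.02 the linear mean is 0.015, while the log-odds mean lands near 0.0142, slightly more extreme (closer to 0), which is the tail-preservation property described above.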

Strengths: Strongest empirical validation. Handles extreme probabilities without over-shrinking. Symmetric in p and 1−p.

Weaknesses: Equal weighting ignores volume/liquidity differences between platforms. Sensitive to outliers if one platform is far from the others.

Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356. doi:10.1016/j.ijforecast.2013.09.009

Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145. doi:10.1287/deca.2014.0293

Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794

Model A — Liquidity-Weighted Linear

What it does: A weighted arithmetic average where each platform's weight equals its liquidity (or volume if liquidity is unavailable). Real-money platforms with deep order books dominate.

Formula: consensus = sum(p_i * w_i) / sum(w_i) where w_i is liquidity.

Rationale: Higher-liquidity markets are presumed to have more informed prices because larger trades are required to move them. Volume weighting is a standard technique in financial price aggregation (Hayek 1945; Wolfers & Zitzewitz 2004 on prediction-market efficiency).
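
A sketch of the weighted average (illustrative only; `weights` stands in for the per-platform liquidity figures):

```python
def liquidity_weighted_linear(probs, weights):
    """Model A sketch: arithmetic mean weighted by liquidity (or volume)."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total
```

For example, probabilities [0.2, 0.8] with liquidity weights [3, 1] give 0.35: the deeper market pulls the consensus toward its own price.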

Strengths: Reflects "skin-in-the-game" weighting. Naturally down-weights play-money platforms.

Weaknesses: Linear averaging over-shrinks tail probabilities (Baron et al. 2014). Liquidity is not the same as accuracy — a single liquid but biased market dominates the consensus. Volume weighting in forecast aggregation has not been validated as more accurate than equal weighting in controlled comparisons.

Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126. doi:10.1257/0895330041371321

Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429. doi:10.1016/j.econlet.2006.01.004

Model C — Threshold Imputation NOVEL

What it does: For multi-outcome events, treats a missing candidate on a platform as information — the platform implicitly assigns it less than the listing threshold. We impute the missing value at threshold / 2 (where the threshold is the lowest listed candidate's probability on that platform), then compute a normalized average across all platforms.

Why this matters: Missing candidates in prediction markets are not missing-completely-at-random (MCAR). Per Rubin's (1976) classification they are missing not at random (MNAR): the probability of being missing is related to the value itself (low-probability candidates are not listed). Standard mean imputation is biased under MNAR. Treating absence as a low-probability signal is one way to account for this without ignoring the missing data entirely.

Strengths: Captures information that "this platform considered the candidate not worth listing." Specifically designed for multi-outcome elections where candidate sets differ across platforms.

Weaknesses: The threshold / 2 imputation point is a heuristic, not derived from data. Has not been validated against resolution outcomes.

Why we think it will work:

  1. Listing thresholds carry information. Prediction market operators do not list every conceivable candidate — they list the ones with non-trivial trading interest. The decision to not list a candidate is a real-world signal that the operator (and the bettors who would have to provide order flow) consider the probability too low to support a market. Throwing away that signal — the way Model D does — treats the absence as if the platform never had an opinion, when in fact it had a strong negative one.
  2. Threshold/2 is a defensible point estimate under MNAR. Rubin's (1976) classification tells us standard imputation is biased here, but it does not tell us to ignore the data. In a one-sided censoring problem — where missing always means "below some cutoff" — the midpoint between zero and the cutoff is the mean of the maximum-entropy (uniform) distribution over [0, threshold], the natural point estimate given only that the value lies somewhere in that interval. We are not claiming it is the true value; we are claiming it is the least-biased point estimate consistent with what the platform's listing decision tells us.
  3. It self-corrects across platforms. A candidate listed on Polymarket at 4% but missing from Manifold and PredictIt gets imputed at roughly 0.5% on the missing platforms. After normalization across all candidates this pulls the consensus down toward the platforms that took the candidate seriously, which is the correct direction in MNAR.
  4. Empirical falsification path. If the threshold/2 imputation is wrong, Model D (no imputation) will produce a lower Brier score than Model C on resolved low-probability candidates. This is testable, which is why we run both side by side.
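
The imputation and normalization steps can be sketched as follows (a simplified illustration; the input shape and names are ours, and the production pipeline's threshold detection may differ):

```python
def threshold_imputed_consensus(platforms):
    """Model C sketch: `platforms` is a list of {candidate: probability}
    dicts, one per platform. A candidate missing from a platform is imputed
    at half that platform's lowest listed probability (threshold / 2), then
    per-candidate means are normalized to sum to 1."""
    candidates = set()
    for listing in platforms:
        candidates |= listing.keys()
    raw = {}
    for c in candidates:
        vals = []
        for listing in platforms:
            if c in listing:
                vals.append(listing[c])
            else:
                vals.append(min(listing.values()) / 2)  # threshold / 2 imputation
        raw[c] = sum(vals) / len(vals)
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```

A candidate listed only on one platform still receives a small but nonzero contribution from the platforms that declined to list it, which is exactly the negative signal the model is designed to capture.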

Status: This model is our novel contribution. There is no direct published precedent. The MNAR framing comes from Rubin (1976); the broader missing-data literature (Little & Rubin 2019) provides the theoretical grounding. We treat the threshold/2 choice as a hypothesis to be tested against resolution data, not as a settled answer.

Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592. doi:10.1093/biomet/63.3.581

Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. doi:10.1002/9781119482260

Model D — Equal Weight Linear (MNAR-safe baseline)

What it does: Plain arithmetic average across platforms that explicitly listed the candidate. No imputation for missing candidates. Normalized so all candidates in a multi-outcome event sum to 100%.

Formula: consensus = mean(p_i) over listed platforms only.

Rationale: The simplest possible aggregation rule. We include it specifically as a reference baseline — because it does no imputation, it has zero MNAR bias from imputation choices, making it useful for measuring how much imputation in other models actually changes the consensus.
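
A minimal sketch (assuming the candidates of one event have already been aligned across platforms; names are illustrative):

```python
def equal_weight_consensus(event):
    """Model D sketch: `event` maps candidate -> list of probabilities from
    the platforms that list it. Mean over listing platforms only, then
    normalized so all candidates sum to 1."""
    raw = {c: sum(ps) / len(ps) for c, ps in event.items()}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```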

Strengths: Trivially explainable. Zero imputation bias. Standard baseline in the forecast combination literature (Clemen 1989; Genest & Zidek 1986).

Weaknesses: Linear averages over-shrink tails (Baron et al. 2014). Ignores volume/liquidity. Treats absent platforms as if they don't exist, even when their absence is informative.

Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583. doi:10.1016/0169-2070(89)90012-5

Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135. doi:10.1214/ss/1177013825

Model E — Bayesian Track Record PENDING DATA

What it does: Weights each platform by its historical accuracy on resolved markets. Accuracy is measured by Brier score — the squared difference between the platform's final probability and the realized outcome (Brier 1950). Lower Brier = more accurate = higher weight in the consensus. Per-category Brier scores allow the weighting to differ between sports, politics, and economics.

Formula: Bayesian Model Averaging-style weighting, with each platform's weight proportional to an exponential score of its historical accuracy: w_i ∝ exp(-BrierScore_i / temperature), normalized so the weights sum to 1.

Rationale: Bayesian Model Averaging is the principled way to combine forecasts when you have evidence about each forecaster's reliability (Hoeting et al. 1999; Raftery et al. 2005). Tetlock's Good Judgment Project demonstrated that small groups of accurate forecasters ("superforecasters") consistently beat the crowd; this model is the platform-level analogue.
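
The weighting step amounts to a softmax over negative Brier scores, sketched below (illustrative; the temperature value shown is a placeholder, not a production setting):

```python
import math

def track_record_weights(brier_scores, temperature=0.1):
    """Model E weighting sketch: softmax over negative Brier scores.
    Lower Brier (more accurate) -> larger weight; a smaller temperature
    concentrates weight on the most accurate platforms."""
    raw = {p: math.exp(-b / temperature) for p, b in brier_scores.items()}
    total = sum(raw.values())
    return {p: v / total for p, v in raw.items()}
```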

Status: Activation pending. We are collecting resolution data on every market we track. Activation requires ~100 resolved questions per platform per category. Until then this model returns "n/a" and the default remains Model B.

Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401. doi:10.1214/ss/1009212519

Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174. doi:10.1175/MWR2906.1

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255

Model F — Volume-Weighted Log-Odds (experimental hybrid) UNTESTED

What it does: A hybrid of Model B and Model A. Converts each probability to log-odds (Model B's transform), then takes a weighted average using sqrt(volume) as the weight (rather than equal weighting). Square root prevents one mega-volume market from dominating.

Formula: consensus = sigmoid( sum(sqrt(v_i) * logit(p_i)) / sum(sqrt(v_i)) )

Hypothesis: Combine the tail-preservation of log-odds pooling (Satopää et al. 2014) with the informed-price intuition of volume weighting (Wolfers & Zitzewitz 2004). Both components are independently studied; this specific combination is not.
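
A sketch of the hybrid (illustrative names; probabilities are clipped before the logit transform, as in the Model B sketch):

```python
import math

def volume_weighted_log_odds(probs, volumes, eps=1e-6):
    """Model F sketch: sqrt(volume)-weighted mean in log-odds space,
    mapped back through the sigmoid."""
    weights = [math.sqrt(v) for v in volumes]
    logits = []
    for p in probs:
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        logits.append(math.log(p / (1 - p)))
    mean_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-mean_logit))
```

With equal volumes this reduces to Model B; with very unequal volumes the consensus moves toward the high-volume market's price, but only at the square-root rate.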

Why we think it might work:

  1. Both inputs are individually validated. Log-odds pooling has the strongest tournament evidence for tail-preservation (Satopää et al. 2014). Volume/liquidity is the most informative auxiliary signal we have about which platform's price is more credible (Wolfers & Zitzewitz 2004 on prediction-market efficiency; Manski 2006 on the relationship between price and informed belief). The hybrid asks: if both signals are individually useful, does combining them in the principled way — weight in the log-odds domain rather than the probability domain — do better than either alone?
  2. Log-odds is the right space to weight in. Log-odds is the additive scale for probabilistic evidence (Good 1950 on weight-of-evidence). Weighting probabilities directly (as Model A does) compresses the tails because the underlying scale is non-linear. Weighting log-odds preserves the tail-preservation property of Model B even when the weights are unequal — the math works out cleanly because log-odds is closed under weighted averaging.
  3. Square-root weighting is a known regularizer. Using sqrt(volume) rather than raw volume prevents one mega-volume Polymarket market from drowning out three medium-volume markets. The square root is a concave dampening transform: it shrinks the influence of very large weights while preserving their ordering, the same logic by which TF-IDF dampens raw term counts with a log.
  4. If volume really is a quality signal, this should beat Model B. Model B treats a $50M Polymarket market and a $5K Manifold market as equal voters. Either that is correct (volume is uninformative about accuracy — Model B wins) or it is incorrect (volume carries signal — Model F wins). We do not know which a priori, which is exactly why this is an experiment.

How we will know if it worked: Once we have ~100 resolved questions per platform we will compute Brier scores for both Model B and Model F on the same resolution set. Lower Brier wins. If Model F beats Model B by a statistically meaningful margin (paired Diebold-Mariano test, p < 0.05) we promote it to default. If not, we retire it.
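
The scoring rule used for that comparison is straightforward (a minimal sketch over (probability, outcome) pairs; the Diebold-Mariano significance test is not shown):

```python
def mean_brier(pairs):
    """Mean Brier score over (final probability, resolved outcome) pairs,
    where the outcome is 1 if the event happened and 0 otherwise.
    Lower is better; a perfect forecaster scores 0."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)
```

Comparing Model B and Model F means computing this over the same resolution set for both and testing whether the per-question differences are significantly different from zero.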

Status: Experimental. The components have separate peer-reviewed empirical validation but the specific hybrid has no published precedent. We compute it on every question so that once we have enough resolution data we can test whether it actually beats Model B in Brier score.

Satopää et al. (2014) and Wolfers & Zitzewitz (2004) — full citations under Models B and A above.

Quality Assurance Pipeline

Before any model runs, every question goes through a 5-phase matching and validation pipeline:

  1. Parent event matching: Multi-outcome events (e.g., "2028 GOP nominee") are matched at the event level first, so candidates can be aligned across platforms.
  2. Candidate alignment: Within matched parent events, individual candidates are paired across platforms using slug and title similarity.
  3. LLM-verified binary matching: For binary questions, candidate pairs are verified by an LLM (Claude Haiku) using the title and resolution criteria.
  4. Star graph merge: Prevents false transitive merges (e.g., A↔B and B↔C should not always imply A↔C). Uses a connected-component check on the match graph.
  5. Opus validation: Final pass with a stronger LLM (Claude Opus) catches inverted phrasings ("Trump wins" vs "Trump loses"), scope mismatches, and timeframe mismatches.

Same-platform duplicates — e.g., a market relisted under a slightly different title — are merged using volume-weighted averaging before consensus is computed. Each platform contributes exactly one vote per question.

Full reference list

Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145.

Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3.

Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583.

Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135.

Good, I. J. (1950). Probability and the Weighing of Evidence. Charles Griffin.

Hayek, F. A. (1945). "The Use of Knowledge in Society." American Economic Review, 35(4), 519–530.

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401.

Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.

Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429.

Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281.

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115.

Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174.

Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592.

Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356.

Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126.

Odds Raven

Cross-platform prediction market consensus, grounded in the peer-reviewed forecast aggregation literature.

© 2026 Odds Raven · Data refreshed every 6 hours · peer-reviewed methodology