How we compute consensus probabilities and the peer-reviewed research behind each model.
We compute six different consensus models on every multi-platform question. Each model takes a different theoretical approach to combining probabilities. Model B (Geometric Mean of Odds) is our default because it has the strongest empirical validation in the forecast aggregation literature, but we report all six so users and researchers can compare.
Each platform contributes one probability per question. Same-platform duplicates are merged via volume-weighted averaging before any model runs, so no platform can vote twice.
What it does: Converts each platform's probability to log-odds, takes the simple arithmetic mean of those log-odds, and converts back to a probability. Equal weight across all platforms.
Formula: consensus = sigmoid( mean( log(p / (1 - p)) ) )
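A minimal sketch of the computation (function and variable names are illustrative, not our production code):

```python
import math

def logit(p: float) -> float:
    """Log-odds transform; assumes 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def consensus_log_odds(probs: list[float]) -> float:
    """Equal-weight log-odds pooling (equivalent to the geometric mean of odds)."""
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    return sigmoid(mean_logit)

# Example: platforms at 90%, 97%, and 99% pool to ~0.968,
# versus the linear average's more moderate 0.953.
```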
Why it's the default: In the IARPA ACE forecasting tournament — one of the largest controlled comparisons of probability aggregation methods ever run — logit pooling consistently outperformed linear averaging, especially in the tails (probabilities near 0 or 1). Linear averages over-moderate extreme forecasts; log-odds pooling preserves them. Satopää et al. (2014) formalized this and showed logit aggregation to be among the most accurate rules across thousands of geopolitical questions.
Strengths: Strongest empirical validation. Handles extreme probabilities without over-shrinking. Symmetric in p and 1−p.
Weaknesses: Equal weighting ignores volume/liquidity differences between platforms. Sensitive to outliers if one platform is far from the others.
Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356. doi:10.1016/j.ijforecast.2013.09.009
Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145. doi:10.1287/deca.2014.0293
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794
What it does: A weighted arithmetic average where each platform's weight equals its liquidity (or volume if liquidity is unavailable). Real-money platforms with deep order books dominate.
Formula: consensus = sum(p_i * w_i) / sum(w_i) where w_i is liquidity.
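A minimal sketch, assuming liquidity values are already expressed in a common currency unit (names are illustrative):

```python
def consensus_liquidity_weighted(probs: list[float], weights: list[float]) -> float:
    """Arithmetic mean weighted by liquidity (or volume as fallback).
    probs[i] and weights[i] come from the same platform."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one platform must have positive liquidity")
    return sum(p * w for p, w in zip(probs, weights)) / total

# Example: a $500k market at 60% and a $10k market at 40%
# yield (0.60 * 500_000 + 0.40 * 10_000) / 510_000 ≈ 0.596.
```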
Rationale: Higher-liquidity markets are presumed to have more informed prices because larger trades are required to move them. Volume weighting is a standard technique in financial price aggregation (Hayek 1945; Wolfers & Zitzewitz 2004 on prediction-market efficiency).
Strengths: Reflects "skin-in-the-game" weighting. Naturally down-weights play-money platforms.
Weaknesses: Linear averaging over-shrinks tail probabilities (Baron et al. 2014). Liquidity is not the same as accuracy — a single liquid but biased market dominates the consensus. Volume weighting in forecast aggregation has not been validated as more accurate than equal weighting in controlled comparisons.
Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126. doi:10.1257/0895330041371321
Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429. doi:10.1016/j.econlet.2006.01.004
What it does: For multi-outcome events, treats a missing candidate on a platform as information: the platform implicitly assigns it a probability below its listing threshold. We impute the missing value at threshold / 2 (where the threshold is the lowest listed candidate's probability on that platform), then compute a normalized average across all platforms.
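A minimal sketch of the full model (illustrative names; assumes every platform lists at least one candidate):

```python
def consensus_mnar_imputed(platform_probs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Impute unlisted candidates at (platform threshold) / 2, average across
    platforms, then renormalize so all candidates sum to 1.

    platform_probs maps platform -> {candidate: probability}.
    """
    candidates = {c for probs in platform_probs.values() for c in probs}
    raw = {}
    for cand in candidates:
        values = []
        for probs in platform_probs.values():
            threshold = min(probs.values())                # lowest listed probability
            values.append(probs.get(cand, threshold / 2))  # MNAR imputation
        raw[cand] = sum(values) / len(values)
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}          # normalize to sum to 1
```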
Why this matters: Missing candidates in prediction markets are not missing-completely-at-random (MCAR). Per Rubin's (1976) classification they are missing not at random (MNAR): the probability of being missing is related to the value itself (low-probability candidates are not listed). Standard mean imputation is biased under MNAR. Treating absence as a low-probability signal is one way to account for this without ignoring the missing data entirely.
Strengths: Captures information that "this platform considered the candidate not worth listing." Specifically designed for multi-outcome elections where candidate sets differ across platforms.
Weaknesses: The threshold / 2 imputation point is a heuristic, not derived from data. Has not been validated against resolution outcomes.
Why we think it will work: A platform that lists candidates down to some threshold and omits a candidate is telling us that its implicit probability for that candidate lies somewhere in [0, threshold]. threshold / 2 is the midpoint of that interval (see the short derivation below). We are not claiming it is the true value; we are claiming it is the least-biased point estimate consistent with what the platform's listing decision tells us.

Status: This model is our novel contribution. There is no direct published precedent. The MNAR framing comes from Rubin (1976); the broader missing-data literature (Little & Rubin 2019) provides the theoretical grounding. We treat the threshold/2 choice as a hypothesis to be tested against resolution data, not as a settled answer.
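For reference, the one-line calculation behind the midpoint claim above: under a uniform prior on [0, threshold] (our working assumption, not a derived fact),

```latex
\mathbb{E}[p \mid 0 \le p \le t] = \int_0^t p \cdot \frac{1}{t}\, dp = \frac{t}{2}
```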
Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592. doi:10.1093/biomet/63.3.581
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. doi:10.1002/9781119482260
What it does: Plain arithmetic average across platforms that explicitly listed the candidate. No imputation for missing candidates. Normalized so all candidates in a multi-outcome event sum to 100%.
Formula: consensus = mean(p_i) over listed platforms only.
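A minimal sketch (illustrative names):

```python
def consensus_simple(platform_probs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Mean over platforms that list each candidate; renormalize to sum to 1.
    platform_probs maps platform -> {candidate: probability}."""
    candidates = {c for probs in platform_probs.values() for c in probs}
    raw = {}
    for cand in candidates:
        listed = [probs[cand] for probs in platform_probs.values() if cand in probs]
        raw[cand] = sum(listed) / len(listed)   # no imputation for absent platforms
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```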
Rationale: The simplest possible aggregation rule. We include it specifically as a reference baseline — because it does no imputation, it has zero MNAR bias from imputation choices, making it useful for measuring how much imputation in other models actually changes the consensus.
Strengths: Trivially explainable. Zero imputation bias. Standard baseline in the forecast combination literature (Clemen 1989; Genest & Zidek 1986).
Weaknesses: Linear averages over-shrink tails (Baron et al. 2014). Ignores volume/liquidity. Treats absent platforms as if they don't exist, even when their absence is informative.
Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583. doi:10.1016/0169-2070(89)90012-5
Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135. doi:10.1214/ss/1177013825
What it does: Weights each platform by its historical accuracy on resolved markets. Accuracy is measured by Brier score — the squared difference between the platform's final probability and the realized outcome (Brier 1950). Lower Brier = more accurate = higher weight in the consensus. Per-category Brier scores allow the weighting to differ between sports, politics, and economics.
Formula: Bayesian Model Averaging-style weighting, where each platform's posterior weight decays exponentially with its historical Brier score: w_i = exp(-BrierScore_i / temperature).
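A minimal sketch of the weighting step; the temperature value here is an assumed tuning knob, not a published constant:

```python
import math

def brier_weights(brier_scores: dict[str, float], temperature: float = 0.1) -> dict[str, float]:
    """Softmax-style weights from per-platform Brier scores: lower Brier,
    higher weight. `temperature` controls how sharply accuracy differences
    translate into weight differences."""
    unnorm = {p: math.exp(-b / temperature) for p, b in brier_scores.items()}
    total = sum(unnorm.values())
    return {p: w / total for p, w in unnorm.items()}

# Example: Brier scores {"A": 0.12, "B": 0.18} at temperature 0.1
# give weights ≈ {"A": 0.646, "B": 0.354}.
```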
Rationale: Bayesian Model Averaging is the principled way to combine forecasts when you have evidence about each forecaster's reliability (Hoeting et al. 1999; Raftery et al. 2005). Tetlock's Good Judgment Project demonstrated that small groups of accurate forecasters ("superforecasters") consistently beat the crowd; this model is the platform-level analogue.
Status: Activation pending. We are collecting resolution data on every market we track. Activation requires ~100 resolved questions per platform per category. Until then this model returns "n/a" and the default remains Model B.
Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401. doi:10.1214/ss/1009212519
Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174. doi:10.1175/MWR2906.1
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255
What it does: A hybrid of Model B and Model A. Converts each probability to log-odds (Model B's transform), then takes a weighted average using sqrt(volume) as the weight (rather than equal weighting). Square root prevents one mega-volume market from dominating.
Formula: consensus = sigmoid( sum(sqrt(v_i) * logit(p_i)) / sum(sqrt(v_i)) )
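A minimal sketch (illustrative names):

```python
import math

def consensus_hybrid(probs: list[float], volumes: list[float]) -> float:
    """Log-odds pooling with sqrt(volume) weights."""
    weights = [math.sqrt(v) for v in volumes]
    logits = [math.log(p / (1 - p)) for p in probs]
    pooled = sum(w * x for w, x in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-pooled))

# Example: volumes [1_000_000, 10_000] -> raw-volume weighting would give
# the big market ~99% of the say; sqrt weights give it 1000/1100 ≈ 91%.
```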
Hypothesis: Combine the tail-preservation of log-odds pooling (Satopää et al. 2014) with the informed-price intuition of volume weighting (Wolfers & Zitzewitz 2004). Both components are independently studied; this specific combination is not.
Why we think it might work: Weighting by sqrt(volume) rather than raw volume prevents one mega-volume Polymarket market from drowning out three medium-volume markets. Square-root damping is a common way to temper heavy-tailed weights, the same sublinear-scaling logic behind log term frequencies in TF-IDF.

How we will know if it worked: Once we have ~100 resolved questions per platform we will compute Brier scores for both Model B and Model F on the same resolution set. Lower Brier wins. If Model F beats Model B by a statistically meaningful margin (paired Diebold-Mariano test, p < 0.05) we promote it to default. If not, we retire it.
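A minimal sketch of that comparison, assuming per-question loss differentials are independent (the classical Diebold-Mariano test uses a HAC variance estimate for serially correlated losses; names are illustrative):

```python
import math

def brier(prob: float, outcome: int) -> float:
    """Squared error between a forecast and the realized 0/1 outcome."""
    return (prob - outcome) ** 2

def diebold_mariano(losses_b: list[float], losses_f: list[float]) -> float:
    """DM statistic on the per-question loss differential d = loss_B - loss_F.
    Positive values favor Model F; |stat| > 1.96 is significant at p < 0.05
    under the large-sample normal approximation."""
    d = [lb - lf for lb, lf in zip(losses_b, losses_f)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```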
Status: Experimental. The components have separate peer-reviewed empirical validation but the specific hybrid has no published precedent. We compute it on every question so that once we have enough resolution data we can test whether it actually beats Model B in Brier score.
Satopää et al. (2014) and Wolfers & Zitzewitz (2004) — full citations under Models B and A above.
Before any model runs, every question goes through a 5-phase matching and validation pipeline.
Same-platform duplicates — e.g., a market relisted under a slightly different title — are merged using volume-weighted averaging before consensus is computed. Each platform contributes exactly one vote per question.
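A minimal sketch of that merge (illustrative names; assumes at least one duplicate has positive volume):

```python
def merge_duplicates(duplicates: list[tuple[float, float]]) -> float:
    """Collapse same-platform duplicate markets into one probability.
    `duplicates` is a list of (probability, volume) pairs from one platform."""
    total_volume = sum(v for _, v in duplicates)
    return sum(p * v for p, v in duplicates) / total_volume

# Example: the same question listed twice at (0.62, 80_000) and
# (0.58, 20_000) merges to 0.62 * 0.8 + 0.58 * 0.2 = 0.612.
```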
Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145. doi:10.1287/deca.2014.0293
Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583. doi:10.1016/0169-2070(89)90012-5
Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135. doi:10.1214/ss/1177013825
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401. doi:10.1214/ss/1009212519
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. doi:10.1002/9781119482260
Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429. doi:10.1016/j.econlet.2006.01.004
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255
Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174. doi:10.1175/MWR2906.1
Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592. doi:10.1093/biomet/63.3.581
Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356. doi:10.1016/j.ijforecast.2013.09.009
Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126. doi:10.1257/0895330041371321
The only prediction market aggregator that grounds its cross-platform consensus models in peer-reviewed forecast-aggregation research.