How we compute consensus probabilities and the peer-reviewed research behind each of our four models.
We compute four consensus models on every multi-platform question. Each model takes a different theoretical approach to combining probabilities. Model A (Inverse-Variance Weighted Log-Odds) is our default because it combines the strongest elements from the meta-analysis, forecast aggregation, and prediction market literatures.
Each platform contributes one probability per question. Same-platform duplicates are merged via volume-weighted averaging before any model runs, so no platform can vote twice. Sources that have not traded in 14+ days are excluded from consensus but still displayed (Rothschild 2009).
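The input preparation described above can be sketched as follows. This is a minimal illustration, not the production code: the `Listing` type, its field names, and `merge_by_platform` are hypothetical, and two behaviours are assumptions the text does not specify (a plain mean when no volume is reported, and judging staleness by the most recent trade among a platform's duplicates).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)  # inactivity cutoff stated in the text


@dataclass
class Listing:
    # Hypothetical record layout, for illustration only.
    platform: str
    probability: float
    volume: float
    last_trade: datetime


def merge_by_platform(listings, now):
    """Return {platform: (probability, total_volume, is_stale)}.

    Same-platform duplicates are collapsed with a volume-weighted average,
    so each platform ends up with exactly one probability.
    """
    grouped = {}
    for item in listings:
        grouped.setdefault(item.platform, []).append(item)

    merged = {}
    for platform, items in grouped.items():
        total_volume = sum(i.volume for i in items)
        if total_volume > 0:
            prob = sum(i.probability * i.volume for i in items) / total_volume
        else:
            # Assumption: fall back to a plain mean when no volume is reported.
            prob = sum(i.probability for i in items) / len(items)
        # Assumption: a platform is stale only if its newest trade is stale.
        newest = max(i.last_trade for i in items)
        merged[platform] = (prob, total_volume, now - newest >= STALE_AFTER)
    return merged
```

A caller would keep every entry for display and pass only those with `is_stale == False` to the consensus models.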
What Model A does: Converts each platform’s probability to log-odds, then takes a weighted average where higher-volume (or higher-participation) markets get more weight. The weights are derived from the inverse-variance framework used in medical meta-analysis since 1986. A shrinkage parameter λ blends these weights with equal weights to protect against estimation error with a small number of sources.
Formula: w_i = λ/K + (1−λ) × vol_i / Σvol, then consensus = sigmoid(Σ w_i × logit(p_i))
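A minimal Python sketch of this formula is shown below. The function name `model_a` and the `eps` clamp (which keeps `logit` finite at probabilities of exactly 0 or 1) are assumptions for illustration; the weights and the default λ = 0.5 follow the text.

```python
import math


def logit(p, eps=1e-6):
    # Assumption: clamp to (eps, 1 - eps) so the log-odds stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def model_a(probs, volumes, lam=0.5):
    """Shrinkage-weighted log-odds pool.

    w_i = lam / K + (1 - lam) * vol_i / sum(vol); the weights sum to 1.
    """
    k = len(probs)
    total = sum(volumes)
    if total <= 0:
        # No volume data: use equal weights, i.e. equal-weight log-odds pooling.
        weights = [1.0 / k] * k
    else:
        weights = [lam / k + (1.0 - lam) * v / total for v in volumes]
    return sigmoid(sum(w * logit(p) for w, p in zip(weights, probs)))
```

For example, `model_a([0.70, 0.50], [3_000_000, 6_000])` gives the high-volume source a weight of roughly 0.75 and returns a consensus between the two inputs, closer to 0.70.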
Why it’s the default: This model synthesizes three lines of peer-reviewed research:
Strengths: Higher-volume markets get more influence (a $3M Polymarket market outweighs a $6K Gemini market). Operates in log-odds space (preserves tail probabilities). Shrinkage protects against overfitting with small N. Gracefully degrades to Model B when no volume data is available.
Weaknesses: Assumes variance is inversely proportional to volume, which is approximate. The shrinkage parameter λ = 0.5 is a starting point; optimal λ requires calibration on resolved data.
DerSimonian, R., & Laird, N. (1986). “Meta-analysis in clinical trials.” Controlled Clinical Trials, 7(3), 177–188. doi:10.1016/0197-2456(86)90046-2
Liu, L., Hao, X., & Wang, Y. (2024). “Solving the forecast combination puzzle using double shrinkages.” Oxford Bulletin of Economics and Statistics. doi:10.1111/obes.12590
Arieli, I., Babichenko, Y., & Smorodinsky, R. (2018). “Robust forecast aggregation.” Proceedings of the National Academy of Sciences, 115(52), E12135–E12143. doi:10.1073/pnas.1813934115
What Model B does: Converts each platform’s probability to log-odds, takes the simple arithmetic mean of those log-odds, and converts back to a probability. Equal weight across all platforms.
Formula: consensus = sigmoid( mean( logit(p_i) ) )
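As a rough sketch, the formula translates to a few lines of Python. The name `model_b` and the `eps` clamp are illustrative assumptions; averaging log-odds this way is equivalent to taking the geometric mean of the odds.

```python
import math
from statistics import fmean


def model_b(probs, eps=1e-6):
    """Equal-weight log-odds pool: sigmoid(mean(logit(p_i)))."""
    # Assumption: clamp inputs so that 0 and 1 do not produce infinite log-odds.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    mean_log_odds = fmean([math.log(p / (1.0 - p)) for p in clipped])
    return 1.0 / (1.0 + math.exp(-mean_log_odds))
```

Because the pool is symmetric in p and 1−p, `model_b([0.8, 0.2])` returns 0.5.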
Role: Fallback model when volume/forecaster data is unavailable. Also serves as the equal-weight reference against which Model A’s volume-weighting can be compared. In the IARPA ACE forecasting tournament, logit pooling consistently outperformed linear averaging, especially at the tails.
Strengths: Strongest empirical validation for equal-weight aggregation. Handles extreme probabilities without over-shrinking. Satisfies External Bayesianity (Genest 1984). Symmetric in p and 1−p.
Weaknesses: Equal weighting ignores volume/liquidity differences between platforms. Sensitive to outliers if one platform is far from the others.
Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356. doi:10.1016/j.ijforecast.2013.09.009
Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145. doi:10.1287/deca.2014.0293
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794
What Model C does: Plain arithmetic average across all platforms. The simplest possible aggregation rule.
Formula: consensus = mean(p_i)
Role: Diagnostic baseline. Ranjan & Gneiting (2010) proved that a linear opinion pool of distinct calibrated forecasts is itself miscalibrated (underconfident, pulled toward 50%), so this model should not be used as a primary output. We include it as a sanity check: if Model A performs worse than this simple average on resolved data, something is broken.
Strengths: Trivially explainable. Standard baseline in the forecast combination literature (Clemen 1989).
Weaknesses: Provably miscalibrated (Ranjan & Gneiting 2010, Theorem 1). Over-shrinks tails (Baron et al. 2014). Ignores volume/liquidity.
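The tail behaviour can be seen in a small numerical comparison. The forecasts below are made-up illustrative values, and the helper names are hypothetical; the sketch only contrasts the linear average with an equal-weight log-odds pool on one example.

```python
import math


def linear_pool(probs):
    # Plain arithmetic mean of the probabilities.
    return sum(probs) / len(probs)


def logit_pool(probs):
    # Mean of log-odds, mapped back to a probability.
    mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / len(probs)
    return 1 / (1 + math.exp(-mean_log_odds))


forecasts = [0.99, 0.90, 0.95]  # illustrative values, all near the upper tail
print(f"linear: {linear_pool(forecasts):.3f}")
print(f"logit:  {logit_pool(forecasts):.3f}")
```

On these inputs the linear average comes out near 0.947 and the log-odds pool near 0.963, so the linear pool sits closer to 50% in this example.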
Clemen, R. T. (1989). “Combining forecasts: A review and annotated bibliography.” International Journal of Forecasting, 5(4), 559–583. doi:10.1016/0169-2070(89)90012-5
What Model D does: Weights each platform by its historical accuracy on resolved markets, computed in log-odds space to avoid the Ranjan-Gneiting miscalibration. Accuracy is measured by Brier score (Brier 1950). Lower Brier = more accurate = higher weight. Per-category scores allow the weighting to differ between sports, politics, and economics.
Formula: consensus = sigmoid(Σ w_i × logit(p_i) / Σ w_i), where w_i = 1 / BrierScore_i
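A minimal sketch of this formula, assuming per-platform Brier scores are already available. The name `model_d` and the `eps` guards (against a Brier score of exactly 0 and against probabilities of exactly 0 or 1) are illustrative assumptions.

```python
import math


def model_d(probs, brier_scores, eps=1e-6):
    """Brier-weighted log-odds pool with w_i = 1 / BrierScore_i."""
    # Assumption: floor the Brier score so a perfect record cannot divide by zero.
    weights = [1.0 / max(b, eps) for b in brier_scores]
    log_odds = []
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # keep the log-odds finite
        log_odds.append(math.log(p / (1.0 - p)))
    pooled = sum(w * x for w, x in zip(weights, log_odds)) / sum(weights)
    return 1.0 / (1.0 + math.exp(-pooled))
```

With `probs=[0.8, 0.6]` and `brier_scores=[0.1, 0.3]`, the first platform gets three times the weight of the second, so the result lands closer to 0.8 than to 0.6. When all Brier scores are equal, the weights cancel and the result matches equal-weight log-odds pooling.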
Rationale: Bayesian Model Averaging (Hoeting et al. 1999) is the principled way to combine forecasts when you have evidence about each source’s reliability. The Good Judgment Project demonstrated that weighting by track record consistently outperforms equal weighting when sufficient calibration data exists (Mellers et al. 2014). This version applies weights in log-odds space rather than linear space to avoid the Ranjan-Gneiting miscalibration.
Status: Activation pending. We are collecting resolution data on every market we track. Activation requires 30+ resolved questions per platform per category (Smith & Wallis 2009 show that with K sources, O(K²) events are needed before estimated weights reliably beat equal weights). Until then this model returns “n/a” and the default remains Model A.
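The activation rule can be sketched as a per-platform, per-category Brier computation that only reports cells with enough resolved questions. The input layout (tuples of platform, category, forecast probability, and 0/1 outcome) and the function name are hypothetical; the threshold of 30 comes from the text.

```python
from collections import defaultdict

MIN_RESOLVED = 30  # activation threshold stated in the text


def brier_by_platform_category(resolved):
    """Mean squared error per (platform, category) cell.

    `resolved` is an iterable of (platform, category, prob, outcome) tuples,
    with outcome equal to 0 or 1. Cells with fewer than MIN_RESOLVED rows are
    omitted, which a caller can treat as "n/a".
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [sum of squared errors, count]
    for platform, category, prob, outcome in resolved:
        cell = sums[(platform, category)]
        cell[0] += (prob - outcome) ** 2
        cell[1] += 1
    return {
        key: total / count
        for key, (total, count) in sums.items()
        if count >= MIN_RESOLVED
    }
```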
Brier, G. W. (1950). “Verification of forecasts expressed in terms of probability.” Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). “Bayesian Model Averaging: A Tutorial.” Statistical Science, 14(4), 382–401. doi:10.1214/ss/1009212519
Mellers, B., Ungar, L., Baron, J., et al. (2014). “Psychological strategies for winning a geopolitical forecasting tournament.” Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255
Not all prediction market sources are equal. A market with zero trading activity may show a stale price that reflects one trader's idiosyncratic belief rather than aggregated information. Including such sources in the consensus degrades accuracy rather than improving it.
The forecast aggregation literature consistently finds that curating which sources enter the pool improves aggregate accuracy:
No single canonical number exists in the literature, but several empirical studies provide reference points:
| Criterion | Threshold | Source |
|---|---|---|
| Min trades per market | ≥ 25 | Restocchi et al. 2023, J. of Forecasting |
| Min unique traders | ≥ 18–20 | Dreber et al. 2015, PNAS 112(50) |
| Inactivity exclusion | No trades in 14 days | Rothschild 2009, Public Opinion Quarterly |
| Bid-ask spread | ≤ 5 percentage points | Rothschild 2009 |
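For illustration, the reference points in the table can be expressed as a simple threshold check. This is a sketch of the literature values only, not a description of the filters actually applied; the class, function, and label names are hypothetical, and taking 18 as the trader minimum is an assumption based on the lower end of the 18–20 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiquidityThresholds:
    min_trades: int = 25         # Restocchi et al. 2023
    min_traders: int = 18        # Dreber et al. 2015 (assumed lower bound of 18-20)
    max_idle_days: int = 14      # Rothschild 2009
    max_spread_pp: float = 5.0   # Rothschild 2009, in percentage points


def failed_criteria(trades, traders, idle_days, spread_pp, t=LiquidityThresholds()):
    """Return the labels of the reference criteria a market does not meet."""
    failures = []
    if trades < t.min_trades:
        failures.append("min_trades")
    if traders < t.min_traders:
        failures.append("min_traders")
    if idle_days >= t.max_idle_days:
        failures.append("inactive")
    if spread_pp > t.max_spread_pp:
        failures.append("spread")
    return failures
```

An empty list means the market clears all four reference points.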
We apply two filters before a source enters the consensus pool:
Sources that pass both filters receive inverse-variance weight in Model A, where volume determines each platform’s weight (with shrinkage toward equal weights). Sources without volume data use forecaster count as a proxy.
Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). “The wisdom of select crowds.” Journal of Personality and Social Psychology, 107(2), 276–299. doi:10.1037/a0036677
Budescu, D. V., & Chen, E. (2015). “Identifying expertise to extract the wisdom of crowds.” Management Science, 61(2), 267–280. doi:10.1287/mnsc.2014.1909
Restocchi, V., McGroarty, F., & Sherrington, C. (2023). “Price formation in field prediction markets: the wisdom in the crowd.” Journal of Forecasting, 42(7). doi:10.1002/for.2974
Dreber, A., Pfeiffer, T., Almenberg, J., et al. (2015). “Using prediction markets to estimate the reproducibility of scientific research.” PNAS, 112(50), 15343–15347. doi:10.1073/pnas.1516179112
Rothschild, D. (2009). “Forecasting elections: comparing prediction markets, polls, and their biases.” Public Opinion Quarterly, 73(5), 895–916. doi:10.1093/poq/nfp082
Smith, J., & Wallis, K. F. (2009). “A simple explanation of the forecast combination puzzle.” Oxford Bulletin of Economics and Statistics, 71(3), 331–355. doi:10.1111/j.1468-0084.2008.00541.x
Our pipeline runs every 6 hours (00:00, 06:00, 12:00, 18:00 UTC). Each run:
[...] price_snapshots table.
[...] consensus_snapshots.
[...] resolutions table, and compute per-platform Brier scores for Model D activation.

Total data: ~72,000 markets across 5 platforms, ~1,000 cross-platform canonical questions with consensus.
We aim for transparency about what our consensus can and cannot do.
Four independent lines of evidence converge on log-odds pooling (geometric mean of odds) as the strongest non-extremized aggregation method:
Baron et al. (2014) showed that extremizing — multiplying the averaged log-odds by a factor d ≈ 2.0 before converting back to probability — improves accuracy for all aggregation methods including log-odds pooling. The improvement corrects two biases: (1) compression from averaging pushes the aggregate toward 50%, and (2) forecasters with shared information sources under-adjust for the collective evidence available to the group.
The optimal d depends on information overlap among sources. Prediction markets likely share high information overlap (they all react to the same news), so d = 2.0 may be too aggressive. We plan to calibrate d once we have sufficient resolution data, using the same Brier score approach described under Model D.
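Extremizing as described above can be sketched in a few lines. The function name `extremize` and the `eps` clamp are assumptions; d = 2.0 is the value quoted from Baron et al. (2014), and d = 1.0 leaves the equal-weight log-odds pool unchanged.

```python
import math


def extremize(probs, d=2.0, eps=1e-6):
    """Average the log-odds, multiply by d, and map back to a probability."""
    log_odds = []
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # assumption: keep log-odds finite
        log_odds.append(math.log(p / (1.0 - p)))
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1.0 / (1.0 + math.exp(-d * mean_log_odds))
```

With two sources at 0.70 and d = 2.0, the odds of 7:3 are squared to 49:9, giving 49/58 ≈ 0.845. A consensus of exactly 0.5 is unaffected by any d.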
Genest, C. (1984). “A characterization theorem for externally Bayesian groups.” Annals of Statistics, 12(3), 1100–1105. doi:10.1214/aos/1176346731
Ranjan, R., & Gneiting, T. (2010). “Combining probability forecasts.” Journal of the Royal Statistical Society, Series B, 72(1), 71–91. doi:10.1111/j.1467-9868.2009.00726.x
Atanasov, P., Rescober, P., Stone, E., et al. (2016). “Distilling the wisdom of crowds: prediction markets vs. prediction polls.” Management Science, 63(3), 691–706. doi:10.1287/mnsc.2015.2374
Before any model runs, every question goes through a 5-phase matching and validation pipeline:
Same-platform duplicates — e.g., a market relisted under a slightly different title — are merged using volume-weighted averaging before consensus is computed. Each platform contributes exactly one vote per question.
Three models from our original six-model lineup were retired in April 2026:
Atanasov, P., Rescober, P., Stone, E., et al. (2016). “Distilling the wisdom of crowds: prediction markets vs. prediction polls.” Management Science, 63(3), 691–706.
Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). “Two reasons to make aggregated probability forecasts more extreme.” Decision Analysis, 11(2), 133–145.
Brier, G. W. (1950). “Verification of forecasts expressed in terms of probability.” Monthly Weather Review, 78(1), 1–3.
Budescu, D. V., & Chen, E. (2015). “Identifying expertise to extract the wisdom of crowds.” Management Science, 61(2), 267–280.
Clemen, R. T. (1989). “Combining forecasts: A review and annotated bibliography.” International Journal of Forecasting, 5(4), 559–583.
Dreber, A., Pfeiffer, T., Almenberg, J., et al. (2015). “Using prediction markets to estimate the reproducibility of scientific research.” PNAS, 112(50), 15343–15347.
Genest, C. (1984). “A characterization theorem for externally Bayesian groups.” Annals of Statistics, 12(3), 1100–1105.
Genest, C., & Zidek, J. V. (1986). “Combining Probability Distributions: A Critique and an Annotated Bibliography.” Statistical Science, 1(1), 114–135.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). “Bayesian Model Averaging: A Tutorial.” Statistical Science, 14(4), 382–401.
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.
Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). “The wisdom of select crowds.” Journal of Personality and Social Psychology, 107(2), 276–299.
Manski, C. F. (2006). “Interpreting the predictions of prediction markets.” Economics Letters, 91(3), 425–429.
Mellers, B., Stone, E., Murray, T., et al. (2015). “Identifying and cultivating superforecasters as a method of improving probabilistic predictions.” Perspectives on Psychological Science, 10(3), 267–281.
Mellers, B., Ungar, L., Baron, J., et al. (2014). “Psychological strategies for winning a geopolitical forecasting tournament.” Psychological Science, 25(5), 1106–1115.
Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). “Using Bayesian Model Averaging to Calibrate Forecast Ensembles.” Monthly Weather Review, 133(5), 1155–1174.
Ranjan, R., & Gneiting, T. (2010). “Combining probability forecasts.” Journal of the Royal Statistical Society, Series B, 72(1), 71–91.
Restocchi, V., McGroarty, F., & Sherrington, C. (2023). “Price formation in field prediction markets: the wisdom in the crowd.” Journal of Forecasting, 42(7).
Rothschild, D. (2009). “Forecasting elections: comparing prediction markets, polls, and their biases.” Public Opinion Quarterly, 73(5), 895–916.
Rubin, D. B. (1976). “Inference and missing data.” Biometrika, 63(3), 581–592.
Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). “Combining multiple probability predictions using a simple logit model.” International Journal of Forecasting, 30(2), 344–356.
Smith, J., & Wallis, K. F. (2009). “A simple explanation of the forecast combination puzzle.” Oxford Bulletin of Economics and Statistics, 71(3), 331–355.
Wolfers, J., & Zitzewitz, E. (2004). “Prediction Markets.” Journal of Economic Perspectives, 18(2), 107–126.
Wolfers, J., & Zitzewitz, E. (2006). “Interpreting prediction market prices as probabilities.” NBER Working Paper No. 12200.