Methodology

How we compute consensus probabilities and the peer-reviewed research behind each model.

Overview

We compute six different consensus models on every multi-platform question. Each model takes a different theoretical approach to combining probabilities. Model B (Geometric Mean of Odds) is our default because it has the strongest empirical validation in the forecast aggregation literature, but we report all six so users and researchers can compare.

Each platform contributes one probability per question. Same-platform duplicates are merged via volume-weighted averaging before any model runs, so no platform can vote twice.

Model B — Geometric Mean of Odds DEFAULT

What it does: Converts each platform's probability to log-odds, takes the simple arithmetic mean of those log-odds, and converts back to a probability. Equal weight across all platforms.

Formula: consensus = sigmoid( mean( log(p / (1 - p)) ) )

Why it's the default: In the IARPA ACE forecasting tournament — one of the largest controlled comparisons of probability aggregation methods to date — logit pooling consistently outperformed linear averaging, especially in the tails (probabilities near 0 or 1). Linear averages over-moderate extreme forecasts; log-odds pooling preserves them. Satopää et al. (2014) formalized this and found logit pooling to be the most accurate aggregation rule across thousands of geopolitical questions.
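
The formula can be sketched in a few lines of Python (a minimal illustration; the function name and clipping constant are ours, not part of any platform API):

```python
import math

def geometric_mean_of_odds(probs, eps=1e-6):
    """Model B sketch: arithmetic mean in log-odds space, mapped back
    through the sigmoid. Equivalent to the geometric mean of the odds."""
    clipped = [min(max(p, eps), 1 - eps) for p in probs]  # guard against log(0)
    mean_logit = sum(math.log(p / (1 - p)) for p in clipped) / len(clipped)
    return 1 / (1 + math.exp(-mean_logit))
```

Note the tail behavior: for inputs 0.01 and 0.02 the linear mean is 0.015, while the log-odds mean lands near 0.0142, slightly more extreme (closer to 0), which is the tail-preservation property described above.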

Strengths: Strongest empirical validation. Handles extreme probabilities without over-shrinking. Symmetric in p and 1−p.

Weaknesses: Equal weighting ignores volume/liquidity differences between platforms. Sensitive to outliers if one platform is far from the others.

Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356. doi:10.1016/j.ijforecast.2013.09.009

Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145. doi:10.1287/deca.2014.0293

Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794

Model A — Liquidity-Weighted Linear

What it does: A weighted arithmetic average where each platform's weight equals its liquidity (or volume if liquidity is unavailable). Real-money platforms with deep order books dominate.

Formula: consensus = sum(p_i * w_i) / sum(w_i) where w_i is liquidity.

Rationale: Higher-liquidity markets are presumed to have more informed prices because larger trades are required to move them. Volume weighting is a standard technique in financial price aggregation (Hayek 1945; Wolfers & Zitzewitz 2004 on prediction-market efficiency).
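
A sketch of the weighted average (illustrative only; `weights` stands in for the per-platform liquidity figures):

```python
def liquidity_weighted_linear(probs, weights):
    """Model A sketch: arithmetic mean weighted by liquidity (or volume)."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total
```

For example, probabilities [0.2, 0.8] with liquidity weights [3, 1] give 0.35: the deeper market pulls the consensus toward its own price.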

Strengths: Reflects "skin-in-the-game" weighting. Naturally down-weights play-money platforms.

Weaknesses: Linear averaging over-shrinks tail probabilities (Baron et al. 2014). Liquidity is not the same as accuracy — a single liquid but biased market dominates the consensus. Volume weighting in forecast aggregation has not been validated as more accurate than equal weighting in controlled comparisons.

Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126. doi:10.1257/0895330041371321

Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429. doi:10.1016/j.econlet.2006.01.004

Model C — Threshold Imputation NOVEL

What it does: For multi-outcome events, treats a missing candidate on a platform as information — the platform implicitly assigns it less than the listing threshold. We impute the missing value at threshold / 2 (where the threshold is the lowest listed candidate's probability on that platform), then compute a normalized average across all platforms.

Why this matters: Missing candidates in prediction markets are not missing-completely-at-random (MCAR). Per Rubin's (1976) classification they are missing not at random (MNAR): the probability of being missing is related to the value itself (low-probability candidates are not listed). Standard mean imputation is biased under MNAR. Treating absence as a low-probability signal is one way to account for this without ignoring the missing data entirely.

Strengths: Captures information that "this platform considered the candidate not worth listing." Specifically designed for multi-outcome elections where candidate sets differ across platforms.

Weaknesses: The threshold / 2 imputation point is a heuristic, not derived from data. Has not been validated against resolution outcomes.

Why we think it will work:

  1. Listing thresholds carry information. Prediction market operators do not list every conceivable candidate — they list the ones with non-trivial trading interest. The decision to not list a candidate is a real-world signal that the operator (and the bettors who would have to provide order flow) consider the probability too low to support a market. Throwing away that signal — the way Model D does — treats the absence as if the platform never had an opinion, when in fact it had a strong negative one.
  2. Threshold/2 is a defensible point estimate under MNAR. Rubin's (1976) classification tells us standard imputation is biased here, but it does not tell us to ignore the data. In a one-sided censoring problem — where missing always means "below some cutoff" — the midpoint between zero and the cutoff is the mean of the maximum-entropy (uniform) distribution over [0, threshold], the natural point estimate given only that the value lies somewhere in that interval. We are not claiming it is the true value; we are claiming it is the least-biased point estimate consistent with what the platform's listing decision tells us.
  3. It self-corrects across platforms. A candidate listed on Polymarket at 4% but missing from Manifold and PredictIt gets imputed at roughly 0.5% on the missing platforms. After normalization across all candidates this pulls the consensus down toward the platforms that took the candidate seriously, which is the correct direction in MNAR.
  4. Empirical falsification path. If the threshold/2 imputation is wrong, Model D (no imputation) will produce a lower Brier score than Model C on resolved low-probability candidates. This is testable, which is why we run both side by side.
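
The imputation and normalization steps can be sketched as follows (a simplified illustration; the input shape and names are ours, and the production pipeline's threshold detection may differ):

```python
def threshold_imputed_consensus(platforms):
    """Model C sketch: `platforms` is a list of {candidate: probability}
    dicts, one per platform. A candidate missing from a platform is imputed
    at half that platform's lowest listed probability (threshold / 2), then
    per-candidate means are normalized to sum to 1."""
    candidates = set()
    for listing in platforms:
        candidates |= listing.keys()
    raw = {}
    for c in candidates:
        vals = []
        for listing in platforms:
            if c in listing:
                vals.append(listing[c])
            else:
                vals.append(min(listing.values()) / 2)  # threshold / 2 imputation
        raw[c] = sum(vals) / len(vals)
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```

A candidate listed only on one platform still receives a small but nonzero contribution from the platforms that declined to list it, which is exactly the negative signal the model is designed to capture.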

Status: This model is our novel contribution. There is no direct published precedent. The MNAR framing comes from Rubin (1976); the broader missing-data literature (Little & Rubin 2019) provides the theoretical grounding. We treat the threshold/2 choice as a hypothesis to be tested against resolution data, not as a settled answer.

Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592. doi:10.1093/biomet/63.3.581

Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. doi:10.1002/9781119482260

Model D — Equal Weight Linear (MNAR-safe baseline)

What it does: Plain arithmetic average across platforms that explicitly listed the candidate. No imputation for missing candidates. Normalized so all candidates in a multi-outcome event sum to 100%.

Formula: consensus = mean(p_i) over listed platforms only.

Rationale: The simplest possible aggregation rule. We include it specifically as a reference baseline — because it does no imputation, it has zero MNAR bias from imputation choices, making it useful for measuring how much imputation in other models actually changes the consensus.
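
A minimal sketch (assuming the candidates of one event have already been aligned across platforms; names are illustrative):

```python
def equal_weight_consensus(event):
    """Model D sketch: `event` maps candidate -> list of probabilities from
    the platforms that list it. Mean over listing platforms only, then
    normalized so all candidates sum to 1."""
    raw = {c: sum(ps) / len(ps) for c, ps in event.items()}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}
```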

Strengths: Trivially explainable. Zero imputation bias. Standard baseline in the forecast combination literature (Clemen 1989; Genest & Zidek 1986).

Weaknesses: Linear averages over-shrink tails (Baron et al. 2014). Ignores volume/liquidity. Treats absent platforms as if they don't exist, even when their absence is informative.

Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583. doi:10.1016/0169-2070(89)90012-5

Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135. doi:10.1214/ss/1177013825

Model E — Bayesian Track Record PENDING DATA

What it does: Weights each platform by its historical accuracy on resolved markets. Accuracy is measured by Brier score — the squared difference between the platform's final probability and the realized outcome (Brier 1950). Lower Brier = more accurate = higher weight in the consensus. Per-category Brier scores allow the weighting to differ between sports, politics, and economics.

Formula: Bayesian Model Averaging-style weighting, with each platform's weight proportional to an exponential score of its historical accuracy: w_i ∝ exp(-BrierScore_i / temperature), normalized so the weights sum to 1.

Rationale: Bayesian Model Averaging is the principled way to combine forecasts when you have evidence about each forecaster's reliability (Hoeting et al. 1999; Raftery et al. 2005). Tetlock's Good Judgment Project demonstrated that small groups of accurate forecasters ("superforecasters") consistently beat the crowd; this model is the platform-level analogue.
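
The weighting step amounts to a softmax over negative Brier scores, sketched below (illustrative; the temperature value shown is a placeholder, not a production setting):

```python
import math

def track_record_weights(brier_scores, temperature=0.1):
    """Model E weighting sketch: softmax over negative Brier scores.
    Lower Brier (more accurate) -> larger weight; a smaller temperature
    concentrates weight on the most accurate platforms."""
    raw = {p: math.exp(-b / temperature) for p, b in brier_scores.items()}
    total = sum(raw.values())
    return {p: v / total for p, v in raw.items()}
```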

Status: Activation pending. We are collecting resolution data on every market we track. Activation requires ~100 resolved questions per platform per category. Until then this model returns "n/a" and the default remains Model B.

Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401. doi:10.1214/ss/1009212519

Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174. doi:10.1175/MWR2906.1

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255

Model F — Volume-Weighted Log-Odds (experimental hybrid) UNTESTED

What it does: A hybrid of Model B and Model A. Converts each probability to log-odds (Model B's transform), then takes a weighted average using sqrt(volume) as the weight (rather than equal weighting). Square root prevents one mega-volume market from dominating.

Formula: consensus = sigmoid( sum(sqrt(v_i) * logit(p_i)) / sum(sqrt(v_i)) )

Hypothesis: Combine the tail-preservation of log-odds pooling (Satopää et al. 2014) with the informed-price intuition of volume weighting (Wolfers & Zitzewitz 2004). Both components are independently studied; this specific combination is not.
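
A sketch of the hybrid (illustrative names; probabilities are clipped before the logit transform, as in the Model B sketch):

```python
import math

def volume_weighted_log_odds(probs, volumes, eps=1e-6):
    """Model F sketch: sqrt(volume)-weighted mean in log-odds space,
    mapped back through the sigmoid."""
    weights = [math.sqrt(v) for v in volumes]
    logits = []
    for p in probs:
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        logits.append(math.log(p / (1 - p)))
    mean_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    return 1 / (1 + math.exp(-mean_logit))
```

With equal volumes this reduces to Model B; with very unequal volumes the consensus moves toward the high-volume market's price, but only at the square-root rate.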

Why we think it might work:

  1. Both inputs are individually validated. Log-odds pooling has the strongest tournament evidence for tail-preservation (Satopää et al. 2014). Volume/liquidity is the most informative auxiliary signal we have about which platform's price is more credible (Wolfers & Zitzewitz 2004 on prediction-market efficiency; Manski 2006 on the relationship between price and informed belief). The hybrid asks: if both signals are individually useful, does combining them in the principled way — weight in the log-odds domain rather than the probability domain — do better than either alone?
  2. Log-odds is the right space to weight in. Log-odds is the additive scale for probabilistic evidence (Good 1950 on weight-of-evidence). Weighting probabilities directly (as Model A does) compresses the tails because the underlying scale is non-linear. Weighting log-odds preserves the tail-preservation property of Model B even when the weights are unequal — the math works out cleanly because log-odds is closed under weighted averaging.
  3. Square-root weighting is a known regularizer. Using sqrt(volume) rather than raw volume prevents one mega-volume Polymarket market from drowning out three medium-volume markets. The square root is a concave dampening transform: it shrinks the influence of very large weights while preserving their ordering, the same logic by which TF-IDF dampens raw term counts with a log.
  4. If volume really is a quality signal, this should beat Model B. Model B treats a $50M Polymarket market and a $5K Manifold market as equal voters. Either that is correct (volume is uninformative about accuracy — Model B wins) or it is incorrect (volume carries signal — Model F wins). We do not know which a priori, which is exactly why this is an experiment.

How we will know if it worked: Once we have ~100 resolved questions per platform we will compute Brier scores for both Model B and Model F on the same resolution set. Lower Brier wins. If Model F beats Model B by a statistically meaningful margin (paired Diebold-Mariano test, p < 0.05) we promote it to default. If not, we retire it.
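
The scoring rule used for that comparison is straightforward (a minimal sketch over (probability, outcome) pairs; the Diebold-Mariano significance test is not shown):

```python
def mean_brier(pairs):
    """Mean Brier score over (final probability, resolved outcome) pairs,
    where the outcome is 1 if the event happened and 0 otherwise.
    Lower is better; a perfect forecaster scores 0."""
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)
```

Comparing Model B and Model F means computing this over the same resolution set for both and testing whether the per-question differences are significantly different from zero.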

Status: Experimental. The components have separate peer-reviewed empirical validation but the specific hybrid has no published precedent. We compute it on every question so that once we have enough resolution data we can test whether it actually beats Model B in Brier score.

Satopää et al. (2014) and Wolfers & Zitzewitz (2004) — full citations under Models B and A above.

Quality Assurance Pipeline

Before any model runs, every question goes through a 5-phase matching and validation pipeline:

  1. Parent event matching: Multi-outcome events (e.g., "2028 GOP nominee") are matched at the event level first, so candidates can be aligned across platforms.
  2. Candidate alignment: Within matched parent events, individual candidates are paired across platforms using slug and title similarity.
  3. LLM-verified binary matching: For binary questions, candidate pairs are verified by an LLM (Claude Haiku) using the title and resolution criteria.
  4. Star graph merge: Prevents false transitive merges (e.g., A↔B and B↔C should not always imply A↔C). Uses a connected-component check on the match graph.
  5. Opus validation: Final pass with a stronger LLM (Claude Opus) catches inverted phrasings ("Trump wins" vs "Trump loses"), scope mismatches, and timeframe mismatches.

Same-platform duplicates — e.g., a market relisted under a slightly different title — are merged using volume-weighted averaging before consensus is computed. Each platform contributes exactly one vote per question.

Full reference list

Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E., & Ungar, L. H. (2014). "Two reasons to make aggregated probability forecasts more extreme." Decision Analysis, 11(2), 133–145.

Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1–3.

Clemen, R. T. (1989). "Combining forecasts: A review and annotated bibliography." International Journal of Forecasting, 5(4), 559–583.

Genest, C., & Zidek, J. V. (1986). "Combining Probability Distributions: A Critique and an Annotated Bibliography." Statistical Science, 1(1), 114–135.

Good, I. J. (1950). Probability and the Weighing of Evidence. Charles Griffin.

Hayek, F. A. (1945). "The Use of Knowledge in Society." American Economic Review, 35(4), 519–530.

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial." Statistical Science, 14(4), 382–401.

Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.

Manski, C. F. (2006). "Interpreting the predictions of prediction markets." Economics Letters, 91(3), 425–429.

Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). "Identifying and cultivating superforecasters as a method of improving probabilistic predictions." Perspectives on Psychological Science, 10(3), 267–281.

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., et al. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106–1115.

Raftery, A. E., Gneiting, T., Balabdaoui, F., & Polakowski, M. (2005). "Using Bayesian Model Averaging to Calibrate Forecast Ensembles." Monthly Weather Review, 133(5), 1155–1174.

Rubin, D. B. (1976). "Inference and missing data." Biometrika, 63(3), 581–592.

Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344–356.

Wolfers, J., & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107–126.

Odds Raven

Cross-platform prediction market consensus, grounded in the peer-reviewed forecast aggregation literature.

© 2026 Odds Raven · Data refreshed every 6 hours · peer-reviewed methodology