Hey HN! I'm Mert.

I built this because I was frustrated with LLM benchmarks potentially being contaminated by training data. When a model scores 99.9% on MMLU-Pro-Max, we can't tell if that's genuine reasoning or memorization.

Forecaster Arena tries to solve this by testing models on events that haven't happened yet—real prediction markets from Polymarket. The ground truth is reality itself, weeks or months later.

How it works:

- 7 frontier LLMs (GPT-5.1, Claude Opus 4.5, Gemini, Grok, DeepSeek, etc.; the lineup will be updated)
- Each gets $10k of virtual capital weekly
- They bet on 500+ real prediction markets
- Bet size = confidence (a larger bet means more confidence)
- We measure calibration (Brier score) and returns (P/L) — see the sketch below
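
To make the scoring concrete, here's a minimal sketch of the two metrics. The names and data shape are illustrative assumptions for this comment, not the repo's actual API; the real code handles fees, partial resolution, and weekly resets.

    # Illustrative sketch only -- not the repo's actual API.
    from dataclasses import dataclass

    @dataclass
    class Bet:
        prob_yes: float   # model's stated probability that the market resolves YES
        stake: float      # virtual dollars staked on the side the model favors
        price_yes: float  # market price of a YES share at bet time (0..1)
        outcome: int      # 1 if the market resolved YES, 0 if NO

    def brier_score(bets: list[Bet]) -> float:
        """Mean squared error between stated probabilities and outcomes (lower is better)."""
        return sum((b.prob_yes - b.outcome) ** 2 for b in bets) / len(bets)

    def profit_and_loss(bets: list[Bet]) -> float:
        """Simple P/L: buy YES if the model thinks YES is underpriced, else NO; winning shares pay $1."""
        pnl = 0.0
        for b in bets:
            if b.prob_yes >= b.price_yes:          # YES looks underpriced to the model
                shares = b.stake / b.price_yes
                pnl += shares * b.outcome - b.stake
            else:                                  # NO looks underpriced to the model
                shares = b.stake / (1 - b.price_yes)
                pnl += shares * (1 - b.outcome) - b.stake
        return pnl

    bets = [Bet(prob_yes=0.8, stake=500, price_yes=0.6, outcome=1),
            Bet(prob_yes=0.3, stake=200, price_yes=0.5, outcome=1)]
    print(brier_score(bets), profit_and_loss(bets))

The point of tracking both: Brier score rewards honest probability estimates regardless of money made, while P/L rewards beating the market's prices, so a model can be well calibrated and still lose money (or vice versa).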

The first cohort is currently running (started Dec 7). The first analysis with enough resolved markets to be statistically meaningful is expected over the next few weeks.

Everything is open source (MIT): https://github.com/setrf/forecasterarena

Happy to answer questions about the implementation or the trade-offs I made. I'd also love to hear your feedback on the methodology!