Leaderboards

Rigorous benchmarks, not cherry-picked results.

Design custom evaluations that measure the model capabilities you specify.

Evaluating AI Agents on Software Engineering

Assessing AI agents across real-world software engineering workflows—measuring how models navigate, reason, and execute complex development tasks.

IDE-Bench, Jan 20, 2026

Introductory Quantitative Trading

Evaluating AI models on real-world market scenarios—measuring how they reason, predict, and make decisions under dynamic conditions.

Market-Bench, Dec 13, 2025

AI Web App Generation

A benchmark for evaluating how well AI coding agents can generate real web apps from a single natural language prompt. One-shot generations. Zero human edits.

App-Bench, Oct 25, 2025

FinanceQA, Assumption-Based

Evaluating AI models on real-world financial analysis: measuring how they reason, interpret data, and make decisions under uncertainty.

FinanceArena, Jan 30, 2025