Research

Data quality makes all the difference.

We’re driven by the conviction that model performance is fundamentally bounded by training data quality. Through expert collaboration, rigorous curation methodologies, and deep domain expertise, we research datasets that power tomorrow’s models.

SpreadsheetBench 2

Evaluating LLM agents on challenging, expert-curated, end-to-end spreadsheet tasks: financial modeling, debugging, and visualization in complex multi-sheet workbooks.

View benchmark

IDE-Bench

A comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface.

Read paper

How AfterQuery Helped NVIDIA Hill-Climb GDPval

Spencer M., Carlos G. · Jul 2, 2026

Read blog

How we achieved a net win-loss margin of +21.4% on GDPval with on-policy distillation

Michael E., Spencer M. · Jun 8, 2026

Read blog

Why DeployCo and ServiceCo Are Betting on the Last Mile

Sam J., Agustin G., Drew · Jun 3, 2026

Read blog

Solving the Last Mile Problem in Partnership with The Raine Group

Carlos G., Sam J. · Apr 28, 2026

Read blog

Human expertise, reimagined

Spencer M. · Apr 9, 2026

Read blog

How AfterQuery Expert Data Drives Model Performance on τ²-bench

Michael E., Spencer M., Arya F. · Apr 8, 2026

Read blog

How We Improved Terminal-Bench 2.0 Scores by Over 5x Using Tinker and Harbor

Spencer M., Michael E., Carlos G. · Mar 31, 2026

Read blog

IDE-Bench: Evaluating Large Language Models as IDE Agents

Spencer M., Jeff Y., Tiana C. · Jan 20, 2026

Read paper

Market-Bench: Evaluating LLMs on Introductory Quantitative Trading

Abhay S., Sam J., Spencer M. · Dec 13, 2025

Read paper

App-Bench: Evaluating Coding Agents on Generating Economically Useful Web-Apps

Andrew Z., Sam J., Spencer M. · Oct 25, 2025

Read paper

The AfterQuery Thesis

Spencer M. · Oct 20, 2025

Read blog

UI-Bench: A Benchmark for Evaluating User Interface Understanding

Sam J., Agustin G., Spencer M. · Aug 28, 2025

Read paper

FinanceQA: A Benchmark for Assumption-Based Financial Analysis

Spencer M., Sam J. · Jan 30, 2025

Read paper

Core Research Areas

Computer Use

We’ve created training data and reinforcement learning environments that teach AI agents to navigate real software workflows end-to-end, capturing judgment calls and edge cases that only experienced practitioners recognize.