December 2, 2025

SpreadsheetBench Verified

In collaboration with the original authors, we release a curated subset of 400 human-validated tasks optimized for reliable automated evaluation.

Download Dataset

Posted by the Shortcut Research Team

SpreadsheetBench is the standard public benchmark for evaluating spreadsheet agents—with SOTA improving from 20% to 68.9% over the past year. As agents approach human-level performance, reliable evaluation becomes critical.

In collaboration with the original SpreadsheetBench authors, we're releasing SpreadsheetBench Verified: 400 human-validated tasks optimized for automated evaluation. We hope this contribution helps the research community measure progress more reliably.

Why Curation Was Needed

SpreadsheetBench's strength is its realism—tasks come from actual Excel forum questions with all the messiness of real-world spreadsheets. But this realism creates evaluation challenges. SpreadsheetBench evaluates by comparing the agent's output cells against expected values in a golden file—any mismatch counts as failure. This strict matching requires tasks to have exactly one correct answer, but many real-world questions don't:

  • Underspecified instructions — ambiguous wording, unclear handling of edge cases, unspecified date formats
  • Multiple valid interpretations — tasks where reasonable agents could produce different correct answers
  • Volatile functions — RAND(), TODAY(), NOW() produce non-deterministic outputs
  • Formatting requirements — colors, borders, fonts that can't be verified programmatically

Over half the original tasks required clarification or modification for reliable automated scoring.
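To make the scoring described above concrete, here is a minimal sketch of strict cell matching against a golden file, assuming openpyxl; the file paths, sheet name, and answer range are hypothetical, and the actual SpreadsheetBench harness may differ in its details.

```python
# Minimal sketch of strict cell-value matching against a golden file.
# Assumes openpyxl; the paths, sheet name, and answer range below are
# hypothetical, and the real SpreadsheetBench harness may differ.
from openpyxl import load_workbook

def strict_match(output_path, golden_path, sheet, answer_range):
    out_ws = load_workbook(output_path, data_only=True)[sheet]
    gold_ws = load_workbook(golden_path, data_only=True)[sheet]
    for out_row, gold_row in zip(out_ws[answer_range], gold_ws[answer_range]):
        for out_cell, gold_cell in zip(out_row, gold_row):
            if out_cell.value != gold_cell.value:  # any single mismatch fails the task
                return False
    return True

# A task passes only if every cell in the answer range matches exactly:
# passed = strict_match("agent_output.xlsx", "golden.xlsx", "Sheet1", "D2:D50")
```

Under matching this strict, an ambiguous instruction or a volatile function is enough to make a correct-looking solution score as a failure.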

Curation Process

We ran a four-layer review pipeline to maximize clarity and reproducibility while preserving the benchmark's character.

L0 — Automated Consistency Testing

We ran Shortcut on each task multiple times to flag ambiguities or inconsistencies between the initial spreadsheet and expected output files. Tasks with stable results across runs that still failed evaluation were prioritized for human review, enabling us to systematically identify evaluation issues at scale.
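The triage idea behind this step can be sketched as follows; the function name and data shapes are illustrative, not the actual pipeline.

```python
# Hedged sketch of the L0 triage logic: if the agent produces the same answer
# on every run but that answer never passes the golden-file comparison, the
# task likely has an evaluation issue rather than an agent error.
def flag_for_review(run_outcomes):
    """run_outcomes: list of (answer, passed) pairs, one per run of the agent."""
    answers = [answer for answer, _ in run_outcomes]
    consistent = len(set(answers)) == 1            # same answer across every run
    never_passed = not any(passed for _, passed in run_outcomes)
    return consistent and never_passed             # stable yet failing -> human review

# Example: three runs that agree on "42" but all fail evaluation get flagged.
# flag_for_review([("42", False), ("42", False), ("42", False)])  # -> True
```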

L1 — External Spreadsheet Specialists

We recruited dozens of spreadsheet professionals and assessed their ability to navigate real SpreadsheetBench tasks. Only about 20% passed our qualification assessment and were hired as annotators. These specialists reviewed every task, recommending clarifications, targeted corrections, or removal when a task could not be made reliably evaluable.

L2 — Internal Expert Review

Our internal data team validated and refined the L1 proposals, clarifying ambiguous edits and removing tasks that remained unverifiable even after revision.

L3 — Final Review by SpreadsheetBench Authors

The original SpreadsheetBench authors reviewed the full curated set, ensuring edits preserved the benchmark's intent. They also removed tasks that a simple spreadsheet agent scaffold could already solve, deeming them too easy.

The Dataset

| Stage              | Tasks | Notes                                                                       |
|--------------------|-------|-----------------------------------------------------------------------------|
| Original           | 912   | SpreadsheetBench dataset                                                    |
| After verification | 696   | 382 modified (375 queries, 93 golden files, 45 initial files), 216 removed  |
| Final              | 400   | 296 filtered by SpreadsheetBench team (too easy)                            |

Validation

To confirm the curation produced well-specified, solvable tasks, we ran Shortcut on all 400 tasks (4 attempts each). The 98.2% pass@4 indicates that nearly every task has an unambiguous, reachable correct answer, which was the goal of this curation effort.

The gap between average (86%) and pass@4 (98.2%) highlights that reliable single-attempt performance remains an open challenge, even on well-specified tasks.

[Chart] Shortcut scores on SpreadsheetBench Verified: Run 1 85.3%, Run 2 84.8%, Run 3 85.1%, Run 4 88.4%; average 86%; pass@4 98.2%.
Pass@4: best score across 4 attempts per task.
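To show how these aggregates relate to the per-run results, here is a minimal sketch of computing the per-run scores, the average, and pass@4 from pass/fail outcomes; the function name and data layout are hypothetical.

```python
# Illustrative computation of per-run accuracy, the average, and pass@4,
# where a task counts toward pass@4 if any of its attempts passes.
def summarize(results):
    """results: dict mapping task_id -> list of booleans (pass/fail per attempt)."""
    n_tasks = len(results)
    n_runs = len(next(iter(results.values())))
    per_run = [sum(runs[i] for runs in results.values()) / n_tasks
               for i in range(n_runs)]
    average = sum(per_run) / n_runs
    pass_at_k = sum(any(runs) for runs in results.values()) / n_tasks
    return per_run, average, pass_at_k

# Example with two tasks and four attempts each:
# summarize({"task_1": [True, True, True, True],
#            "task_2": [False, False, False, True]})
# -> per-run [0.5, 0.5, 0.5, 1.0], average 0.625, pass@4 1.0
```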

Conclusion

Spreadsheet agents are solving tasks that seemed nearly impossible only a year ago. SpreadsheetBench Verified is a contribution back to the research ecosystem—a cleaner, more reliably evaluable version of a benchmark we rely on ourselves.

Important limitations remain. The benchmark only evaluates whether cell values are correct—but real-world spreadsheet work demands much more: formula correctness and elegance, logical placement of results, pivot tables, charts, and clear formatting. These capabilities are essential for generating economic value, as professionals need spreadsheets that are not just accurate but also maintainable, auditable, and presentation-ready. Developing programmatic evaluation for these dimensions—and generating sufficiently difficult and realistic tasks—remain open problems.

We encourage other teams to evaluate their agents on this dataset and report results.

A note on selection bias: prioritizing evaluability over difficulty preservation may make this subset easier than the original benchmark overall.

Contributors: Peter Wang, Richard Pham, Frankie Li, Jeremy Flint, Feitong Yang, Shuying Luo, Nico Christie, Robert Yang

Special thanks: Jordi Lorido, Carli Cooperstein, Zeyao Ma