Coding Benchmark
Overview
1,000+ expert-authored coding tasks spanning 13 domains. Every task is human-built and human-reviewed, and each is converted from a real open task in a real repository.
The mean solution length is 248 lines of code (LOC), 2.3× SWE-Bench Pro's 107. Pass@1 ranges from 25% to 44% across the four evaluated agents, and 40% of tasks were not solved by any agent, even though every task was reviewed for quality.
Ships in Harbor, which lets teams run, score, and iterate on agents easily. Harbor is built by the creators of Terminal-Bench and used by virtually every frontier lab.

| Metric | Value | Detail |
|---|---|---|
| Mean Solution LOC | 248 | 2.3× SWE-Bench Pro's 107 |
| Empirically Difficult | 75% | Unsolved by at least 3 of the 4 frontier agents |
| Verified Solutions | 100% | Every task ships with a passing ground-truth solution |
| Real-World Examples | 100% | Converted from real open tasks in real repos |

Leaderboard
Ranked by pass@1. Cost and time are per-task averages.
| Rank | Agent | Pass@1 | Pass@3 | Cost | Time |
|---|---|---|---|---|---|
| 01 | Codex GPT-4.5 | 44% | 63% | $1.85 | 48 min |
| 02 | Terminus Gemini 1 Flash | 41% | 55% | $0.30 | 15 min |
| 03 | Gemini CLI Gemini 1 Flash | 39% | 45% | $0.72 | 35 min |
| 04 | Claude Code Haiku 4.5 | 25% | 37% | $0.39 | 27 min |
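
The page does not state how pass@1 and pass@3 are derived from the 3 trials run per agent. The following is a minimal sketch, assuming the widely used unbiased estimator 1 − C(n−c, k)/C(n, k), where n is the number of trials for a task and c is the number that passed. The function names and the sample data are illustrative and are not part of the benchmark or Harbor.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n trials, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing trials: every k-subset contains a pass.
        return 1.0
    # 1 minus the probability that k trials drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(passes_per_task: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not passes_per_task:
        raise ValueError("need at least one task")
    return sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)


# Hypothetical data: passing-trial counts for four tasks, 3 trials each.
sample = [0, 1, 3, 2]
p1 = mean_pass_at_k(sample, n=3, k=1)
p3 = mean_pass_at_k(sample, n=3, k=3)
```

With 3 trials per task, pass@1 under this estimator is the fraction of trials that passed, and pass@3 is 1 for a task if any trial passed and 0 otherwise.
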
Task Type Split
Even split across benchmark styles.

| Style | Share | Description |
|---|---|---|
| SWE-Bench Style | 50% | Real bug fixes in open-source repos: LiteLLM, ADK, n8n, and more |
| Terminal-Bench Style | 50% | Standalone systems challenges: distributed computing, networking, infrastructure |

Difficulty Distribution
Breakdown by frontier agent solve rate.
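
The bucketing rule behind this breakdown (0, 1, 2, or 3–4 of the 4 agents passing) can be sketched as follows. This is an illustration only: the page does not define when an agent counts as passing a task, so the per-task count of passing agents is taken as a given input, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def difficulty_bucket(agents_passing: int, total_agents: int = 4) -> str:
    """Map the number of agents that pass a task to its difficulty label."""
    if not (0 <= agents_passing <= total_agents):
        raise ValueError("agents_passing must be between 0 and total_agents")
    if agents_passing == 0:
        return "Hard"
    if agents_passing == 1:
        return "Medium-Hard"
    if agents_passing == 2:
        return "Medium"
    return "Easier"  # 3 or 4 agents pass


def bucket_counts(agents_passing_per_task: Iterable[int]) -> Counter:
    """Count how many tasks fall into each difficulty bucket."""
    return Counter(difficulty_bucket(a) for a in agents_passing_per_task)


# Hypothetical input: number of passing agents for five tasks.
counts = bucket_counts([0, 0, 1, 2, 4])
```
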

| Difficulty | Agents passing | Share | Tasks |
|---|---|---|---|
| Hard | 0 | 40% | 400 |
| Medium-Hard | 1 | 35% | 350 |
| Medium | 2 | 15% | 150 |
| Easier | 3–4 | 10% | 100 |

Domain Distribution
Task allocation across 13 domains.
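
| Domain | Share |
|---|---|
| Web Frameworks | 14% |
| Distributed Systems | 12% |
| LLM/AI Systems | 11% |
| Multi-Tenant SaaS | 9% |
| Networking | 8% |
| OSS Bug Fixes | 8% |
| Financial/Quant | 7% |
| DevOps/K8s | 6% |
| Integration | 6% |
| Security | 5% |
| Data Structures | 5% |
| Stream Processing | 5% |
| Game Logic | 4% |
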
Pipeline Status
From sourcing to benchmark availability.

| Status | Milestone |
|---|---|
| DONE | 1,000+ expert-authored coding tasks from senior engineers |
| DONE | 4 agents × 3 trials, verifier validation, verified solution/NOP baselines |
| NOW | Available now: full results, pass@k, trajectory analysis |
| +2–4 WEEKS | Domain-specific tasks built to your requirements |