Coding Benchmark

Overview

1,000+ expert-authored coding tasks spanning 13 domains. Every task is built and reviewed by humans and converted from a real open task in a real repository.

Mean solution complexity is 248 LOC (2.3× SWE-Bench Pro). Pass@1 across agents ranges from 25% to 44%, and 40% of tasks are never solved by any agent, even though every task is human-built and reviewed for quality.

Ships in Harbor, which lets teams run, score, and iterate on agents easily. Harbor is built by the creators of Terminal-Bench and used by virtually every frontier lab.

Mean Solution LOC

248
2.3× SWE-Bench Pro's 107

Empirically Difficult

75%
Unsolved by 3+ frontier agents

Verified Solutions

100%
Every task ships with a passing ground-truth solution

Real-World Examples

100%
Converted from real repo open tasks

Leaderboard

Ranked by pass@1. Cost and time are per-task averages.

Rank  Agent        Model           Pass@1  Pass@3  Cost   Time
01    Codex        GPT-4.5         44%     63%     $1.85  48 min
02    Terminus     Gemini 1 Flash  41%     55%     $0.30  15 min
03    Gemini CLI   Gemini 1 Flash  39%     45%     $0.72  35 min
04    Claude Code  Haiku 4.5       25%     37%     $0.39  27 min
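The page does not say how pass@1 and pass@3 are derived from the trials. One common choice is the unbiased pass@k estimator, averaged over tasks. The sketch below assumes n independent trials per task and uses made-up success counts; the function names are illustrative and are not part of Harbor.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n trials, c of them successful."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a success.
        return 1.0
    # 1 minus the probability that all k sampled trials are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(successes_per_task: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimate over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)


# Hypothetical data: four tasks, three trials each, successes per task.
example_successes = [3, 1, 0, 2]
pass1 = benchmark_pass_at_k(example_successes, n=3, k=1)
pass3 = benchmark_pass_at_k(example_successes, n=3, k=3)
print(f"pass@1={pass1:.2f} pass@3={pass3:.2f}")
```

With three trials per task, pass@1 reduces to the mean per-task success fraction, and pass@3 to the share of tasks solved at least once.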

Task Type Split

Even split across benchmark styles.

SWE-Bench Style (50%): Real bug fixes in open-source repos such as LiteLLM, ADK, n8n, and more
Terminal-Bench Style (50%): Standalone systems challenges in distributed computing, networking, and infrastructure

Difficulty Distribution

Breakdown by frontier agent solve rate.

Difficulty   Agents passing  Share  Tasks
Hard         0               40%    400
Medium-Hard  1               35%    350
Medium       2               15%    150
Easier       3–4             10%    100
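The bucketing above can be sketched as a lookup from the number of agents that pass a task to a difficulty label. This is an illustration only: the page does not say whether "passes" means any trial or a majority of trials, and the split between tasks solved by 3 and by 4 agents is not given, so the sample data below places all Easier tasks at 3.

```python
from collections import Counter
from typing import Dict, Sequence

# Labels as listed on the page, keyed by how many of the 4 agents pass.
BUCKETS = {0: "Hard", 1: "Medium-Hard", 2: "Medium", 3: "Easier", 4: "Easier"}
LABEL_ORDER = ("Hard", "Medium-Hard", "Medium", "Easier")


def difficulty_bucket(agents_passing: int) -> str:
    """Map a task's count of passing agents to its difficulty label."""
    if agents_passing not in BUCKETS:
        raise ValueError("agents_passing must be between 0 and 4")
    return BUCKETS[agents_passing]


def difficulty_distribution(agents_passing_per_task: Sequence[int]) -> Dict[str, float]:
    """Return the share of tasks that falls into each difficulty label."""
    counts = Counter(difficulty_bucket(n) for n in agents_passing_per_task)
    total = len(agents_passing_per_task)
    return {label: counts[label] / total for label in LABEL_ORDER}


# Synthetic per-task counts that reproduce the published bucket sizes.
sample = [0] * 400 + [1] * 350 + [2] * 150 + [3] * 100
shares = difficulty_distribution(sample)
print(shares)

# Tasks solved by at most one agent are unsolved by 3+ of the 4 agents.
unsolved_by_three_plus = shares["Hard"] + shares["Medium-Hard"]
print(f"{unsolved_by_three_plus:.0%}")
```

Under this reading, the Hard and Medium-Hard shares together account for the 75% "Unsolved by 3+ frontier agents" figure.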

Domain Distribution

Task allocation across 13 domains.

Web Frameworks 14%
Distributed Systems 12%
LLM/AI Systems 11%
Multi-Tenant SaaS 9%
Networking 8%
OSS Bug Fixes 8%
Financial/Quant 7%
DevOps/K8s 6%
Integration 6%
Security 5%
Data Structures 5%
Stream Processing 5%
Game Logic 4%

Pipeline Status

From sourcing to benchmark availability.

DONE: 1,000+ expert-authored coding tasks from senior engineers
DONE: 4 agents × 3 trials, verifier validation, verified solution/NOP baselines
NOW: Available now, with full results, pass@k, and trajectory analysis
+2–4 WEEKS: Domain-specific tasks built to your requirements
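The verified solution/NOP baselines mentioned in the pipeline can be read as a two-sided check on each task's verifier: it should fail when nothing is changed (NOP) and pass once the ground-truth solution is applied. The sketch below illustrates that idea under stated assumptions. `Workspace`, `check_baselines`, and the toy verifier are hypothetical names for illustration, not Harbor's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Workspace = Dict[str, str]  # hypothetical model: file path -> file contents


@dataclass(frozen=True)
class BaselineResult:
    nop_passes: bool
    solution_passes: bool

    @property
    def task_is_valid(self) -> bool:
        # A usable task fails untouched and passes with the reference fix.
        return self.solution_passes and not self.nop_passes


def check_baselines(
    initial: Workspace,
    apply_solution: Callable[[Workspace], Workspace],
    verifier: Callable[[Workspace], bool],
) -> BaselineResult:
    """Run the verifier on the untouched workspace and on the solved one."""
    nop_passes = verifier(dict(initial))
    solution_passes = verifier(apply_solution(dict(initial)))
    return BaselineResult(nop_passes, solution_passes)


# Toy task: the repository ships a buggy add() that the solution repairs.
initial_ws = {"calc.py": "def add(a, b):\n    return a - b\n"}


def toy_solution(ws: Workspace) -> Workspace:
    ws["calc.py"] = "def add(a, b):\n    return a + b\n"
    return ws


def toy_verifier(ws: Workspace) -> bool:
    scope: dict = {}
    exec(ws["calc.py"], scope)  # load the file under test
    return scope["add"](2, 3) == 5


result = check_baselines(initial_ws, toy_solution, toy_verifier)
print(result, result.task_is_valid)
```

A task whose verifier passes on the NOP baseline would reward agents that do nothing, and one that fails on the reference solution would be unsolvable, so either outcome flags the task for repair.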