Coding Benchmark

Overview

1,000+ expert-authored coding tasks spanning 13 domains. Every task is human-built and reviewed, and each is converted from a real open task in a real repository.

The dataset's mean solution length is 248 lines of code (LOC), 2.3× SWE-Bench Pro's 107. Pass@1 ranges from 25% to 44% across the four evaluated agents, and 40% of tasks were not solved by any agent, even though every task is human-built and reviewed for task quality.

Ships in Harbor, which lets teams run, score, and iterate on agents easily. Harbor is built by the creators of Terminal-Bench and used by virtually every frontier lab.

Mean Solution LOC

248

2.3× SWE-Bench Pro's 107

Empirically Difficult

75%

Unsolved by 3+ frontier agents

Verified Solutions

100%

Every task ships with a passing ground-truth solution

Real-World Examples

100%

Converted from real repo open tasks

Leaderboard

Ranked by pass@1. Cost and time are per-task averages.

Rank	Agent	Pass@1	Pass@3	Cost	Time
01	Codex GPT-4.5	44%	63%	$1.85	48 min
02	Terminus Gemini 1 Flash	41%	55%	$0.30	15 min
03	Gemini CLI Gemini 1 Flash	39%	45%	$0.72	35 min
04	Claude Code Haiku 4.5	25%	37%	$0.39	27 min
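The page does not say how pass@1 and pass@3 are computed from the three trials per agent. A minimal sketch, assuming the commonly used unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n trials with c passes); the function names and the example data are hypothetical and not taken from the benchmark:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n trials, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n trials contains no passing trial.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(passes_per_task: list[int], n: int, k: int) -> float:
    """Mean pass@k over tasks, given the number of passing trials per task."""
    return sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)


# Hypothetical results: four tasks, three trials each (illustrative only).
example_passes = [0, 1, 3, 0]
print(round(benchmark_pass_at_k(example_passes, n=3, k=1), 3))
print(round(benchmark_pass_at_k(example_passes, n=3, k=3), 3))
```

With three trials, pass@1 under this estimator is the mean fraction of passing trials, and pass@3 is the fraction of tasks solved in at least one trial, which is why pass@3 is never lower than pass@1 in the table.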

Task Type Split

Even split across benchmark styles.

Style	Share	Description
SWE-Bench Style	50%	Real bug fixes in open-source repos: LiteLLM, ADK, n8n, and more
Terminal-Bench Style	50%	Standalone systems challenges: distributed computing, networking, infrastructure

Difficulty Distribution

Breakdown by frontier agent solve rate.

Tier	Agents passing	Share	Tasks
Hard	0	40%	400
Medium-Hard	1	35%	350
Medium	2	15%	150
Easier	3–4	10%	100
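The tiers above follow from how many of the four leaderboard agents pass a task. A minimal sketch of that bucketing, which also reproduces the "75% unsolved by 3+ frontier agents" figure as the Hard and Medium-Hard shares combined. What counts as an agent "passing" a task (any of its three trials, or the first) is not stated on the page, so the function takes the per-task count as given; all names are hypothetical:

```python
TOTAL_AGENTS = 4  # agents listed on the leaderboard


def difficulty_tier(agents_passed: int) -> str:
    """Map the number of agents that pass a task to its difficulty tier."""
    if not 0 <= agents_passed <= TOTAL_AGENTS:
        raise ValueError(f"agents_passed must be between 0 and {TOTAL_AGENTS}")
    if agents_passed == 0:
        return "Hard"
    if agents_passed == 1:
        return "Medium-Hard"
    if agents_passed == 2:
        return "Medium"
    return "Easier"  # 3 or 4 agents pass


# Task counts per tier as reported in the breakdown.
tier_counts = {"Hard": 400, "Medium-Hard": 350, "Medium": 150, "Easier": 100}
total_tasks = sum(tier_counts.values())

# A task is unsolved by 3+ of 4 agents when at most one agent passes it.
unsolved_by_3_plus = (tier_counts["Hard"] + tier_counts["Medium-Hard"]) / total_tasks
print(f"{unsolved_by_3_plus:.0%}")  # 75%
```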

Domain Distribution

Task allocation across 13 domains.

Domain	Share
Web Frameworks	14%
Distributed Systems	12%
LLM/AI Systems	11%
Multi-Tenant SaaS	9%
Networking	8%
OSS Bug Fixes	8%
Financial/Quant	7%
DevOps/K8s	6%
Integration	6%
Security	5%
Data Structures	5%
Stream Processing	5%
Game Logic	4%

Pipeline Status

From sourcing to benchmark availability.

DONE

1,000+ expert-authored coding tasks from senior engineers

DONE

4 agents × 3 trials, verifier validation, verified solution/NOP baselines

NOW

Available now — full results, pass@k, trajectory analysis

+2–4 WEEKS

Domain-specific tasks built to your requirements
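The "verified solution/NOP baselines" step in the pipeline can be read as a sanity check on each task's verifier. A minimal sketch, assuming it means the ground-truth solution must pass the verifier while an unmodified workspace (the NOP, or no-op, baseline) must fail; the page does not define the check, and `validate_task` and `toy_verifier` are hypothetical names, not part of Harbor:

```python
from typing import Callable, List

# Hypothetical verifier type: takes the workspace contents, returns pass/fail.
Verifier = Callable[[str], bool]


def validate_task(verifier: Verifier, solution: str, nop: str = "") -> List[str]:
    """Return a list of problems found with a task's verifier (empty if none)."""
    problems = []
    if not verifier(solution):
        problems.append("ground-truth solution fails the verifier")
    if verifier(nop):
        problems.append("verifier passes on the NOP baseline")
    return problems


def toy_verifier(workspace: str) -> bool:
    """Stand-in verifier: passes only if the workspace contains the fix."""
    return "fix applied" in workspace


print(validate_task(toy_verifier, "fix applied"))      # []
print(validate_task(lambda _ws: True, "fix applied"))  # ['verifier passes on the NOP baseline']
```

A verifier that passes the NOP baseline would let an agent score without doing any work, so a check of this kind guards against trivially solvable tasks.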