Data Scientist Benchmark

Overview

Four frontier agents were evaluated across 500+ expert-authored tasks spanning eight data science domains, from statistical modeling and time series to NLP and geospatial analysis.

Tasks are calibrated to be hard: 48% target a pass rate below 15%, and each task carries an average of 18.4 verifiers that check granular output criteria rather than a single pass/fail result.

Claude Code leads at 38% pass@1, with the widest margin on multi-step statistical pipelines. All agents share a ceiling on strict float-precision and robustness scenarios.
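The "strict float-precision" ceiling can be pictured with a toy check. A minimal sketch, assuming a verifier that compares a numeric output against a reference within a tight absolute tolerance; `verify_float` and its tolerances are hypothetical illustrations, not the benchmark's actual verifier:

```python
import math

def verify_float(actual: float, expected: float, abs_tol: float = 1e-9) -> bool:
    """Hypothetical strict verifier: the agent's numeric output must match
    the reference within a tight absolute tolerance (no relative slack)."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)

# Binary floating point makes 0.1 + 0.2 differ from 0.3 by ~5.5e-17,
# so the result passes a 1e-9 tolerance but fails an extreme 1e-17 one.
print(verify_float(0.1 + 0.2, 0.3))                  # passes at 1e-9
print(verify_float(0.1 + 0.2, 0.3, abs_tol=1e-17))   # fails at 1e-17
```

Checks at this granularity explain why agents that produce a plausible pipeline can still fail: the verifier rejects numerically drifted results, not just wrong ones.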

Leaderboard

Ranked by pass@1. Cost and time are per-task averages.

Rank  Agent                       Pass@1  Pass@4  Cost   Time
01    Claude Code (Sonnet 4.5)    38%     75%     $1.20  42 min
02    Gemini CLI (Gemini 3 Pro)   29%     63%     $0.85  38 min
03    Codex (GPT-5)               22%     52%     $1.45  52 min
04    Terminus (Gemini 3 Pro)     12%     33%     $0.62  15 min

Pass@k

Unbiased pass@k estimator computed from n=4 trials per task; higher k means more attempts are allowed per task.

[Pass@k chart: pass rate (0-100%) vs. k = 1-4 for Claude Code, Gemini CLI, Codex, and Terminus]
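The unbiased pass@k estimator referenced above is the standard combinatorial one: given n trials of which c passed, it gives the probability that at least one of k samples drawn without replacement would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n trials (c of them passing) passes."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=4 trials, a task passed on 2 of 4 runs contributes
# pass@1 = 0.5 and pass@4 = 1.0; per-task values are then averaged.
print(pass_at_k(4, 2, 1), pass_at_k(4, 2, 4))
```

This explains why pass@4 roughly doubles pass@1 in the leaderboard: averaging the estimator over many tasks rewards agents that succeed on at least one of several attempts.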

Top Failure Modes

Most common reasons agents fail, aggregated across all trials.

Rank  Failure Mode              Severity  Occurrences
01    Memory Limit Exceeded     critical  ×14
02    Null Handling Error       critical  ×11
03    Timeout — KNN Imputation  warn      ×8
04    Wrong Output Schema       warn      ×6
05    Plot Not Rendered         info      ×4
Abundant Data — v2.0.4 · Ref: 22-09