Data Scientist Benchmark
Overview
Four frontier agents evaluated across 500+ expert-authored tasks spanning 8 data science domains, from statistical modeling and time series to NLP and geospatial analysis.
Tasks are calibrated to be hard: 48% target a pass rate below 15%, and each task has an average of 18.4 verifiers checking granular output criteria rather than a single pass/fail.
Claude Code leads at 38% pass@1, with its widest margin on multi-step statistical pipelines. All agents hit a shared ceiling on strict float-precision and robustness scenarios.
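The per-task verifier setup described above can be sketched as follows. This is a hypothetical illustration of granular verification; the field names, expected values, and tolerance are assumptions, not the benchmark's actual schema:

```python
# Hypothetical sketch of granular task verifiers: each entry checks one
# criterion of the agent's output instead of a single pass/fail.
# All field names and expected values below are illustrative assumptions.
def run_verifiers(result: dict) -> dict[str, bool]:
    return {
        "has_required_columns": {"id", "score"} <= set(result.get("columns", [])),
        "row_count_preserved": result.get("n_rows") == 1000,
        "no_null_scores": result.get("null_scores") == 0,
        # Numeric outputs are checked against a tolerance, not exact equality.
        "rmse_within_tolerance": abs(result.get("rmse", float("inf")) - 0.42) <= 1e-3,
    }

result = {"columns": ["id", "score"], "n_rows": 1000, "null_scores": 0, "rmse": 0.4201}
print(run_verifiers(result))  # every check True for this output
```

A task scores partial credit per satisfied check, which is how the benchmark distinguishes "almost right" pipelines from total failures.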
Leaderboard

Ranked by pass@1; cost and time are per-task averages.

| #  | Agent       | Model        | Pass@1 | Pass@4 | Cost  | Time   |
|----|-------------|--------------|--------|--------|-------|--------|
| 01 | Claude Code | Sonnet 4.5   | 38%    | 75%    | $1.20 | 42 min |
| 02 | Gemini CLI  | Gemini 3 Pro | 29%    | 63%    | $0.85 | 38 min |
| 03 | Codex       | GPT-5        | 22%    | 52%    | $1.45 | 52 min |
| 04 | Terminus    | Gemini 3 Pro | 12%    | 33%    | $0.62 | 15 min |
Pass@k

Unbiased estimator, n=4 trials; higher k = more attempts per task.

[Pass@k chart: Claude Code, Gemini CLI, Codex, Terminus]
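The unbiased estimator referenced above is the standard pass@k formula (Chen et al., 2021): with c correct completions out of n trials, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n trials with c successes, passes."""
    if n - c < k:
        # Fewer failures than draws: at least one success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=4 trials, an agent passing 2 of 4:
print(pass_at_k(4, 2, 1))  # 0.5
print(pass_at_k(4, 2, 4))  # 1.0
```

With only n=4 trials the estimate is coarse (pass@1 moves in 25% steps per task), which the leaderboard smooths out by averaging over 500+ tasks.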
Top Failure Modes

Most common reasons agents fail, aggregated across all trials.

| #  | Failure Mode             | Severity | Occurrences |
|----|--------------------------|----------|-------------|
| 01 | Memory Limit Exceeded    | critical | 14          |
| 02 | Null Handling Error      | critical | 11          |
| 03 | Timeout (KNN Imputation) | warn     | 8           |
| 04 | Wrong Output Schema      | warn     | 6           |
| 05 | Plot Not Rendered        | info     | 4           |