Data Scientist Benchmark

Overview

Four frontier agents were evaluated across 500+ expert-authored tasks spanning eight data science domains, from statistical modeling and time series to NLP and geospatial analysis.

Tasks are calibrated to be hard: 48% target a pass rate below 15%, and each task carries an average of 18.4 verifiers that check granular output criteria rather than a single pass/fail result.

Claude Code leads at 38% pass@1, with the widest margin on multi-step statistical pipelines. All agents share a ceiling on strict float-precision and robustness scenarios.
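The "strict float-precision" ceiling can be pictured with a toy check. A minimal sketch, assuming a verifier that compares a numeric output against a reference within a tight absolute tolerance; `verify_float` and its tolerances are hypothetical illustrations, not the benchmark's actual verifier:

```python
import math

def verify_float(actual: float, expected: float, abs_tol: float = 1e-9) -> bool:
    """Hypothetical strict verifier: the agent's numeric output must match
    the reference within a tight absolute tolerance (no relative slack)."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)

# Binary floating point makes 0.1 + 0.2 differ from 0.3 by ~5.5e-17,
# so the result passes a 1e-9 tolerance but fails an extreme 1e-17 one.
print(verify_float(0.1 + 0.2, 0.3))                  # passes at 1e-9
print(verify_float(0.1 + 0.2, 0.3, abs_tol=1e-17))   # fails at 1e-17
```

Checks at this granularity explain why agents that produce a plausible pipeline can still fail: the verifier rejects numerically drifted results, not just wrong ones.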

Leaderboard

Ranked by pass@1. Cost and time are per-task averages.

Rank  Agent                       Pass@1  Pass@4  Cost   Time
01    Claude Code (Sonnet 4.5)    38%     75%     $1.20  42 min
02    Gemini CLI (Gemini 3 Pro)   29%     63%     $0.85  38 min
03    Codex (GPT-5)               22%     52%     $1.45  52 min
04    Terminus (Gemini 3 Pro)     12%     33%     $0.62  15 min

Pass@k

Unbiased pass@k estimator computed from n=4 trials per task; higher k means more attempts are allowed per task.

[Pass@k chart: pass rate (0-100%) vs. k = 1-4 for Claude Code, Gemini CLI, Codex, and Terminus]
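The unbiased pass@k estimator referenced above is the standard combinatorial one: given n trials of which c passed, it gives the probability that at least one of k samples drawn without replacement would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n trials (c of them passing) passes."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=4 trials, a task passed on 2 of 4 runs contributes
# pass@1 = 0.5 and pass@4 = 1.0; per-task values are then averaged.
print(pass_at_k(4, 2, 1), pass_at_k(4, 2, 4))
```

This explains why pass@4 roughly doubles pass@1 in the leaderboard: averaging the estimator over many tasks rewards agents that succeed on at least one of several attempts.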

Top Failure Modes

Most common reasons agents fail, aggregated across all trials.

Rank  Failure Mode              Severity  Occurrences
01    Memory Limit Exceeded     critical  ×14
02    Null Handling Error       critical  ×11
03    Timeout — KNN Imputation  warn      ×8
04    Wrong Output Schema       warn      ×6
05    Plot Not Rendered         info      ×4
Abundant Data — v2.0.4 · Ref: 22-09