Abundant
View a sample task →

500+ multi-step data science tasks, each with a Docker environment, programmatic verifiers, and RL-ready training signal.
Trusted by 2 of the top 3 AI labs.

The problem
Most data science evals give you bad signal.
  • Tasks don't reflect real-world distributions
  • Environments aren't realistic or complex enough
  • Verifiers are over-specified, under-specified, or hackable
  • Result: Invalid RL training signal and evals you can't trust.
Our approach

We believe the unlock to safe, reliable intelligence is simulation — environments where agents plan, act, and face real consequences. We specialize in highly technical domains with long task horizons and complex tooling. Quality is built in through counterfactual simulation: we run dozens of adversarial trials per task to empirically measure and eliminate deficiencies, on top of multiple layers of manual expert review.

Our methodology
01
Source
High-touch, outbound-first hiring via human sourcing and interviewing by domain experts. Task creators are trained in-house over several weeks. No AI screeners and no crowdsourcing.
Hand-vetted · real industry experience
02
Build
First sample in under 24 hours. 1,000 custom RL environments in 4 weeks.
500+ tasks ready now · 6–28 day horizons for custom work
03
Validate
Human expert review + agent sandbox testing. Failures reflect model capability, not task ambiguity.
Agentic trace validation · 20–40% quality uplift
Our latest dataset
Solve time
2–10 hrs
Per task, skilled data scientist
Frontier pass rate
10–40%
pass@k difficulty
Task depth
5–80+
Requirements per task
Task domains
Data Science · Machine Learning · Data & Visualization · Scripting & Automation · Colab Maintenance · Research & Education · Custom on request
Ready to get started?
500+ verified tasks ready to deploy immediately.
Team & Trust
What our research partners say about us
"Abundant's format was the easiest for us to ingest. When we hit limitations with our training infrastructure, they quickly adapted."
Product Leadership, Google DeepMind
"Turned around high-quality data super fast — sometimes within 24 hours. Other vendors took months to finalize spec. Abundant did it in weeks."
Founder/CEO, Firecrawl
"We showed Abundant's annotation platform to other teams, and everyone immediately wanted access. What started as a pilot for my team became the standard across our research org."
Research Scientist, Adobe Research
Leadership
Jesse Hu
Co-Founder, ex-Waymo MLE
Ex-Waymo ML Engineer
Tech Lead for Data Quality & Evals
Co-author on Terminal Bench
Ke Huang
Co-Founder, Ops Lead
Google Assistant Eval Lead
GDM Trust & Safety
Brex Compliance & Ops Lead
Rishi Desai
Research Lead
Coding agents since 2023
Creator of SWE-Gen
Contributor to Terminal Bench
Meji Abidoye
Co-Founder, Quality and Infra Lead
Ex-AWS Infrastructure Lead
Early contributor to Terminal Bench and Harbor
Sample talent profiles — anonymized for privacy
Data Scientist
Enterprise Technology · 4+ years
  • Anomaly detection on 500k+ server log entries/day
  • Classification models for malicious vs. normal user behavior
  • Time-series forecasting for server load spike prediction
AI/ML Lead
Computer Vision & GenAI · 6+ years
  • Led end-to-end projects across NLP, vision, and generative AI
  • Biometric systems: face recognition, spoof detection, 3D reconstruction
  • Predictive models, anomaly detection, and high-compute algorithms in production
Data Scientist II
Global Payments · 5+ years
  • Fraud detection and transaction optimization on high-volume payment data
  • Explainable AI (SHAP) for root cause analysis at scale
  • End-to-end ML from feature engineering to production deployment
Data Scientist
Energy · MS Analytics
  • Foundation-model computer vision systems with VLM-powered inspection
  • ML-driven algorithmic energy trading systems
  • LLM-powered intelligence pipelines for executive decision-making
Sample Task Deep Dive
Marketing_Attribution_001 · Notebook · Marketing Analytics Environment
Clean data, compute saturation metrics, visualize ROI
Task prompt
Abridged from 277-line instruction
Build a notebook at /app/analysis.ipynb that analyzes marketing attribution collapse under channel saturation using data/daily_marketing_attribution.csv. Write all outputs to /app/results/.

Required cleaning (exact order): Remove test rows where is_test_row == 1. Parse date and snapshot_ts strictly; drop invalid rows. Drop rows where spend, impressions, clicks, conversions, or revenue violate validity constraints (e.g., clicks > impressions, conversions > clicks). Deduplicate by key (date, region, channel, segment, campaign_id) using a deterministic priority: latest snapshot, then higher revenue, then higher conversions.
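The deduplication priority above can be sketched in pandas. This is an illustrative sketch on three toy rows sharing one key; the real task runs on the full 10,000+ row CSV and applies the validity constraints first:

```python
import io

import pandas as pd

# Toy rows sharing one dedup key; column names follow the task prompt.
csv = io.StringIO(
    "date,region,channel,segment,campaign_id,snapshot_ts,revenue,conversions\n"
    "2024-03-01,NA,social,retail,c1,2024-03-01T01:00:00,100,5\n"
    "2024-03-01,NA,social,retail,c1,2024-03-02T01:00:00,90,7\n"
    "2024-03-01,NA,social,retail,c1,2024-03-02T01:00:00,120,6\n"
)
df = pd.read_csv(csv, parse_dates=["date", "snapshot_ts"])

key = ["date", "region", "channel", "segment", "campaign_id"]
# Deterministic priority: latest snapshot first, then higher revenue,
# then higher conversions; keep the first row per key.
deduped = (
    df.sort_values(["snapshot_ts", "revenue", "conversions"],
                   ascending=[False, False, False])
      .drop_duplicates(subset=key, keep="first")
)
print(deduped[["snapshot_ts", "revenue"]].to_dict("records"))
```

Sorting on all three tie-break columns before `drop_duplicates(keep="first")` is what makes the survivor deterministic regardless of input order.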

Required metrics (metrics.json): Compute 7 metric sections: cleaning_audit, saturation_breakpoints (first date each channel×region hits saturation), lag_response_surface (Pearson correlation of spend vs. future conversion rate at lags 0–3, using 7-day rolling mean), channel_collapse_index (pre- vs. post-saturation ROI delta), mix_adjusted_roi_gap (baseline channel mix from first 30 dates), counterfactual_unsaturated_uplift (35% conversion recovery per unit of over-saturation), and robustness_scenarios (base, exclude social, drop shock window 2024-03-10 to 2024-03-24).
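The lag-response computation can be sketched as follows. Synthetic data with a built-in 2-day spend-to-conversion delay stands in for the real series; the column names and response coefficient are illustrative, but the rolling window and correlation calls match the task's requirements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 120
# Illustrative daily series for one channel×region.
daily = pd.DataFrame({"spend": rng.gamma(5.0, 100.0, n)})
# Conversion rate responds to spend with a 2-day delay plus noise.
daily["conv_rate"] = (0.01 + 0.00001 * daily["spend"].shift(2).fillna(0)
                      + rng.normal(0, 0.0005, n))

# 7-day rolling means, full windows only.
spend7 = daily["spend"].rolling(window=7, min_periods=7).mean()
rate7 = daily["conv_rate"].rolling(window=7, min_periods=7).mean()

# Pearson correlation of smoothed spend vs. future conversion rate, lags 0–3.
lag_surface = {}
for lag in range(4):
    future = rate7.shift(-lag)
    mask = spend7.notna() & future.notna()
    lag_surface[lag] = float(np.corrcoef(spend7[mask], future[mask])[0, 1])
print(lag_surface)
```

On this synthetic series, the correlation peaks at lag 2, where the shifted rolling windows line up with the built-in delay.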

Required visualizations: Produce 4 publication-quality figures: a saturation ratio heatmap (channel-region × month, using sns.heatmap with pivot_table), a lag response surface (per-channel lines with markers), a pre- vs. post-saturation ROI frontier (scatter with diagonal y=x), and a daily ROI timeline by region (with axvspan for shock and rebound windows). Also produce insights.md, run_manifest.json, and plot_contract.json for deterministic plot-data verification.
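The first figure can be sketched with pandas and seaborn. The ratios below are synthetic and the channel-region grid is abbreviated; the real figure spans every channel-region pair and month in the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless, as inside the task container
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Illustrative long-format saturation ratios (channel-region × month).
rows = [
    {"channel_region": f"{ch}-{rg}", "month": m,
     "saturation_ratio": float(rng.uniform(0.2, 1.5))}
    for ch in ["search", "social"] for rg in ["NA", "EU"]
    for m in ["2024-01", "2024-02", "2024-03"]
]
df = pd.DataFrame(rows)

# Pivot to a channel-region × month grid, then render with annotations.
grid = df.pivot_table(index="channel_region", columns="month",
                      values="saturation_ratio")
fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(grid, annot=True, fmt=".2f", cmap="viridis", ax=ax)
ax.set_title("Saturation ratio by channel-region and month")
fig.tight_layout()
fig.savefig("saturation_heatmap.png", dpi=150)
```

The `pivot_table` reshape and `annot=True` annotations here are exactly what the rubric's structural verifiers look for in the submitted notebook.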
Docker container
Jupyter notebook + daily_marketing_attribution.csv (10,000+ rows, 3 regions, 4 channels, multiple segments). Saturation thresholds fixed per channel×region.
Python 3.11 · 2 CPUs / 4 GB RAM · 900s timeout
Installed packages
pandas numpy matplotlib seaborn jupyter nbconvert nbformat pytest
Results — verifier groups passed (4 groups)
Codex (GPT-5.2) · 4/4
Gemini CLI (3.1 Pro) · 4/4
Mini-SWE (3.1 Pro) · 3/4
Claude Code (Opus 4.6) · 2/4
NOP baseline (no-op) · 0/4
Agent trajectory — Claude Code (Opus 4.6)
37 steps · 6m 27s · PASS 2/4
#0 system init
#1 tool calls
#2 TodoWrite — plan
#3 Bash: ls /app/data/
#4 Bash: head -5 CSV
#5 Bash: wc -l
#6 Bash: python3 explore
#7 TodoWrite — explore ✓
#8 THINK
#9 Bash: mkdir
#10 Write FAIL
#11 Bash: ls
#12 Read: notebook
#13 THINK
#14 Write: 19-cell notebook
#15 tool calls
#16 Bash: nbconvert --execute
#17 success
#18 Bash: ls results/
#19 verify metrics
#20 verify manifest
#21 tool calls
#22 "uplift is 0.0 — suspicious"
#23 investigate
#27 check sat
#29 "no rows saturated"
#30 verify plots
#31 validation
#32 copy back
#33 Bash: cp
#34 TodoWrite ✓
#35 summary
#36 complete
#0 · 0:00
SYS
Session init
#2 · 0:06
Plan
4 tasks: Explore → Build notebook → Execute → Validate
#3 · 0:12
Bash
ls /app/data/
→ daily_marketing_attribution.csv
#6 · 0:30
Bash
Explore CSV
→ (15460, 12) · 3 regions · 4 channels · finds "bad-ts"
#10 · 0:55
Write
/app/analysis.ipynb
✗ "File has not been read yet"
#14 · 1:29
Write
Full 19-cell notebook — cleaning, metrics, visualizations, contracts
#16 · 2:15
Bash
jupyter nbconvert --execute
→ 10 output files generated
21 more steps — investigates suspicious metrics, validates outputs.
Request full trajectories & other agent runs →
Rubric — 27 automated verifiers
Evaluation criteria
01 · metrics.json has exactly 7 required top-level keys
02 · cleaning_audit values match oracle
03 · saturation_breakpoints match oracle
04 · lag_response_surface correlations match oracle
05 · channel_collapse_index ROI deltas match oracle
06 · mix_adjusted_roi_gap values match oracle
07 · counterfactual_unsaturated_uplift values match oracle
08 · robustness_scenarios (3 variants) match oracle
09 · Non-trivial rows removed during cleaning
10 · insights.md exists and is non-empty
11 · 4 PNG figures exist with minimum dimensions
12 · Figures have visual complexity (unique colors)
13 · plot_contract.json matches oracle exactly
14 · All outputs generated fresh during run
15 · run_manifest.json matches contract
16 · Notebook uses pivot_table for heatmap
17 · Notebook uses np.corrcoef for correlations
18 · Uses rolling(window=7, min_periods=7)
19 · Uses axvspan for window shading
20 · Uses sns.heatmap with annot=True
21 · Uses scatter for frontier plot
22 · Has diagonal y=x reference line
23 · No outsourced logic or preloaded results
24 · Heatmap has numeric annotations
25 · Frontier has scatter points + diagonal
26 · Timeline has per-region lines + spans
27 · Lag surface has per-channel lines
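As an illustration of how checks like these stay programmatic, verifier 01 reduces to a key-set comparison. This is a hedged sketch, not our production verifier; the function name and the toy files it writes are hypothetical:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {
    "cleaning_audit", "saturation_breakpoints", "lag_response_surface",
    "channel_collapse_index", "mix_adjusted_roi_gap",
    "counterfactual_unsaturated_uplift", "robustness_scenarios",
}

def verify_metrics_keys(path):
    """Verifier 01: metrics.json has exactly the 7 required top-level keys."""
    metrics = json.loads(Path(path).read_text())
    return set(metrics) == REQUIRED_KEYS

# Toy pass/fail demonstration with synthetic results files.
good = {k: {} for k in REQUIRED_KEYS}
Path("metrics.json").write_text(json.dumps(good))
print(verify_metrics_keys("metrics.json"))        # passes

bad = dict(good, extra_key={})
Path("metrics_bad.json").write_text(json.dumps(bad))
print(verify_metrics_keys("metrics_bad.json"))    # fails: extra key
```

Requiring set equality, not just a subset, is what catches both missing and extraneous sections.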
Dataset Breakdown · 29 tasks analyzed
pass@k: at least 1 of k attempts passes
[Chart: pass@k per agent from @1 to @5; labeled points at 62.1% and 55.2%]
pass^k: all k attempts pass
[Chart: pass^k per agent from ^1 to ^5; labeled points at 27.6%, 24.1%, 17.2%, and 3.4%]
Agents: Gemini CLI · Claude Code · Codex · Mini SWE
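Both aggregates are simple to compute from per-task attempt records. A minimal sketch, with illustrative task names and outcomes (k = 3 attempts per task):

```python
from statistics import mean

# Toy attempt records: per task, a list of pass/fail booleans.
attempts = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
}

def pass_at_k(records):
    """pass@k: fraction of tasks where at least 1 of the k attempts passed."""
    return mean(any(r) for r in records.values())

def pass_hat_k(records):
    """pass^k: fraction of tasks where all k attempts passed."""
    return mean(all(r) for r in records.values())

print(pass_at_k(attempts))   # 2 of 3 tasks have at least one pass
print(pass_hat_k(attempts))  # 1 of 3 tasks passes on every attempt
```

The gap between the two curves is the point: pass@k rises with retries, while pass^k measures reliability and falls as k grows.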
BY TOPIC
Optimization & OR · 21%
ML & Data Science · 14%
ML Interpretability · 10%
E-commerce Analytics · 10%
Finance & Audit · 10%
Computer Vision · 10%
Data Engineering · 7%
Graph & Network · 7%
Other · 10%
BY JOB FUNCTION
Data Scientist / ML Eng. · 41%
Optimization Engineer · 21%
Analytics Engineer · 17%
Computer Vision Eng. · 10%
Data / Platform Eng. · 10%
BY FAILURE MODE — Why frontier models fail
Domain Expertise · 24%
Multi-step Pipelines · 10%
Math & Algorithmic Reasoning · 21%
Instruction Following · 10%
Visual Understanding · 17%
Tool & Library Fluency · 10%
Debugging & ML Practices · 7%