Abundant
View a sample task →

500+ multi-step data science tasks, each with a Docker environment, programmatic verifiers, and RL-ready training signal.
Trusted by 2 of the top 3 AI labs.

The problem
Most data science evals give you bad signal.
  • Tasks don't reflect real-world distributions
  • Environments aren't realistic or complex enough
  • Verifiers are over-specified, under-specified, or hackable
  • Result: Invalid RL training signal and evals you can't trust.
Our approach

We believe the unlock to safe, reliable intelligence is simulation — environments where agents plan, act, and face real consequences. We specialize in highly technical domains with long task horizons and complex tooling. Quality is built in through counterfactual simulation: we run dozens of adversarial trials per task to empirically measure and eliminate deficiencies, on top of multiple layers of manual expert review.

Our methodology
01
Source
High-touch, outbound-first hiring via human sourcing and interviewing by domain experts. Task creators are trained in-house over several weeks. No AI screeners and no crowdsourcing.
Hand-vetted · real industry experience
02
Build
First sample in under 24 hours. 1,000 custom RL environments in 4 weeks.
500+ tasks ready now · 6–28 day horizons for custom work
03
Validate
Human expert review + agent sandbox testing. Failures reflect model capability, not task ambiguity.
Agentic trace validation · 20–40% quality uplift
Our latest dataset
Solve time
2–10 hrs
Per task, skilled data scientist
Frontier pass rate
10–40%
pass@k difficulty
Task depth
5–80+
Requirements per task
Task domains
Data Science · Machine Learning · Data & Visualization · Scripting & Automation · Colab Maintenance · Research & Education · Custom on request
Ready to get started?
500+ verified tasks ready to deploy immediately.
Team & Trust
What our research partners say about us
"Abundant's format was the easiest for us to ingest. When we hit limitations with our training infrastructure, they quickly adapted."
Product Leadership, Google DeepMind
"Turned around high-quality data super fast — sometimes within 24 hours. Other vendors took months to finalize spec. Abundant did it in weeks."
Founder/CEO, Firecrawl
"We showed Abundant's annotation platform to other teams, and everyone immediately wanted access. What started as a pilot for my team became the standard across our research org."
Research Scientist, Adobe Research
Leadership
Jesse Hu
Co-Founder, ex-Waymo MLE
Ex-Waymo ML Engineer
Tech Lead for Data Quality & Evals
Co-author on Terminal Bench
Ke Huang
Co-Founder, Ops Lead
Google Assistant Eval Lead
GDM Trust & Safety
Brex Compliance & Ops Lead
Rishi Desai
Research Lead
Coding agents since 2023
Creator of SWE-Gen
Contributor to Terminal Bench
Meji Abidoye
Co-Founder, Quality and Infra Lead
Ex-AWS Infrastructure Lead
Early contributor to Terminal Bench and Harbor
Sample talent profiles — anonymized for privacy
Data Scientist
Enterprise Technology · 4+ years
  • Anomaly detection on 500k+ server log entries/day
  • Classification models for malicious vs. normal user behavior
  • Time-series forecasting for server load spike prediction
AI/ML Lead
Computer Vision & GenAI · 6+ years
  • Led end-to-end projects across NLP, vision, and generative AI
  • Biometric systems: face recognition, spoof detection, 3D reconstruction
  • Predictive models, anomaly detection, and high-compute algorithms in production
Data Scientist II
Global Payments · 5+ years
  • Fraud detection and transaction optimization on high-volume payment data
  • Explainable AI (SHAP) for root cause analysis at scale
  • End-to-end ML from feature engineering to production deployment
Data Scientist
Energy · MS Analytics
  • Foundation-model computer vision systems with VLM-powered inspection
  • ML-driven algorithmic energy trading systems
  • LLM-powered intelligence pipelines for executive decision-making
Sample Task Deep Dive
Marketing_Attribution_001 · Notebook · Marketing Analytics Environment
Clean data, compute saturation metrics, visualize ROI
Task prompt
Abridged from 277-line instruction
Build a notebook at /app/analysis.ipynb that analyzes marketing attribution collapse under channel saturation using data/daily_marketing_attribution.csv. Write all outputs to /app/results/.

Required cleaning (exact order): Remove test rows where is_test_row == 1. Parse date and snapshot_ts strictly; drop invalid rows. Drop rows where spend, impressions, clicks, conversions, or revenue violate validity constraints (e.g., clicks > impressions, conversions > clicks). Deduplicate by key (date, region, channel, segment, campaign_id) using a deterministic priority: latest snapshot, then higher revenue, then higher conversions.
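The deduplication priority above can be sketched in pandas. This is an illustrative sketch on three toy rows sharing one key; the real task runs on the full 10,000+ row CSV and applies the validity constraints first:

```python
import io

import pandas as pd

# Toy rows sharing one dedup key; column names follow the task prompt.
csv = io.StringIO(
    "date,region,channel,segment,campaign_id,snapshot_ts,revenue,conversions\n"
    "2024-03-01,NA,social,retail,c1,2024-03-01T01:00:00,100,5\n"
    "2024-03-01,NA,social,retail,c1,2024-03-02T01:00:00,90,7\n"
    "2024-03-01,NA,social,retail,c1,2024-03-02T01:00:00,120,6\n"
)
df = pd.read_csv(csv, parse_dates=["date", "snapshot_ts"])

key = ["date", "region", "channel", "segment", "campaign_id"]
# Deterministic priority: latest snapshot first, then higher revenue,
# then higher conversions; keep the first row per key.
deduped = (
    df.sort_values(["snapshot_ts", "revenue", "conversions"],
                   ascending=[False, False, False])
      .drop_duplicates(subset=key, keep="first")
)
print(deduped[["snapshot_ts", "revenue"]].to_dict("records"))
```

Sorting on all three tie-break columns before `drop_duplicates(keep="first")` is what makes the survivor deterministic regardless of input order.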

Required metrics (metrics.json): Compute 7 metric sections: cleaning_audit, saturation_breakpoints (first date each channel×region hits saturation), lag_response_surface (Pearson correlation of spend vs. future conversion rate at lags 0–3, using 7-day rolling mean), channel_collapse_index (pre- vs. post-saturation ROI delta), mix_adjusted_roi_gap (baseline channel mix from first 30 dates), counterfactual_unsaturated_uplift (35% conversion recovery per unit of over-saturation), and robustness_scenarios (base, exclude social, drop shock window 2024-03-10 to 2024-03-24).
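The lag-response computation can be sketched as follows. Synthetic data with a built-in 2-day spend-to-conversion delay stands in for the real series; the column names and response coefficient are illustrative, but the rolling window and correlation calls match the task's requirements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 120
# Illustrative daily series for one channel×region.
daily = pd.DataFrame({"spend": rng.gamma(5.0, 100.0, n)})
# Conversion rate responds to spend with a 2-day delay plus noise.
daily["conv_rate"] = (0.01 + 0.00001 * daily["spend"].shift(2).fillna(0)
                      + rng.normal(0, 0.0005, n))

# 7-day rolling means, full windows only.
spend7 = daily["spend"].rolling(window=7, min_periods=7).mean()
rate7 = daily["conv_rate"].rolling(window=7, min_periods=7).mean()

# Pearson correlation of smoothed spend vs. future conversion rate, lags 0–3.
lag_surface = {}
for lag in range(4):
    future = rate7.shift(-lag)
    mask = spend7.notna() & future.notna()
    lag_surface[lag] = float(np.corrcoef(spend7[mask], future[mask])[0, 1])
print(lag_surface)
```

On this synthetic series, the correlation peaks at lag 2, where the shifted rolling windows line up with the built-in delay.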

Required visualizations: Produce 4 publication-quality figures: a saturation ratio heatmap (channel-region × month, using sns.heatmap with pivot_table), a lag response surface (per-channel lines with markers), a pre- vs. post-saturation ROI frontier (scatter with diagonal y=x), and a daily ROI timeline by region (with axvspan for shock and rebound windows). Also produce insights.md, run_manifest.json, and plot_contract.json for deterministic plot-data verification.
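The first figure can be sketched with pandas and seaborn. The ratios below are synthetic and the channel-region grid is abbreviated; the real figure spans every channel-region pair and month in the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless, as inside the task container
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Illustrative long-format saturation ratios (channel-region × month).
rows = [
    {"channel_region": f"{ch}-{rg}", "month": m,
     "saturation_ratio": float(rng.uniform(0.2, 1.5))}
    for ch in ["search", "social"] for rg in ["NA", "EU"]
    for m in ["2024-01", "2024-02", "2024-03"]
]
df = pd.DataFrame(rows)

# Pivot to a channel-region × month grid, then render with annotations.
grid = df.pivot_table(index="channel_region", columns="month",
                      values="saturation_ratio")
fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(grid, annot=True, fmt=".2f", cmap="viridis", ax=ax)
ax.set_title("Saturation ratio by channel-region and month")
fig.tight_layout()
fig.savefig("saturation_heatmap.png", dpi=150)
```

The `pivot_table` reshape and `annot=True` annotations here are exactly what the rubric's structural verifiers look for in the submitted notebook.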
Docker container
Jupyter notebook + daily_marketing_attribution.csv (10,000+ rows, 3 regions, 4 channels, multiple segments). Saturation thresholds fixed per channel×region.
Python 3.11 · 2 CPUs / 4 GB RAM · 900s timeout
Installed packages
pandas numpy matplotlib seaborn jupyter nbconvert nbformat pytest
Results — verifier groups passed (4 groups)
Codex (GPT-5.2) · 4/4
Gemini CLI (3.1 Pro) · 4/4
Mini-SWE (3.1 Pro) · 3/4
Claude Code (Opus 4.6) · 2/4
NOP baseline (no-op) · 0/4
Agent trajectory — Claude Code (Opus 4.6)
37 steps · 6m 27s · PASS 2/4
#0 system init
#1 tool calls
#2 TodoWrite — plan
#3 Bash: ls /app/data/
#4 Bash: head -5 CSV
#5 Bash: wc -l
#6 Bash: python3 explore
#7 TodoWrite — explore ✓
#8 THINK
#9 Bash: mkdir
#10 Write FAIL
#11 Bash: ls
#12 Read: notebook
#13 THINK
#14 Write: 19-cell notebook
#15 tool calls
#16 Bash: nbconvert --execute
#17 success
#18 Bash: ls results/
#19 verify metrics
#20 verify manifest
#21 tool calls
#22 "uplift is 0.0 — suspicious"
#23 investigate
#27 check sat
#29 "no rows saturated"
#30 verify plots
#31 validation
#32 copy back
#33 Bash: cp
#34 TodoWrite ✓
#35 summary
#36 complete
#0 · 0:00
SYS
Session init
#2 · 0:06
Plan
4 tasks: Explore → Build notebook → Execute → Validate
#3 · 0:12
Bash
ls /app/data/
→ daily_marketing_attribution.csv
#6 · 0:30
Bash
Explore CSV
→ (15460, 12) · 3 regions · 4 channels · finds "bad-ts"
#10 · 0:55
Write
/app/analysis.ipynb
✗ "File has not been read yet"
#14 · 1:29
Write
Full 19-cell notebook — cleaning, metrics, visualizations, contracts
#16 · 2:15
Bash
jupyter nbconvert --execute
→ 10 output files generated
21 more steps — investigates suspicious metrics, validates outputs.
Request full trajectories & other agent runs →
Rubric — 27 automated verifiers
Evaluation criteria
01 · metrics.json has exactly 7 required top-level keys
02 · cleaning_audit values match oracle
03 · saturation_breakpoints match oracle
04 · lag_response_surface correlations match oracle
05 · channel_collapse_index ROI deltas match oracle
06 · mix_adjusted_roi_gap values match oracle
07 · counterfactual_unsaturated_uplift values match oracle
08 · robustness_scenarios (3 variants) match oracle
09 · Non-trivial rows removed during cleaning
10 · insights.md exists and is non-empty
11 · 4 PNG figures exist with minimum dimensions
12 · Figures have visual complexity (unique colors)
13 · plot_contract.json matches oracle exactly
14 · All outputs generated fresh during run
15 · run_manifest.json matches contract
16 · Notebook uses pivot_table for heatmap
17 · Notebook uses np.corrcoef for correlations
18 · Uses rolling(window=7, min_periods=7)
19 · Uses axvspan for window shading
20 · Uses sns.heatmap with annot=True
21 · Uses scatter for frontier plot
22 · Has diagonal y=x reference line
23 · No outsourced logic or preloaded results
24 · Heatmap has numeric annotations
25 · Frontier has scatter points + diagonal
26 · Timeline has per-region lines + spans
27 · Lag surface has per-channel lines
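As an illustration of how checks like these stay programmatic, verifier 01 reduces to a key-set comparison. This is a hedged sketch, not our production verifier; the function name and the toy files it writes are hypothetical:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {
    "cleaning_audit", "saturation_breakpoints", "lag_response_surface",
    "channel_collapse_index", "mix_adjusted_roi_gap",
    "counterfactual_unsaturated_uplift", "robustness_scenarios",
}

def verify_metrics_keys(path):
    """Verifier 01: metrics.json has exactly the 7 required top-level keys."""
    metrics = json.loads(Path(path).read_text())
    return set(metrics) == REQUIRED_KEYS

# Toy pass/fail demonstration with synthetic results files.
good = {k: {} for k in REQUIRED_KEYS}
Path("metrics.json").write_text(json.dumps(good))
print(verify_metrics_keys("metrics.json"))        # passes

bad = dict(good, extra_key={})
Path("metrics_bad.json").write_text(json.dumps(bad))
print(verify_metrics_keys("metrics_bad.json"))    # fails: extra key
```

Requiring set equality, not just a subset, is what catches both missing and extraneous sections.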
Dataset Breakdown · 29 tasks analyzed
pass@k: at least 1 of k attempts passes
[Chart: pass@k per agent from @1 to @5; labeled points at 62.1% and 55.2%]
pass^k: all k attempts pass
[Chart: pass^k per agent from ^1 to ^5; labeled points at 27.6%, 24.1%, 17.2%, and 3.4%]
Agents: Gemini CLI · Claude Code · Codex · Mini SWE
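Both aggregates are simple to compute from per-task attempt records. A minimal sketch, with illustrative task names and outcomes (k = 3 attempts per task):

```python
from statistics import mean

# Toy attempt records: per task, a list of pass/fail booleans.
attempts = {
    "task_a": [True, False, True],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
}

def pass_at_k(records):
    """pass@k: fraction of tasks where at least 1 of the k attempts passed."""
    return mean(any(r) for r in records.values())

def pass_hat_k(records):
    """pass^k: fraction of tasks where all k attempts passed."""
    return mean(all(r) for r in records.values())

print(pass_at_k(attempts))   # 2 of 3 tasks have at least one pass
print(pass_hat_k(attempts))  # 1 of 3 tasks passes on every attempt
```

The gap between the two curves is the point: pass@k rises with retries, while pass^k measures reliability and falls as k grows.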
BY TOPIC
Optimization & OR · 21%
ML & Data Science · 14%
ML Interpretability · 10%
E-commerce Analytics · 10%
Finance & Audit · 10%
Computer Vision · 10%
Data Engineering · 7%
Graph & Network · 7%
Other · 10%
BY JOB FUNCTION
Data Scientist / ML Eng. · 41%
Optimization Engineer · 21%
Analytics Engineer · 17%
Computer Vision Eng. · 10%
Data / Platform Eng. · 10%
BY FAILURE MODE — Why frontier models fail
Domain Expertise · 24%
Multi-step Pipelines · 10%
Math & Algorithmic Reasoning · 21%
Instruction Following · 10%
Visual Understanding · 17%
Tool & Library Fluency · 10%
Debugging & ML Practices · 7%