Abundant
View a sample task →
500+ multi-step data science tasks, each with a Docker environment, programmatic verifiers, and RL-ready training signal.
Trusted by 2 of the top 3 AI labs.
The problem
Most data science evals give you bad signal.
- Tasks don't reflect real-world distributions
- Environments aren't realistic or complex enough
- Verifiers are over-specified, under-specified, or hackable
- Result: invalid RL training signal and evals you can't trust.
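To make the verifier failure modes concrete, here is a hypothetical sketch of a programmatic verifier for a single task. Every name here (the `results.json` output file, the `rmse` field, the 0.25 threshold) is an illustrative assumption, not an actual task spec:

```python
import json
from pathlib import Path

def verify(workdir: str) -> bool:
    """Hypothetical task verifier: did the agent write results.json
    with the expected schema and an error metric below threshold?"""
    out = Path(workdir) / "results.json"
    if not out.exists():
        return False
    try:
        data = json.loads(out.read_text())
    except json.JSONDecodeError:
        return False
    # Check both presence and value: an under-specified verifier would
    # pass on any file existing; an over-specified one would string-match
    # incidental formatting. Either gives bad signal.
    return isinstance(data.get("rmse"), float) and data["rmse"] < 0.25
```

A verifier in this shape is checkable in isolation: you can run it against deliberately wrong outputs and confirm it rejects them.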
Our approach
We believe the unlock to safe, reliable intelligence is simulation — environments where agents plan, act, and face real consequences. We specialize in highly technical domains with long task horizons and complex tooling. Quality is built in through counterfactual simulation: we run dozens of adversarial trials per task to empirically measure and eliminate deficiencies, on top of multiple layers of manual expert review.
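One way to picture the adversarial-trial idea: run agents that deliberately do not solve the task against a verifier, and flag any task where they still pass. This is a loose sketch of the concept only, with all function names hypothetical; it is not the actual pipeline:

```python
import random

def adversarial_pass_rate(verifier, cheat_agents, trials: int = 30) -> float:
    """Hypothetical sketch: fraction of trials in which a deliberately
    non-solving agent still passes the verifier. A nonzero rate flags
    a hackable or under-specified task."""
    passes = 0
    for _ in range(trials):
        agent = random.choice(cheat_agents)
        workdir = agent()  # cheat agent writes plausible-but-wrong output
        if verifier(workdir):
            passes += 1
    return passes / trials
```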
Our methodology
01
Source
High-touch, outbound-first hiring via human sourcing and interviewing by domain experts. Task creators are trained in-house over several weeks. No AI screeners and no crowdsourcing.
Hand-vetted · real industry experience
02
Build
First sample in under 24 hours. 1,000 custom RL environments in 4 weeks.
500+ tasks ready now · 6–28 day horizons for custom work
03
Validate
Human expert review + agent sandbox testing. Failures reflect model capability, not task ambiguity.
Agentic trace validation · 20–40% quality uplift
Our latest dataset
Solve time
2–10 hrs
Per task, skilled data scientist
Frontier pass rate
10–40%
pass@k difficulty
Task depth
5–80+
Requirements per task
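For reference, the frontier pass rate above is framed as pass@k difficulty. A standard unbiased estimator for pass@k, given n attempts of which c passed, is 1 − C(n−c, k)/C(n, k); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n attempts (c of them correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 1 correct attempt out of 2 gives pass@1 = 0.5.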
Task domains
Data Science
Machine Learning
Data & Visualization
Scripting & Automation
Colab Maintenance
Research & Education
Custom on request
Ready to get started?
500+ verified tasks ready to deploy immediately.