MLE & ML Research Dataset
This dataset will contain more than 200 long-horizon tasks spanning nine ML engineering and research domains, including paper reproduction, training optimization, GPU kernel development, and reinforcement learning.
Each task requires 1–48 hours of agent run time, iterative experimentation, and deep ML knowledge. All verifiers are deterministic, and none rely solely on LLM-as-judge.
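As a rough illustration of what a deterministic verifier can look like, the sketch below checks a metrics file against a fixed threshold. It is not the dataset's actual verifier: the function name `verify_submission`, the file name `metrics.json`, the `accuracy` key, and the 0.90 threshold are all assumptions made for the example.

```python
import json
import math
from pathlib import Path


def verify_submission(metrics_path, metric="accuracy", threshold=0.90):
    """Return (passed, reason) for a hypothetical metrics file.

    The file layout, metric name, and threshold are illustrative
    assumptions, not the dataset's real verifier contract.
    """
    path = Path(metrics_path)
    if not path.is_file():
        return False, "metrics file missing"
    try:
        metrics = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False, "metrics file is not valid JSON"
    if not isinstance(metrics, dict):
        return False, "metrics file must contain a JSON object"

    value = metrics.get(metric)
    # Reject missing values, booleans, strings, NaN, and infinities.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or not math.isfinite(value):
        return False, f"{metric} missing or not a finite number"

    # Same input always yields the same verdict: no sampling, no model calls.
    if value < threshold:
        return False, f"{metric} below threshold"
    return True, "ok"
```

Because the check reads only the submitted file and compares against a constant, rerunning it on the same output gives the same result, and a malformed file is reported as a format problem rather than a modeling failure.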
Infrastructure work is complete. Tasks run in GPU-enabled Docker containers, either locally or on remote compute through Modal.
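For the local case, a GPU-enabled container launch might be assembled as in the sketch below. The helper `docker_gpu_command`, the image name, the `/workspace` mount point, and the training command are hypothetical; only the `docker run` flags (`--rm`, `--gpus`, `-v`, `-w`) are standard Docker CLI options, and `--gpus` assumes the NVIDIA Container Toolkit is installed on the host. The Modal path is not shown.

```python
import shlex


def docker_gpu_command(image, command, workdir, gpus="all"):
    """Build an argument list for running a task in a GPU-enabled container.

    The image name, mount point, and command passed in are placeholders
    chosen by the caller; nothing here reflects the dataset's real setup.
    """
    return [
        "docker", "run",
        "--rm",                          # remove the container when it exits
        "--gpus", gpus,                  # expose host GPUs to the container
        "-v", f"{workdir}:/workspace",   # mount the task directory
        "-w", "/workspace",              # run the command from the mount
        image,
        *command,
    ]


if __name__ == "__main__":
    args = docker_gpu_command(
        image="example/ml-task:latest",  # hypothetical image
        command=["python", "train.py"],  # hypothetical entry point
        workdir="/tmp/task",
    )
    # Print the shell-quoted command instead of executing it.
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths or arguments contain spaces; the list can be passed directly to `subprocess.run` once the image and paths are real.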
Pipeline Status
Planned Domain Distribution
Achieve the top-1 score on a computer vision task, benchmarked against previous attempts, given GPUs and flexible DevOps resources for training and inference.
Execution Environment
Failure Modes
Early signals from benchmark evaluation.
Missing domain expertise or incorrect assumptions about the problem.
Valid logic but incorrect output format.
Correct approach, minor execution errors.
Agent terminates before completing all outputs.
Agents exhaust compute or context windows.