Overview

MLE & ML Research Dataset

This dataset will contain 200+ long-horizon tasks that span 9 ML engineering and research domains, including paper reproduction, training optimization, GPU kernel development, and reinforcement learning.

Each task requires 1–48 hours of agent run time, iterative experimentation, and deep ML knowledge. All verifiers are deterministic, and none rely solely on LLM-as-judge.
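As a rough illustration of what a deterministic verifier can look like, the sketch below checks a metrics file that the agent is assumed to write at the end of a run. The file layout, the `verify` function, and the metric and threshold arguments are hypothetical and not taken from the dataset; the point is only that the same outputs always yield the same pass/fail result, with no model in the loop.

```python
import json
import math
from pathlib import Path


def verify(metrics_path: Path, metric: str, threshold: float) -> bool:
    """Pass only if the reported metric is a finite number >= threshold.

    Hypothetical convention: the agent writes a flat JSON object such as
    {"accuracy": 0.91} to `metrics_path`.
    """
    try:
        metrics = json.loads(metrics_path.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # a missing or malformed file fails the task
    if not isinstance(metrics, dict):
        return False
    value = metrics.get(metric)
    # bool is a subclass of int, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    try:
        number = float(value)
    except OverflowError:
        return False  # integer too large to represent as a float
    return math.isfinite(number) and number >= threshold
```

A check like this can be rerun on the same artifacts at any time and will return the same verdict, which is what distinguishes it from a judge that samples from a model.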

Infrastructure work is complete. Tasks run in Docker containers with GPU support via local execution or remote compute on Modal.

Pipeline Status

DONE: 10+ benchmarks surveyed (MLE-Bench, Paper-Bench, MLAgentBench, etc.)

DONE: Terminal-Bench / Harbor spec with GPU integration. Docker and remote compute ready. Tiers: CPU-only, Single-GPU Local, Single-GPU Remote (Modal)

DONE: ML engineers and researchers recruited and onboarded

NOW: More expert-authored, long-horizon ML tasks

NEXT: Multi-agent evaluation across all tasks with xx trials each

Planned Domain Distribution

Deep Learning: 20%

Computer Vision: 20%

Classical ML: 15%

NLP: 10%

Reinforcement Learning: 10%

Paper Reproduction: 5%

Post-Training: 5%

GPU Optimization: 5%

Training Infra: 5%

Other: 5%
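To make the planned shares concrete, the sketch below converts the percentages into per-domain task counts. It assumes a total of exactly 200 tasks, which is only the stated lower bound ("200+"), and the `allocate` helper is illustrative. Rounding uses the largest-remainder method so that the counts always sum to the total.

```python
# Planned percentage share per domain, as listed above.
PLANNED_SHARE = {
    "Deep Learning": 20,
    "Computer Vision": 20,
    "Classical ML": 15,
    "NLP": 10,
    "Reinforcement Learning": 10,
    "Paper Reproduction": 5,
    "Post-Training": 5,
    "GPU Optimization": 5,
    "Training Infra": 5,
    "Other": 5,
}


def allocate(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Split `total` tasks by percentage using largest-remainder rounding."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in shares.items()}
    remainders = {name: total * pct % 100 for name, pct in shares.items()}
    leftover = total - sum(counts.values())
    # Hand out leftover tasks to the largest remainders; ties keep list order.
    for name in sorted(shares, key=lambda n: -remainders[n])[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    for domain, n in allocate(200, PLANNED_SHARE).items():
        print(f"{domain}: {n}")
```

With the assumed total of 200, every share divides evenly: Deep Learning and Computer Vision get 40 tasks each, Classical ML 30, NLP and Reinforcement Learning 20 each, and the remaining five categories 10 each.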

Example Task

Achieve the top-1 score on a computer vision task. The agent is given GPUs and flexible DevOps resources to run training and inference, and its result is benchmarked against previous attempts.

Execution Environment

Local Docker, Remote Modal, Terminal-Bench / Harbor

CPU-only, Single-GPU Local, Single-GPU Remote (Modal)
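For the two local tiers, the difference comes down to whether the container is given a GPU. The sketch below builds a `docker run` command line for each; it is not the project's actual launcher. The tier strings, the `docker_argv` helper, and the image name in the usage note are hypothetical, and the remote Modal tier is deliberately left out because its setup is not described here. The `--gpus` flag is a standard `docker run` option that requires the NVIDIA Container Toolkit on the host.

```python
def docker_argv(image: str, command: list[str], tier: str) -> list[str]:
    """Build a `docker run` argument list for a local execution tier.

    `tier` is a hypothetical label: "cpu-only" or "single-gpu-local".
    """
    argv = ["docker", "run", "--rm"]
    if tier == "single-gpu-local":
        # Expose only the first GPU to the container.
        argv += ["--gpus", "device=0"]
    elif tier != "cpu-only":
        raise ValueError(f"unsupported local tier: {tier!r}")
    return argv + [image, *command]
```

For example, `docker_argv("my-task-image", ["python", "train.py"], "single-gpu-local")` returns a list that can be passed to `subprocess.run`; `my-task-image` is a placeholder name.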

Failure Modes

Early signal from benchmark evaluation.

Wrong Approach

Missing domain expertise or incorrect assumptions about the problem.

Wrong Output Schema

Valid logic but incorrect output format.

Implementation Bug

Correct approach, minor execution errors.

Premature Stop

Agent terminates before completing all outputs.

Timeout / Resource Limit

Agents exhaust compute or context windows.
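When trial outcomes are labeled by hand, the five categories above can be kept as a fixed vocabulary so that counts stay comparable across runs. The sketch below is one possible encoding; the `FailureMode` enum and `tally` helper are illustrative names, and no real evaluation results are included.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class FailureMode(Enum):
    """The failure categories listed above, keyed by their display names."""

    WRONG_APPROACH = "Wrong Approach"
    WRONG_OUTPUT_SCHEMA = "Wrong Output Schema"
    IMPLEMENTATION_BUG = "Implementation Bug"
    PREMATURE_STOP = "Premature Stop"
    TIMEOUT_OR_RESOURCE_LIMIT = "Timeout / Resource Limit"


def tally(labels: Iterable[str]) -> dict[FailureMode, int]:
    """Count labels per failure mode; unknown labels raise ValueError."""
    counts = Counter(FailureMode(label) for label in labels)
    # Include zero counts so every category appears in the result.
    return {mode: counts.get(mode, 0) for mode in FailureMode}
```

Rejecting unknown labels keeps typos from silently creating new categories.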

Abundant Data · v0.1.0-alpha · Ref: MLE-01
