An ultra-long-horizon benchmark designed to evaluate coding agents on realistic, high-complexity software engineering tasks.
Join the RALPHBench GitHub repo and review the task guidelines.
Tasks follow the Harbor task framework (from the creators of Terminal-Bench) and include automated evaluation plus a full reference solution.
Submit your task as a PR. One approved task earns co-authorship on the NeurIPS 2026 paper.