A Task-Level Evaluation of AI Agents in Open-Source Projects
In this paper, we present a comparative study of five autonomous coding agents using the publicly released AIDev-pop dataset, which captures thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate agent performance along three objective, task-aware dimensions spanning the PR lifecycle: (1) PR acceptance rate, (2) review discussion volume, and (3) commit message quality. We find that Codex achieves consistently high PR acceptance rates across most task categories, while Copilot-generated PRs trigger the highest volume of both human and automated review discussion. In contrast, commit message quality varies independently of acceptance outcomes: Claude and Cursor produce higher proportions of high-quality commit messages across several task types, whereas Codex exhibits comparatively lower commit message quality despite its strong integration outcomes. Our findings inform tool selection, guide future agent development, and support more effective integration of autonomous coding agents into collaborative software engineering workflows.