Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests
The increasing adoption of AI coding agents has led to a growing number of agent-generated pull requests (PRs) being merged with little or no human intervention. While such agentic PRs promise productivity gains, their impact on code quality remains underexplored. Prior work has evaluated coding agents using benchmarks and controlled tasks, but large-scale evidence on post-merge quality issues in agentic PRs remains limited. In this study, we analyze 1,210 merged agent-generated bug-fix PRs from Python repositories in the AIDev dataset. Using SonarQube, we perform a differential analysis between base and merged commits to identify code quality issues newly introduced by PR changes. We examine issue frequency, density, severity, and rule-level prevalence across five agents. Our results show that apparent differences in raw issue counts across agents largely disappear after normalizing by code churn, indicating that higher issue counts are primarily driven by larger PRs. Across all agents, code smells dominate, particularly at critical and major severities, while bugs are less frequent but often severe. Security hotspots occur unevenly across agents, most notably for OpenAI Codex. Overall, our findings show that merge success does not reliably reflect post-merge code quality, highlighting the need for systematic, size-aware quality checks for agent-generated bug-fix PRs.
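
The sketch below illustrates the kind of differential, churn-normalized analysis the abstract describes: diffing the SonarQube issue sets of the base and merged commits and normalizing the newly introduced issues by lines changed. The issue key, the per-1,000-changed-lines normalization, and the input format are illustrative assumptions, not the study's exact procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    rule: str       # SonarQube rule id, e.g. "python:S1481"
    file: str       # path of the affected file
    severity: str   # e.g. "CRITICAL", "MAJOR"


def new_issues(base: set[Issue], merged: set[Issue]) -> set[Issue]:
    """Issues present after the merge but absent in the base commit."""
    return merged - base


def issue_density(introduced: set[Issue], churn: int) -> float:
    """Introduced issues per 1,000 changed lines (added + deleted)."""
    return 1000 * len(introduced) / churn if churn else 0.0


# Hypothetical example: a PR that changed 400 lines and introduced two issues.
base = {Issue("python:S1481", "app/util.py", "MINOR")}
merged = base | {
    Issue("python:S3776", "app/util.py", "CRITICAL"),
    Issue("python:S1135", "app/cli.py", "INFO"),
}
print(issue_density(new_issues(base, merged), churn=400))  # -> 5.0
```

Normalizing by churn in this way is what lets per-agent comparisons control for PR size, which is why raw issue-count differences across agents shrink once density is used.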