Consistent or Sensitive? Automated Code Revision Tools Against Semantics-Preserving Perturbations
Automated Code Revision (ACR) tools aim to reduce human effort in code revision by automatically generating revised code from reviewer feedback, and they have shown promising performance. Yet, to be reliable in real-world software development practice, they must generate consistent revisions for input code variants that express the same issue. This challenge is even more pronounced for AI-assisted ACR tools, whose probabilistic generation process can lead to nondeterministic and inconsistent revisions across semantically equivalent code variants. In this paper, we evaluate the consistency of five state-of-the-art transformer-based ACR tools against semantics-preserving perturbations: controlled code modifications that alter a program's structure or syntax without changing its behavior. We designed ten types of such perturbations and applied them to 2,032 Java methods from real-world GitHub projects, generating over 18K perturbed variants for evaluation. Each perturbation was applied one at a time in order to isolate its effect and to systematically assess how consistently models such as T5, LLaMA variants, ChatGPT, and DeepSeek generate correct revisions for semantically identical but syntactically altered code. Our findings show that these models' ability to generate correct revisions can drop by up to 45.3% when they are presented with semantically equivalent yet structurally modified code. The most severe drops in consistency occur when perturbations land near the specific lines of code referenced in the reviewer's feedback: the closer the perturbation is to this target region, the more likely an ACR tool is to fail to generate the correct revision.
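The abstract does not enumerate the ten perturbation types here; as a hypothetical illustration (not taken from the paper's artifact), a semantics-preserving perturbation might rename a local variable and rewrite a `for` loop as a `while` loop, leaving the method's behavior unchanged:

```java
public class PerturbationExample {
    // Original method as it might appear in a GitHub project.
    static int sumPositive(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > 0) {
                total += values[i];
            }
        }
        return total;
    }

    // Perturbed variant: identical behavior, altered structure and syntax.
    static int sumPositivePerturbed(int[] values) {
        int acc = 0;                   // local variable renamed
        int idx = 0;
        while (idx < values.length) {  // loop construct swapped
            if (values[idx] > 0) {
                acc += values[idx];
            }
            idx++;
        }
        return acc;
    }
}
```

A consistent ACR tool should produce the same (correct) revision for both variants, since they express the same issue.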
We further explored potential mitigation strategies, evaluating prompting heuristics intended to direct the model's attention to the relevant code, such as Chain-of-Thought prompting, repeating the referenced code region within the review comment, and embedding the feedback directly as inline code comments near that region. None of these strategies improved consistency.
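As a minimal sketch of the last heuristic, the reviewer's feedback could be spliced into the source as a line comment immediately above the referenced line before the code is passed to the ACR model. The method name, signature, and comment marker below are illustrative assumptions, not the paper's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InlineFeedbackPrompt {
    // Inserts the review comment as a line comment just above the
    // 1-indexed line the reviewer referenced (hypothetical helper).
    static String embedFeedback(String code, int targetLine, String feedback) {
        List<String> lines = new ArrayList<>(Arrays.asList(code.split("\n", -1)));
        // Clamp the insertion point so out-of-range references stay safe.
        int idx = Math.max(0, Math.min(targetLine - 1, lines.size()));
        lines.add(idx, "// REVIEW: " + feedback);
        return String.join("\n", lines);
    }
}
```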