MSR 2026
Mon 13 - Tue 14 April 2026 Rio de Janeiro, Brazil
co-located with ICSE 2026

This program is tentative and subject to change.

Mon 13 Apr 2026 11:20 - 11:30 at Oceania V - Session 1-A: AI Agents & Automation

Large language models are increasingly used for code generation and debugging, but their outputs can still contain bugs that originate from training data. Whether an LLM prefers correct code or a familiar incorrect version may be influenced by what it has been exposed to during training. We introduce an exposure-aware evaluation framework that quantifies how prior exposure to buggy versus fixed code influences a model's preference. Using the ManySStuBs4J benchmark, we apply Data Portraits for membership testing on the Stack-V2 corpus to estimate whether each buggy and fixed variant was seen during training. We then stratify examples by exposure and compare model preference using code completion as well as multiple likelihood-based scoring metrics. We find that most examples (67%) have neither variant in the training data, and when only one is present, fixes appear more frequently than bugs. In model generations, models reproduce buggy lines far more often than fixes, with bug-exposed examples amplifying this tendency and fix-exposed examples showing only marginal improvement. In likelihood scoring, minimum and maximum token-probability metrics consistently prefer the fixed code across all conditions, indicating a stable bias toward correct fixes. In contrast, metrics such as the Gini coefficient reverse preference when only the buggy variant was seen. Our results indicate that exposure can skew bug-fix evaluations and highlight the risk that LLMs may propagate memorised errors in practice.
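The abstract compares buggy and fixed variants under several likelihood-based metrics (minimum and maximum token probability, Gini coefficient). The paper's exact scoring procedure is not given here; the sketch below is a minimal illustration, assuming per-token probabilities for each variant are already available from a model, with hypothetical helper names (`gini`, `score`, `prefers_fix`) and made-up probability values.

```python
import math

def gini(probs):
    # Gini coefficient over a list of token probabilities:
    # 0 means perfectly uniform probabilities, values near 1 mean
    # probability mass concentrated on a few tokens.
    xs = sorted(probs)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

def score(token_probs):
    # A few likelihood-based summaries of one code variant.
    return {
        "mean_logprob": sum(math.log(p) for p in token_probs) / len(token_probs),
        "min_prob": min(token_probs),   # minimum token-probability metric
        "max_prob": max(token_probs),   # maximum token-probability metric
        "gini": gini(token_probs),
    }

def prefers_fix(buggy_probs, fixed_probs, metric):
    # True if the chosen metric scores the fixed variant higher
    # than the buggy one.
    return score(fixed_probs)[metric] > score(buggy_probs)[metric]

# Hypothetical per-token probabilities for the two variants
# (illustrative only, not taken from the paper).
buggy = [0.90, 0.80, 0.20, 0.95]
fixed = [0.90, 0.85, 0.60, 0.95]

print(prefers_fix(buggy, fixed, "min_prob"))  # → True
```

Under this sketch, a metric "prefers the fix" when it ranks the fixed variant's token probabilities above the buggy variant's; different summaries of the same probability vectors can disagree, which is the phenomenon the abstract reports for the Gini coefficient.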


Mon 13 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30
Session 1-A: AI Agents & Automation
Technical Papers / Industry Track / MSR Program at Oceania V
11:00
10m
Talk
Toward Linking Declined Proposals and Source Code: An Exploratory Study on the Go Repository
Technical Papers
Sota Nakashima Kyushu University, Masanari Kondo Kyushu University, Mahmoud Alfadel University of Calgary, Aly Ahmad University of Calgary, Toshihiro Nakae DENSO CORPORATION, Hidenori Matsuzaki DENSO CORPORATION, Yasutaka Kamei Kyushu University
Pre-print
11:10
10m
Talk
IntelliSA: An Intelligent Static Analyzer for IaC Security Smell Detection Using Symbolic Rules and Neural Inference
Technical Papers
Qiyue Mei The University of Melbourne, Michael Fu The University of Melbourne
Pre-print File Attached
11:20
10m
Talk
Model See, Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMs
Technical Papers
Ali Al-Kaswan Delft University of Technology, Netherlands, Claudio Spiess University of California, Davis, Prem Devanbu University of California at Davis, Arie van Deursen TU Delft, Maliheh Izadi Delft University of Technology
Pre-print
11:30
10m
Talk
A Match Made in Heaven? AI-driven Matching of Vulnerabilities and Security Unit Tests
Technical Papers
Emanuele Iannone Hamburg University of Technology, Quang-Cuong Bui Hamburg University of Technology, Riccardo Scandariato Hamburg University of Technology
Pre-print
11:40
10m
Talk
PhantomRun: Auto Repair of Compilation Errors in Embedded Open Source Software
Technical Papers
Han Fu, Sigrid Eldh Ericsson AB, Mälardalen University, Carleton University, Kristian Wiklund Ericsson AB, Andreas Ermedahl Ericsson AB; KTH Royal Institute of Technology, Philipp Haller KTH Royal Institute of Technology, Cyrille Artho KTH Royal Institute of Technology, Sweden
11:50
10m
Talk
Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation
Technical Papers
Sadif Ahmed Bangladesh University of Engineering and Technology, Md Nafiu Rahman Bangladesh University of Engineering and Technology, Zahin Wahab The University of British Columbia, Gias Uddin York University, Canada, Rifat Shahriyar Bangladesh University of Engineering and Technology Dhaka, Bangladesh
Pre-print
12:00
10m
Talk
From Logic to Toolchains: An Empirical Study of Bugs in the TypeScript Ecosystem
Technical Papers
TianYi Tang Simon Fraser University, Saba Alimadadi Simon Fraser University, Nick Sumner Simon Fraser University
Pre-print
12:10
10m
Talk
Are We All Using Agents Now? An Empirical Study of Core and Peripheral Developers’ Use of Coding Agents
Technical Papers
Shamse Tasnim Cynthia University of Saskatchewan, Joy Krishan Das University of Saskatchewan, Banani Roy University of Saskatchewan
12:20
5m
Talk
Context Engineering for AI Agents in Open-Source Software
Technical Papers
Seyedmoein Mohsenimofidi Heidelberg University, Matthias Galster University of Canterbury, Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University
Pre-print
12:25
5m
Talk
A Blueprint for Trustworthy Code Annotation at Scale: An LLM-Powered Pipeline for Industrial Software Analytics
Industry Track