MSR 2026
Mon 13 - Tue 14 April 2026 Rio de Janeiro, Brazil
co-located with ICSE 2026

In the rapidly evolving software development landscape, Python stands out for its simplicity, versatility, and extensive ecosystem. The security of Python packages, which serve as units of organization, reusability, and distribution, has become a pressing concern, as highlighted by the considerable number of vulnerability reports. Existing benchmarks either do not target Python-package vulnerabilities or face label accuracy issues stemming from non-security-related changes within patching commits. This paper addresses these gaps by introducing PyVul, the first comprehensive benchmark suite of Python-package vulnerabilities. PyVul includes 1,157 publicly reported, developer-verified vulnerabilities, annotated at both the commit level and the function level. To enhance labeling quality, we propose LLM-VDC, a generic vulnerability benchmark cleansing method that leverages the code semantic understanding capability of LLMs. LLM-VDC improves PyVul’s function-level label accuracy 2.0-fold and establishes PyVul as the most precise automatically collected vulnerability benchmark. Based on PyVul, we conduct the first empirical study to unveil the characteristics of Python-package vulnerabilities and the limitations of state-of-the-art detection tools. Our empirical analysis reveals that current rule-based vulnerability detectors suffer from mismatches between their assumptions and real-world security scenarios, as well as from limited support for high-order vulnerabilities, cross-language interactions, and Python’s unique language features. ML-based detectors, on the other hand, suffer from their inability to reach the necessary context. PyVul provides a solid foundation for advancing vulnerability research and tool development in this domain.