ML in a Box: Analyzing Containerization Practices in Open Source ML Projects
Containerization has become increasingly essential in the machine learning (ML) domain, providing reproducibility, portability, and environment consistency. While prior studies have analyzed Dockerfile structures and best practices, none have examined ML projects in depth to reveal how the iterative nature of ML workflows influences container footprint, build performance, and caching behavior. We present the first large-scale empirical study of 1,993 ML-related Dockerfiles, combining quantitative analysis of container roles in ML projects and build dynamics with a qualitative investigation of refactoring practices. Results show that containers serve distinct roles across training, inference, and infrastructure. Containers are typically large, averaging 10.27 GB, and require long build times averaging about 8.84 minutes. We find that 44.4% of commits trigger rebuilds, primarily due to context file changes (96.4%), with experimentation being the main motive behind the commits that initiate rebuilds. Despite partial cache reuse, 71% of rebuild work is wasted on redundant computation. From stable projects, we identify 7 recurring ML-specific Dockerfile refactoring patterns that improve build efficiency and reduce container footprint.
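To illustrate the kind of cache-sensitive refactoring the study concerns, consider a minimal sketch of a widely known Docker layer-ordering practice (the sketch below is illustrative only; it is not claimed to be one of the 7 patterns identified by the paper, and the file names `requirements.txt` and `train.py` are assumptions):

```dockerfile
# Illustrative sketch: copy the dependency manifest before the project
# sources, so that edits to source code (e.g., during ML experimentation)
# do not invalidate the cached layer that installs heavy ML dependencies.
FROM python:3.11-slim

WORKDIR /app

# Dependency layer: rebuilt only when requirements.txt changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source layer: frequent code changes invalidate only the layers below
# this COPY, leaving the expensive install layer cached.
COPY . .

CMD ["python", "train.py"]
```

Under this ordering, a commit that touches only training code reuses the cached dependency layer instead of reinstalling packages, which is one way the redundant rebuild work described above can be reduced.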