Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

5.3 Threats to Validity

The selected LLMs, programming language, datasets, and baseline approaches could threaten the validity of our results. To mitigate this threat, we adopt the most widely studied models (i.e., GPT and CodeLlama), the most popular language (i.e., Java), and the most popular dataset (i.e., Defects4J). We also employ state-of-the-art mutation testing approaches as baselines, including learning-based approaches (i.e., 𝜇Bert and LEAM) and rule-based approaches (i.e., PIT and Major).

Another validity threat is data leakage, i.e., the possibility that the data in Defects4J [37] is covered by the training sets of the studied LLMs. To mitigate this threat, we employ another dataset, ConDefects [82], which contains programs and faults created after the release of the LLMs we use and thus carries limited data-leakage risk. Additionally, to increase confidence in our results, we check whether the tools can introduce mutations that are syntactically exact matches of the studied faults. We hypothesize that if a tool has been tuned on specific fault instances, it would introduce at least one mutation that exactly matches the faults we investigate. On Defects4J, GPT, CodeLlama, Major, LEAM, and 𝜇Bert produce 282, 77, 67, 386, and 39 exact matches, respectively; on ConDefects, they produce 7, 9, 13, 8, and 1. These results indicate that on Defects4J, GPT and LEAM produce significantly more exact matches than the other approaches, Major produces a number similar to CodeLlama, and 𝜇Bert produces by far the fewest, suggesting minimal or no advantage from exact matches for these approaches (except GPT and LEAM) on Defects4J. More interestingly, on ConDefects, which none of the tools has seen, Major produces the most exact matches, indicating a minor influence of any data leakage on the reported results. Nevertheless, the LLMs we studied exhibit the same trend on the two datasets, with a Spearman coefficient of 0.943 and a Pearson correlation of 0.944 (both with 𝑝-value less than 0.05), indicating that their performance is similar on the two datasets.
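For illustration, the minimal sketch below shows how the two checks described above could be implemented: a whitespace-insensitive exact-match count between generated mutants and known faulty versions, and Spearman/Pearson correlations between per-tool performance on two datasets. The function names and the score values are illustrative assumptions, not the paper's actual analysis code or data.

```python
# Sketch only: exact-match counting and cross-dataset correlation,
# with made-up inputs (not the paper's scripts or measurements).
import re
from scipy.stats import spearmanr, pearsonr

def normalize(code: str) -> str:
    """Collapse whitespace so the comparison is purely syntactic."""
    return re.sub(r"\s+", " ", code).strip()

def count_exact_matches(mutants: list[str], faulty_versions: list[str]) -> int:
    """Count mutants whose normalized text equals a known faulty version."""
    faults = {normalize(f) for f in faulty_versions}
    return sum(normalize(m) in faults for m in mutants)

# Hypothetical per-tool performance values on the two datasets; the paper
# correlates tool performance across Defects4J and ConDefects, but the
# underlying metric values are not reproduced here.
defects4j_scores = [0.81, 0.62, 0.58, 0.74, 0.49]
condefects_scores = [0.78, 0.60, 0.55, 0.70, 0.47]

rho, rho_p = spearmanr(defects4j_scores, condefects_scores)
r, r_p = pearsonr(defects4j_scores, condefects_scores)
print(f"Spearman rho={rho:.3f} (p={rho_p:.3f}), Pearson r={r:.3f} (p={r_p:.3f})")
```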

The different experimental settings may also threaten the validity of our results. To address this threat, we thoroughly explore the impacts of prompts, context length, few-shot examples, and the number of mutations on the performance of LLMs. The results show that LLM performance is highly similar across these settings.

The subjective nature of human decisions when labeling equivalent mutations and non-compilation errors is another potential threat. To mitigate this threat, we follow a rigorous annotation process in which two co-authors independently annotated each mutation. The final Cohen's Kappa coefficient indicates a relatively high level of agreement between the two annotators.
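As a pointer for readers unfamiliar with the statistic, the short sketch below computes Cohen's Kappa for two annotators over the same items using the standard formula 𝜅 = (p_o − p_e) / (1 − p_e); the binary labels are made up for illustration and are not the study's annotation data.

```python
# Sketch only: Cohen's kappa for two annotators with illustrative labels.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # chance agreement implied by each annotator's label distribution
    p_e = sum(count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# hypothetical labels (1 = equivalent mutation, 0 = not equivalent)
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0]
print(f"Cohen's kappa: {cohens_kappa(annotator_a, annotator_b):.3f}")  # 0.750
```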

6 CONCLUSION

In this paper, we systematically investigate the performance of LLMs in generating mutations. We evaluate their utility from several aspects and find that LLMs have the advantage of generating diverse mutations that mimic the behavior of real bugs. We also analyze and discuss the limitations of LLMs and point out future directions. We advocate further research efforts to promote LLMs for mutation testing.

REFERENCES

[1] 2024. Anon. Repo of Kumo. https://anonymous.4open.science/r/kumo-01D1/ Accessed: June 1, 2024.

[2] 2024. Atcoder. https://atcoder.jp Accessed: June 1, 2024.

[3] 2024. Javalang Parser. https://pypi.org/project/javalang Accessed: June 1, 2024.

[4] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

[5] Paul Ammann, Márcio Eduardo Delamaro, and Jeff Offutt. 2014. Establishing Theoretical Minimal Sets of Mutants. In Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, March 31 2014-April 4, 2014, Cleveland, Ohio, USA. IEEE Computer Society, 21–30.

[6] James H Andrews, Lionel C Briand, and Yvan Labiche. 2005. Is mutation an appropriate tool for testing experiments?. In Proceedings of the 27th international conference on Software engineering. 402–411.

[7] Moritz Beller, Chu-Pan Wong, Johannes Bader, Andrew Scott, Mateusz Machalica, Satish Chandra, and Erik Meijer. 2021. What it would take to use mutation testing in industry—a study at facebook. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 268–277.

[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.

[9] Timothy A Budd and Dana Angluin. 1982. Two notions of correctness and their relation to testing. Acta informatica 18, 1 (1982), 31–45.

[10] Thierry Titcheu Chekam, Mike Papadakis, Tegawendé F. Bissyandé, Yves Le Traon, and Koushik Sen. 2020. Selecting fault revealing mutants. Empir. Softw. Eng. 25, 1 (2020), 434–487.

[11] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).

[12] Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. Pit: a practical mutation testing tool for java. In Proceedings of the 25th international symposium on software testing and analysis. 449–452.

[13] Muriel Daran and Pascale Thévenod-Fosse. 1996. Software error analysis: A real case study involving real faults and mutations. ACM SIGSOFT Software Engineering Notes 21, 3 (1996), 158–171.

[14] Sourav Deb, Kush Jain, Rijnard van Tonder, Claire Le Goues, and Alex Groce. 2024. Syntax Is All You Need: A Universal-Language Approach to Mutant Generation. (2024).

[15] Renzo Degiovanni and Mike Papadakis. 2022. 𝜇bert: Mutation testing using pre-trained language models. In 2022 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 160–169.

[16] Márcio Eduardo Delamaro, José Carlos Maldonado, and A Mathur. 1996. Proteum: a tool for the assessment of test adequacy for C programs, user's guide. In PCS, Vol. 96. 79–95.

[17] Richard A DeMillo, Richard J Lipton, and Frederick G Sayward. 1978. Hints on test data selection: Help for the practicing programmer. Computer 11, 4 (1978), 34–41.

[18] Richard A DeMillo, Richard J Lipton, and Frederick G Sayward. 1979. Program mutation: A new approach to program testing. Infotech State of the Art Report, Software Testing 2, 1979 (1979), 107–126.

[19] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2024. Large language models are edge-case generators: Crafting unusual programs for fuzzing deep learning libraries. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.

[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[21] João P Diniz, Chu-Pan Wong, Christian Kästner, and Eduardo Figueiredo. 2021. Dissecting strongly subsuming second-order mutants. In 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 171–181.

[22] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2023. Large Language Models for Software Engineering: Survey and Open Problems. In 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, Los Alamitos, CA, USA, 31–53.

[23] Aayush Garg, Milos Ojdanic, Renzo Degiovanni, Thierry Titcheu Chekam, Mike Papadakis, and Yves Le Traon. 2022. Cerebro: Static subsuming mutant selection. IEEE Transactions on Software Engineering 49, 1 (2022), 24–43.

[24] Milos Gligoric, Vilas Jagannath, Qingzhou Luo, and Darko Marinov. 2013. Efficient mutation testing of multithreaded code. Software Testing, Verification and Reliability 23, 5 (2013), 375–403.

[25] Alex Groce, Iftekhar Ahmed, Josselin Feist, Gustavo Grieco, Jiri Gesi, Mehran Meidani, and Qihong Chen. 2021. Evaluating and improving static analysis tools via differential mutation analysis. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 207–218.

[26] Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, and Xin Peng. 2024. Exploring the potential of chatgpt in automated code refinement: An empirical study. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.

[27] Richard G. Hamlet. 1977. Testing programs with the aid of a compiler. IEEE transactions on software engineering 4 (1977), 279–290.

[28] Farah Hariri, August Shi, Vimuth Fernando, Suleman Mahmood, and Darko Marinov. 2019. Comparing mutation testing at the levels of source code and compiler intermediate representation. In 2019 12th IEEE conference on software testing, validation and verification (ICST). IEEE, 114–124.

[29] Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.

[30] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232 (2023).

[31] Md Johirul Islam, Giang Nguyen, Rangeet Pan, and Hridesh Rajan. 2019. A comprehensive study on deep learning bug characteristics. In Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 510–520.

[32] Yue Jia and Mark Harman. 2008. MILU: A customizable, runtime-optimized higher order mutation testing tool for the full C language. In Testing: Academic & Industrial Conference-Practice and Research Techniques (taic part 2008). IEEE, 94–98.

[33] Yue Jia and Mark Harman. 2010. An analysis and survey of the development of mutation testing. IEEE transactions on software engineering 37, 5 (2010), 649–678.

[34] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 298–309.

[35] Nan Jiang, Thibaud Lutellier, Yiling Lou, Lin Tan, Dan Goldwasser, and Xiangyu Zhang. 2023. Knod: Domain knowledge distilled tree decoder for automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1251–1263.

[36] Matthieu Jimenez, Thiery Titcheu Checkam, Maxime Cordy, Mike Papadakis, Marinos Kintis, Yves Le Traon, and Mark Harman. 2018. Are mutants really natural? a study on how "naturalness" helps mutant selection. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. 1–10.

[37] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. 437–440.

[38] René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. 2014. Are mutants a valid substitute for real faults in software testing?. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 654–665.

[39] René Just, Franz Schweiggert, and Gregory M Kapfhammer. 2011. MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler. In ASE. 612–615.

[40] Samuel J Kaufman, Ryan Featherman, Justin Alvin, Bob Kurtz, Paul Ammann, and René Just. 2022. Prioritizing mutants to guide mutation testing. In Proceedings of the 44th International Conference on Software Engineering. 1743–1754.

[41] Ayaan M Kazerouni, James C Davis, Arinjoy Basak, Clifford A Shaffer, Francisco Servant, and Stephen H Edwards. 2021. Fast and accurate incremental feedback for students’ software tests using selective mutation analysis. Journal of Systems and Software 175 (2021), 110905.

[42] Ahmed Khanfir, Anil Koyuncu, Mike Papadakis, Maxime Cordy, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. 2023. IBiR: Bug-report-driven fault injection. ACM Transactions on Software Engineering and Methodology 32, 2 (2023), 1–31.

[43] Jinhan Kim, Juyoung Jeon, Shin Hong, and Shin Yoo. 2022. Predictive mutation analysis via the natural language channel in source code. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–27.

[44] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2011. Genprog: A generic method for automatic software repair. IEEE Transactions on Software Engineering 38, 1 (2011), 54–72.

[45] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).

[46] Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 14–26.

[47] Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56.

[48] Mario Linares-Vásquez, Gabriele Bavota, Michele Tufano, Kevin Moran, Massimiliano Di Penta, Christopher Vendome, Carlos Bernal-Cárdenas, and Denys Poshyvanyk. 2017. Enabling mutation testing for android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 233–244.

[49] Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Chen, and Shaowei Wang. 2024. LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE). IEEE Computer Society, 883–883.

[50] Seokhyeon Moon, Yunho Kim, Moonzoo Kim, and Shin Yoo. 2014. Ask the mutants: Mutating faulty programs for fault localization. In 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation. IEEE, 153–162.

[51] Manish Motwani and Yuriy Brun. 2023. Better automatic program repair by using bug reports and tests together. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1225–1237.

[52] Akbar Siami Namin and Sahitya Kakarla. 2011. The use of mutation in testing experiments and its sensitivity to external threats. In Proceedings of the 2011 International Symposium on Software Testing and Analysis. 342–352.

[53] A Jefferson Offutt. 1992. Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology (TOSEM) 1, 1 (1992), 5–20.

[54] A Jefferson Offutt and Roland H Untch. 2001. Mutation 2000: Uniting the orthogonal. Mutation testing for the new century (2001), 34–44.

[55] Milos Ojdanic, Aayush Garg, Ahmed Khanfir, Renzo Degiovanni, Mike Papadakis, and Yves Le Traon. 2023. Syntactic vs. semantic similarity of artificial and real faults in mutation testing studies. IEEE Transactions on Software Engineering (2023).

[56] Milos Ojdanic, Ahmed Khanfir, Aayush Garg, Renzo Degiovanni, Mike Papadakis, and Yves Le Traon. 2023. On Comparing Mutation Testing Tools through Learning-based Mutant Selection. In IEEE/ACM International Conference on Automation of Software Test, AST 2023, Melbourne, Australia, May 15-16, 2023. IEEE, 35–46.

[57] Mike Papadakis, Christopher Henard, Mark Harman, Yue Jia, and Yves Le Traon. 2016. Threats to the validity of mutation-based test assessment. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 354–365.

[58] Mike Papadakis, Yue Jia, Mark Harman, and Yves Le Traon. 2015. Trivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In ICSE. 936–946.

[59] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: an analysis and survey. In Advances in Computers. Vol. 112. Elsevier, 275–378.

[60] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: mutation-based fault localization. Software Testing, Verification and Reliability 25, 5-7 (2015), 605–628.

[61] Mike Papadakis, Donghwan Shin, Shin Yoo, and Doo-Hwan Bae. 2018. Are mutation scores correlated with real fault detection? a large scale empirical study on the relationship between mutants and real faults. In Proceedings of the 40th international conference on software engineering. 537–548.

[62] Jibesh Patra and Michael Pradel. 2021. Semantic bug seeding: a learning-based approach for creating realistic bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 906–918.

[63] Ruixiang Qian, Quanjun Zhang, Chunrong Fang, and Lihua Guo. 2022. Investigating coverage guided fuzzing with mutation testing. In Proceedings of the 13th Asia-Pacific Symposium on Internetware. 272–281.

[64] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).

[65] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).

[66] David Schuler and Andreas Zeller. 2009. Javalanche: efficient mutation testing for Java. In ESEC/FSE. 297–298.

[67] August Shi, Jonathan Bell, and Darko Marinov. 2019. Mitigating the effects of flaky tests on mutation testing. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 112–122.

[68] Donghwan Shin, Shin Yoo, Mike Papadakis, and Doo-Hwan Bae. 2019. Empirical evaluation of mutation-based test case prioritization techniques. Software Testing, Verification and Reliability 29, 1-2 (2019), e1695.

[69] Akbar Siami Namin, James H Andrews, and Duncan J Murdoch. 2008. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the 30th international conference on Software engineering. 351–360.

[70] Zhao Tian, Junjie Chen, Qihao Zhu, Junjie Yang, and Lingming Zhang. 2022. Learning to construct better mutation faults. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13.

[71] Frank Tip, Jonathan Bell, and Max Schäfer. 2024. LLMorpheus: Mutation Testing using Large Language Models. arXiv preprint arXiv:2404.09952 (2024).

[72] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).

[73] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. Learning how to mutate source code from bug-fixes. In 2019 IEEE International conference on software maintenance and evolution (ICSME). IEEE, 301–312.

[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

[75] Anthony J Viera, Joanne M Garrett, et al. 2005. Understanding interobserver agreement: the kappa statistic. Fam med 37, 5 (2005), 360–363.

[76] Bo Wang, Sirui Lu, Yingfei Xiong, and Feng Liu. 2021. Faster mutation analysis with fewer processes and smaller overheads. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 381–393.

[77] Bo Wang, Yingfei Xiong, Yangqingwei Shi, Lu Zhang, and Dan Hao. 2017. Faster mutation analysis via equivalence modulo states. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. 295–306.

[78] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).

[79] Xinyi Wang, Tongxuan Yu, Paolo Arcaini, Tao Yue, and Shaukat Ali. 2022. Mutation-based test generation for quantum programs with multi-objective search. In Proceedings of the Genetic and Evolutionary Computation Conference. 1345–1353.

[80] Ming Wen, Yepang Liu, Rongxin Wu, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. 2019. Exposing library API misuses via mutation analysis. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 866–877.

[81] Jiang Wu, Yan Lei, Zhuo Zhang, Xiankai Meng, Deheng Yang, Pan Li, Jiayu He, and Xiaoguang Mao. 2023. Mantra: Mutation Testing of Hardware Design Code Based on Real Bugs. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.

[82] Yonghao Wu, Zheng Li, Jie M Zhang, and Yong Liu. 2023. Condefects: A new dataset to address the data leakage concern for llm-based fault localization and program repair. arXiv preprint arXiv:2310.16253 (2023).

[83] Chunqiu Steven Xia and Lingming Zhang. 2022. Less training, more repairing please: revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971.

[84] Yuan-An Xiao, Chenyang Yang, Bo Wang, and Yingfei Xiong. 2023. ExpressAPR: Efficient patch validation for java automated program repair systems. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2038–2041.

[85] Yuan-An Xiao, Chenyang Yang, Bo Wang, and Yingfei Xiong. 2024. Accelerating patch validation for program repair with interception-based execution scheduling. IEEE Transactions on Software Engineering (2024).

[86] Jie Zhang, Ziyi Wang, Lingming Zhang, Dan Hao, Lei Zang, Shiyang Cheng, and Lu Zhang. 2016. Predictive mutation testing. In Proceedings of the 25th international symposium on software testing and analysis. 342–353.

[87] J Zhang, L Zhang, M Harman, D Hao, and Y Jia. 2018. Predictive Mutation Testing. IEEE Transactions on Software Engineering (2018).

[88] Lingming Zhang, Darko Marinov, and Sarfraz Khurshid. 2013. Faster mutation testing inspired by test prioritization and reduction. In Proceedings of the 2013 International Symposium on Software Testing and Analysis. 235–245.

[89] Peng Zhang, Yang Wang, Xutong Liu, Yanhui Li, Yibiao Yang, Ziyuan Wang, Xiaoyu Zhou, Lin Chen, and Yuming Zhou. 2022. Mutant reduction evaluation: what is there and what is missing? ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–46.

[90] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A syntax-guided edit decoder for neural program repair. In Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 341–353.

[91] Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. 2023. Tare: Type-aware neural program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1443–1455.

[92] Daming Zou, Jingjing Liang, Yingfei Xiong, Michael D Ernst, and Lu Zhang. 2019. An empirical study of fault localization families and their combinations. IEEE Transactions on Software Engineering 47, 2 (2019), 332–347.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.