Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

5.2 Implications

Based on our findings, we discuss the pros and cons of using LLMs for generating mutations and suggest potential improvements.

First, our study shows that existing LLMs have the potential to generate high-quality mutations for mutation testing at acceptable cost. The greatest advantage of LLMs is their ability to generate mutations that more closely mimic the behavior of real bugs. For example, Listing 1 shows two compilable mutations generated by GPT-3.5 on Chart-1, which are theoretically beyond the capabilities of existing approaches. In Mutation-1, the LLM infers that a variable leftBlock must exist because of the occurrence of rightBlock, and replaces the variable accordingly. This mutation is not killed by the bug-triggering test. Without natural-language understanding (i.e., grasping the correspondence between “right” and “left”), it would be impossible to create such a mutation. In Mutation-2, the LLM changes the else branch to else if, using an integer variable seriesCount from the context to synthesize the condition, demonstrating structural code changes that existing methods cannot achieve. This mutation also survives and can therefore guide test enhancement.
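Since Listing 1 is not reproduced in this excerpt, the sketch below conveys the flavor of the two mutations. The identifiers leftBlock, rightBlock, and seriesCount mirror the discussion above, but the surrounding class and methods are invented for illustration and are not the actual Chart-1 code.

```java
// Illustrative sketch only -- not the actual Chart-1 code from Listing 1.
class MutationExamples {
    int leftBlock = 3, rightBlock = 5, seriesCount = 2;

    // Original behavior.
    int original(boolean visible) {
        int width = rightBlock;   // uses rightBlock
        if (visible) return width;
        else return 0;            // plain else branch
    }

    // Mutation-1: swap rightBlock for leftBlock, guided by the
    // natural-language correspondence between "right" and "left".
    int mutation1(boolean visible) {
        int width = leftBlock;    // still compiles, subtly wrong
        if (visible) return width;
        else return 0;
    }

    // Mutation-2: turn the else branch into an else-if whose condition
    // is synthesized from an integer variable found in the context.
    int mutation2(boolean visible) {
        int width = rightBlock;
        if (visible) return width;
        else if (seriesCount > 0) return 0; // structural change
        return -1;
    }
}
```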

Second, our study explores how different prompts, models, contexts, and few-shot examples affect mutation generation performance, highlighting the importance of well-designed prompts and of models with stronger coding capabilities. In particular, prompts should provide essential context (such as the whole enclosing method), but including too much information (such as the target unit test) may reduce effectiveness.
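To make this concrete, the following is a hypothetical sketch of a context-rich prompt in the spirit of this finding; it is not the paper's actual template (those are described in Section 3.3), and the helper buildPrompt and its parameters are invented for illustration.

```java
// Hypothetical prompt-construction sketch, not the paper's actual template.
// It includes the whole enclosing method as context but deliberately omits
// the target unit test, per the finding above.
class PromptSketch {
    static String buildPrompt(String methodSource, String targetStatement) {
        return String.join("\n",
            "You are a mutation testing assistant for Java.",
            "Generate mutants of the statement marked below.",
            "Each mutant must compile and change program behavior.",
            "",
            "Enclosing method (full context):",
            methodSource,
            "",
            "Statement to mutate:",
            targetStatement);
        // The target unit test is intentionally NOT appended: our results
        // suggest that such extra information can reduce effectiveness.
    }
}
```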

Third, although LLMs achieve promising results in mutation testing, our study indicates that there is still significant room for improvement. In particular, current LLMs generate a substantial number of non-compilable mutations. We therefore analyze the error types that make mutations non-compilable and identify some of their root causes. We advocate for more research effort on program analysis and repair approaches that target error-prone code elements for mutation, such as method invocations.
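To illustrate this failure mode, the hypothetical fragment below shows a typical non-compilable mutation involving a method invocation. The class and identifiers are invented, but the error pattern (calling a plausible-sounding method that does not exist on the type) matches the kind of root cause discussed above.

```java
// Hypothetical example of a common non-compilable mutation: the LLM
// invents a plausible-sounding method that the type does not define.
class NonCompilableExample {
    java.util.List<String> items = new java.util.ArrayList<>();

    int original() {
        return items.size();      // valid java.util.List API call
    }

    // Illustrative LLM-generated mutant: "sizeOf()" reads naturally but is
    // not defined on java.util.List, so this would not compile.
    // (Commented out so this sketch itself compiles.)
    // int mutant() {
    //     return items.sizeOf(); // error: cannot find symbol
    // }
}
```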

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.