Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China;

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China;

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China;

(4) Mike Papadakis, University of Luxembourg, Luxembourg;

(5) Jie M. Zhang, King’s College London, London, UK.

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

3.3 Mutation Generation via LLMs

Given a location in a Java program, we extract essential information (e.g., the surrounding method or the corresponding unit tests) to formulate prompts that instruct mutation generation, and then feed the prompts to the selected LLM. Once the LLM returns its responses, we filter out mutations that are non-compilable, redundant, or identical to the original code, and return the remaining mutations. As described in Section 3.4, our study supports a comprehensive set of metrics for evaluating mutations.
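For concreteness, the following Python sketch illustrates this generation-and-filtering flow. The helper names (extract_context, build_prompt, query_llm, compiles, location.apply) are hypothetical placeholders for illustration only, not the authors' implementation.

```python
# A minimal sketch of the generation-and-filtering step described above,
# assuming hypothetical helpers extract_context, build_prompt, and query_llm,
# plus a compiles(mutated_source) -> bool check.

def generate_mutations(location, llm, num_mutants, compiles):
    context = extract_context(location)          # surrounding method, unit tests, ...
    prompt = build_prompt(location, context, num_mutants)
    candidates = query_llm(llm, prompt)          # mutated code elements parsed from the JSON reply

    kept, seen = [], set()
    for mutant in candidates:
        if mutant == location.code_element:      # identical to the original code
            continue
        if mutant in seen:                       # redundant (duplicate) mutation
            continue
        if not compiles(location.apply(mutant)): # non-compilable mutation
            continue
        seen.add(mutant)
        kept.append(mutant)
    return kept
```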

3.3.1 Models. We aim to comprehensively compare existing LLMs in their ability to generate mutations from various perspectives, including evaluating models fine-tuned on different base models and contrasting commercial closed-source models with open-source alternatives. In our study, we select LLMs based on the following criteria: (1) the LLM must have coding capabilities, i.e., the ability to understand, generate, and transform code; (2) the LLM must support interaction in chat mode, allowing for conversational code transformation; and (3) the LLM must offer accessible APIs for integration.

Finally, we select the following models: GPT-3.5-Turbo [4], GPT-4-Turbo [4], CodeLlama-13b-Instruct [64], and StarChat-𝛽-16b [45]. The details of these models are shown in Table 2. These LLMs cover both open-source (i.e., StarChat and CodeLlama) and closed-source (i.e., GPT-3.5-Turbo and GPT-4-Turbo) models, as well as various kinds of base models.

3.3.2 Prompts. Prompts are crucial for LLMs because they guide the models’ responses and determine the direction of the generated output. A well-crafted prompt can significantly influence the utility, relevance, and accuracy of the code produced by an LLM. This subsection introduces how we design the prompts.

To design effective prompts for mutation generation, we follow best practices suggesting that prompts should comprise four parts, namely Instruction, Context, Input Data, and Output Indicator [26, 49]. In the Instruction, we direct the LLMs to generate mutants for the target code element. In the Context, we clearly state that a mutant refers to the concept from mutation testing, and we additionally provide various sources of information to observe their impact on LLM performance, including the whole Java method surrounding the target code element and few-shot examples sampled from real-world bugs in another benchmark. In the Input Data, we specify the code element to be mutated and the number of mutants to generate. In the Output Indicator, we specify the JSON format for mutation outputs, facilitating further experimentation. Figure 1 shows the default prompt template in our study.
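As an illustration, the sketch below assembles a prompt from these four parts. The wording and field names are placeholders assumed for this example; the authors' actual default template is the one shown in Figure 1.

```python
# A simplified illustration of the four-part prompt structure
# (Instruction, Context, Input Data, Output Indicator).
# The exact phrasing here is a placeholder, not the paper's template.

PROMPT_TEMPLATE = """\
# Instruction
Generate mutants for the target code element below. A mutant is a small
syntactic change to the code, as used in mutation testing.

# Context
Surrounding Java method:
{method_context}

Few-shot examples of real-bug-style edits:
{few_shot_examples}

# Input Data
Code element to mutate: {code_element}
Number of mutants to generate: {num_mutants}

# Output Indicator
Return the mutants as a JSON list, one entry per mutant.
"""

def build_prompt(method_context, few_shot_examples, code_element, num_mutants):
    return PROMPT_TEMPLATE.format(
        method_context=method_context,
        few_shot_examples=few_shot_examples,
        code_element=code_element,
        num_mutants=num_mutants,
    )
```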

In addition, our prompts include few-shot examples, which should originate from real bugs so that the LLMs can learn how real bugs mutate code. This allows the model to understand the context and logic behind the changes, improving its ability to generate relevant and effective mutations.

To prevent the few-shot examples from leaking information from our evaluation datasets, Defects4J and ConDefects, we draw them from another benchmark, QuixBugs [47], which comprises 40 real-world Java bugs and is commonly used by the testing and debugging community [35, 46, 83, 90, 91]. Following an existing study on few-shot examples [19], we sample 6 bugs from QuixBugs.

To guarantee the diversity of the examples, we randomly select one bug from the dataset and check whether its modification pattern is similar to the examples already collected. If it is different, we add it to our collection; we repeat this process until we have collected all 6 examples. Table 3 shows these examples.
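The following sketch illustrates this diversity-driven sampling loop. Here, modification_pattern is a hypothetical stand-in for however a bug's edit pattern is characterized, and the fixed seed is only for reproducibility of the illustration; neither is prescribed by the paper.

```python
import random

# A sketch of the diversity-driven sampling of few-shot examples:
# keep drawing bugs at random and accept one only if its modification
# pattern differs from every example collected so far.

def sample_examples(quixbugs_bugs, modification_pattern, k=6, seed=0):
    rng = random.Random(seed)
    pool = list(quixbugs_bugs)
    examples = []
    while len(examples) < k and pool:
        bug = rng.choice(pool)
        pool.remove(bug)
        if all(modification_pattern(bug) != modification_pattern(e) for e in examples):
            examples.append(bug)
    return examples
```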

This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.