Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

3.4 Evaluation Metrics

To assess the utility of LLM-generated mutations, we define a set of metrics that evaluate them comprehensively. We designed several categories of metrics capturing different aspects, as follows: Cost Metrics, Usability Metrics, and Behavior Metrics. Besides these metrics, we additionally report the Mutation Count and the Mutation Score (the proportion of mutants killed by the test suite). Note that neither of these directly reflects the quality of the generated mutations. For example, a higher mutation score does not necessarily indicate higher quality; it may instead be caused by generating a large proportion of mutants that are easily killed.

To better understand the metrics, we first illustrate the categories of mutations. Let all generated mutations be the set 𝐴. LLMs may generate syntactically incorrect mutations that cannot pass compilation, so we denote the compilable mutations as the set 𝐢. Note that the set 𝐢 contains two types of mutations that should be removed: the useless mutations (denoted as π‘ˆ) and the equivalent mutations (denoted as 𝐸). The useless mutations are those that are identical to the original code or to other mutations. In mutation testing, the ideal set of mutations to employ is (𝐢 βˆ’ 𝐸 βˆ’ π‘ˆ), which excludes equivalent mutations because they contribute nothing to discovering weaknesses in tests. However, determining the equivalent mutation set 𝐸 is undecidable, so we follow common practice and use the set (𝐢 βˆ’ π‘ˆ) as a practical approximation when performing mutation testing [21, 70, 89].
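To make these categories concrete, consider the following illustrative sketch (our own hypothetical example, not drawn from the dataset), which shows one mutation of each kind for the same original code:

```java
class MutationCategories {
    // Original method under mutation.
    int original(int price, int quantity)   { return price * quantity; }

    // Useless mutation (in U): textually identical to the original, so it can
    // never be killed and only inflates the mutant count.
    int useless(int price, int quantity)    { return price * quantity; }

    // Equivalent mutation (in E): syntactically different but semantically
    // identical (multiplication is commutative), so no test can kill it.
    int equivalent(int price, int quantity) { return quantity * price; }

    // Useful mutation (in C - E - U): compiles and changes behavior, so an
    // adequate test suite should kill it.
    int useful(int price, int quantity)     { return price + quantity; }
}
```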

3.4.1 Cost Metrics. The Cost Metrics capture the time and monetary costs of mutation generation and comprise the following two metrics.

Average Generation Time: the average duration needed to produce each mutation, measuring the efficiency of an approach.

Cost per 1K Mutations: the average monetary cost, in US dollars, of generating 1,000 mutations. For GPT-3.5 and GPT-4, we calculate the average cost from API usage fees, while for the open-source models, we use the cost of renting cloud servers.

3.4.2 Usability Metrics. The Usability Metrics serve as key indicators of the practical utility of generated mutations.

Compilability Rate: the proportion of mutations that compile successfully, computed as |𝐢|/|𝐴|. Because LLMs cannot guarantee generating entirely correct code and the compilation process can be time-consuming, we prefer approaches with a higher compilability rate.

Useless Mutation Rate: the proportion of useless mutations, computed as |π‘ˆ|/|𝐴|. LLMs sometimes generate duplicate mutations or mutations that are exactly the same as the original code. While traditional approaches rarely generate useless mutations, LLMs suffer from misunderstanding and repetition issues that lead them to generate such mutations.
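A minimal sketch of how useless mutations can be filtered in practice (the whitespace normalization is our assumption; real tooling might instead compare token streams or ASTs):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Drop mutations identical to the original code or to a previously seen
// mutation, keeping only the first occurrence of each distinct mutant.
public class UselessFilter {
    static String normalize(String code) {
        return code.replaceAll("\\s+", " ").trim(); // ignore whitespace-only differences
    }

    static List<String> filterUseless(String original, List<String> mutations) {
        Set<String> seen = new HashSet<>();
        seen.add(normalize(original));
        List<String> kept = new ArrayList<>();
        for (String m : mutations) {
            if (seen.add(normalize(m))) { // add() returns false for duplicates
                kept.add(m);
            }
        }
        return kept;
    }
}
```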

Equivalent Mutation Rate: the proportion of equivalent mutations, computed as |𝐸|/|𝐴|. Equivalent mutations are semantically equivalent to the original program, which distorts the accuracy of the mutation score. Because identifying whether a mutation is equivalent is undecidable, we sample a statistically significant portion of all mutations and judge them manually. More specifically, treating the set 𝐴 as the population, we set the confidence level to 95% and the margin of error to 5% to obtain a subset of mutations for manual inspection.
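For reference, the required sample size for these parameters follows Cochran's formula; the sketch below (our illustration, not the paper's tooling) computes it and yields the familiar value of roughly 384 for large populations:

```java
// Cochran's sample-size formula with finite-population correction, assuming
// p = 0.5 (the most conservative choice), z = 1.96 (95% confidence), e = 0.05.
public class SampleSize {
    static int requiredSample(int population) {
        double z = 1.96, p = 0.5, e = 0.05;
        double n0 = (z * z * p * (1 - p)) / (e * e);   // ~384.16 for an infinite population
        double n = n0 / (1 + (n0 - 1) / population);   // finite-population correction
        return (int) Math.round(n);
    }

    public static void main(String[] args) {
        System.out.println(requiredSample(1_000_000)); // prints 384
    }
}
```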

AST Node Diversity: We prefer sets of mutations with diverse forms. Therefore, we parse the code before and after mutation and examine the diversity of the newly introduced AST node types. An approach that introduces a greater variety of AST node types is considered superior.
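As a rough illustration of how such a check could work (using the JavaParser library; the operator-tagging detail is our assumption, since the paper does not specify its parsing setup):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.Node;
import com.github.javaparser.ast.expr.BinaryExpr;
import java.util.Set;
import java.util.stream.Collectors;

// Collect the AST node kinds of a statement, tagging binary expressions with
// their operator so that, e.g., "a + b" and "a - b" count as different kinds.
public class AstNodeDiversity {
    static Set<String> nodeKinds(String statement) {
        Node root = StaticJavaParser.parseBlock("{" + statement + "}");
        return root.findAll(Node.class).stream()
                .map(n -> n instanceof BinaryExpr
                        ? "BinaryExpr:" + ((BinaryExpr) n).getOperator()
                        : n.getClass().getSimpleName())
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> introduced = nodeKinds("int x = a - b;");
        introduced.removeAll(nodeKinds("int x = a + b;"));
        System.out.println(introduced); // [BinaryExpr:MINUS] -- the newly introduced node kind
    }
}
```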

3.4.3 Behavior Similarity with Real Bugs. Behavior similarity refers to the degree of similarity in functionality and execution between a mutation and its corresponding real bug. Because capturing all program behaviors is undecidable, we follow existing studies and employ the following metrics as proxies for behavior similarity [42, 55, 61].

Real Bug Detectability: Given the location of a real bug (which may span multiple lines), a mutation generation approach generates a set of mutations at that location. If the tests that kill these mutations can also kill the real bug, we say that the approach β€œdetects” this real bug. Given a set of real bugs, Real Bug Detectability quantifies the number of bugs detected by a mutation generation approach.
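A minimal sketch of this check, under our (hypothetical) representation of test outcomes as sets of test names:

```java
import java.util.List;
import java.util.Set;

// A real bug counts as "detected" if some test that kills one of the
// generated mutants is also a bug-triggering test.
public class BugDetectability {
    static boolean detects(List<Set<String>> killingTestsPerMutant,
                           Set<String> bugTriggeringTests) {
        return killingTestsPerMutant.stream()
                .flatMap(Set::stream)
                .anyMatch(bugTriggeringTests::contains);
    }
}
```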

Coupling Rate: Mutation testing is based on the coupling effect, i.e., although mutations are simple syntactic changes, they couple to more complex bugs [53]. To describe the extent to which a mutation couples to its corresponding real bug, we adopt the Coupling Rate, following an existing study [38]. Given a real bug and a corresponding mutant generated at the same location, the mutant is deemed a coupled mutant if the set of test cases killing the mutant overlaps with the set of bug-triggering test cases. The Coupling Rate is the proportion of coupled mutants within (𝐢 βˆ’ π‘ˆ).
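The per-mutant coupling check is the same set-overlap test applied mutant by mutant; a sketch under the same hypothetical representation as above:

```java
import java.util.List;
import java.util.Set;

// A mutant is "coupled" to its real bug if at least one test that kills the
// mutant is also a bug-triggering test; the rate averages over the mutants
// in (C - U) generated at the bug's location.
public class CouplingRate {
    static double couplingRate(List<Set<String>> killingTestsPerMutant,
                               Set<String> bugTriggeringTests) {
        long coupled = killingTestsPerMutant.stream()
                .filter(kills -> kills.stream().anyMatch(bugTriggeringTests::contains))
                .count();
        return (double) coupled / killingTestsPerMutant.size();
    }
}
```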

3.5 Experiment Settings

3.5.1 Settings for Mutation Generation. One major goal of our study is to investigate the similarity between LLM-generated mutations and real bugs. Therefore, we generate mutations within the context of real bugs to enable such a comparison. To determine the number of mutations to request per prompt, we adopt the setting of PIT [12]: running PIT with its full set of operators, we find that it generates 1.15 mutations per line of code. Therefore, based on the given code context, we generate one mutation per line. We explore the influence of context length on mutation generation in Section 5.

3.5.2 Settings for RQ1-RQ2. In the first part of our study, we intend to qualitatively and comparatively evaluate the capability of LLMs in generating mutations. We adopt the default settings, which comprise two models (i.e., GPT-3.5-Turbo and CodeLlama), and use the default prompt template, generating mutations for the full set of 405 bugs described in Section 3.2.

To understand the performance of the LLMs, we adopt LEAM [70], πœ‡Bert [15], PIT [12], and Major [39] as baselines. Among these approaches, LEAM is the state-of-the-art mutation generation approach, which employs a model trained on 13 million real Java bugs. πœ‡Bert is based on BERT [20], a masked pre-trained language model that cannot respond conversationally with human-like text and code. PIT [12] and Major [39] are popular traditional mutation testing tools with simple, human-predefined mutation operators. Note that for PIT and Major, we employ all mutation operators.

3.5.3 Settings for RQ3-RQ4. In the second part of our study, we explore how different prompts and different models impact the performance of mutation generation. Due to budget limitations, we sample a subset of 105 bugs from our dataset, consisting of 10 bugs sampled from each Defects4J project and all ConDefects bugs.

In RQ3, for the prompt exploration, we use the GPT-3.5-Turbo model and vary the prompt template. More specifically, we modify the default prompt template by adding and removing different sources of information in the Context, finally obtaining four types of prompts (sketched after the list), as follows:

(1) Prompt 1 (P1): The Default Prompt. P1 is the default prompt, shown in Figure 1.

(2) Prompt 2 (P2): P1 without Few-Shot Examples. P2 is based on P1 by removing all the examples.

(3) Prompt 3 (P3): P2 without Whole Java Method. P3 is based on P2 by removing the surrounding Java method code, keeping only the target code element to be mutated.

(4) Prompt 4 (P4): P1 with Unit Tests. Based on the default prompt P1, P4 further adds the source code of the corresponding unit tests of the target method to the Context.
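To illustrate how these variants relate structurally (the wording and field names below are purely hypothetical; the actual template is the one shown in Figure 1):

```java
// Hypothetical sketch of assembling the four prompt variants from optional
// context blocks; it mirrors the add/remove relationships of P1-P4, not the
// paper's actual template text.
public class PromptVariants {
    static String build(String targetCode, String wholeMethod,
                        String fewShotExamples, String unitTests) {
        StringBuilder prompt = new StringBuilder("Generate one mutation per line.\n");
        if (fewShotExamples != null) prompt.append("Examples:\n").append(fewShotExamples);
        prompt.append("Context:\n").append(wholeMethod != null ? wholeMethod : targetCode);
        if (unitTests != null) prompt.append("Unit tests:\n").append(unitTests);
        return prompt.toString();
    }

    // P1: build(code, method, examples, null)   -- default prompt
    // P2: build(code, method, null, null)       -- P1 without few-shot examples
    // P3: build(code, null, null, null)         -- P2 without the whole method
    // P4: build(code, method, examples, tests)  -- P1 plus unit tests
}
```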

In RQ4, for the LLM exploration, we use the default prompt and investigate the performance of GPT-3.5-Turbo, GPT-4-Turbo, CodeLlama-13b-Instruct, and StarChat-𝛽-16b.

3.5.4 Settings for RQ5. Generating non-compilable mutations incurs additional, meaningless cost, so in the third part of our study, we investigate the root causes behind LLMs generating non-compilable mutations. Specifically, we sample 384 non-compilable mutations from each mutation generation approach (the sample size corresponding to a 95% confidence level and a 5% margin of error) and analyze the reasons they are rejected by the Java compiler as well as the features of their surrounding code context.

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.