Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

Mutation Testing and Mutation Generation. Mutation testing is a systematic testing approach that aims to guide the creation of high-utility test suites. It works by introducing syntactic modifications (i.e., mutations) into the source code of the programs under test. By observing whether test suites can detect these changes [17, 27], one can identify weaknesses of the test suites, i.e., mutants that are not killed. Beyond testing, mutation analysis is a tool that supports multiple applications in software engineering [24, 25, 41, 48, 63, 67, 79, 81]. Most commonly, mutations are employed as substitutes for real-world bugs [6, 13, 38, 52] and to guide fault localization [50, 60], test prioritization [68], and program repair [84, 85].
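To make this concrete, the following is a minimal, hypothetical Java illustration (not taken from the studied benchmarks) of an original method, a mutant obtained by replacing a relational operator, and two tests; the mutant is killed only if a test that passes on the original fails on the mutant:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class MutationExample {

    // Original method under test.
    static int max(int a, int b) {
        return a > b ? a : b;
    }

    // Mutant: the relational operator '>' is replaced with '<'.
    // (In a real tool the mutant replaces max in a separate build;
    // it is shown as a second method here only for illustration.)
    static int maxMutant(int a, int b) {
        return a < b ? a : b;
    }

    // Kills the mutant: passes on max (returns 7) but would fail
    // on the mutant (which returns 3).
    @Test
    void distinctInputs() {
        assertEquals(7, max(7, 3));
    }

    // Does not kill the mutant: both versions return 5, so a suite
    // containing only this test leaves the mutant alive, revealing
    // a weakness of the suite.
    @Test
    void equalInputs() {
        assertEquals(5, max(5, 5));
    }
}
```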

All these tasks require high-utility mutations. Every mutation must be compiled and executed against the tests, which is computationally expensive; therefore, generating redundant or non-compilable mutations, as most existing learning-based approaches do, increases the computational cost involved [23, 76, 77, 87]. Additionally, when mutations are used as substitutes for real defects, they should be as syntactically close to real bugs as possible, i.e., more natural [29, 36]. Moreover, for tasks such as fault localization [50, 60] and program repair [34, 44, 91], the behavior of the mutations needs to couple with that of real faults/bugs for these techniques to be effective.

Traditional mutation testing approaches generate mutations through pre-defined program transformation rules (i.e., mutation operators) and are therefore referred to as rule-based. For example, one popular mutation operator replaces one arithmetic binary operator with another (e.g., a + b ↦ a − b). The majority of existing mutation testing approaches are rule-based, such as PIT [12], Major [39], Javalanche [66], Proteum [16], Milu [32], AccMut [77], and WinMut [76]. More recently, researchers have proposed using deep learning techniques to generate mutations. For instance, DeepMutation [73] uses neural sequence-to-sequence translation, while LEAM [70] employs learning-guided rule-based expansion. However, these approaches struggle to generate syntactically correct mutations and mutations with diverse forms. Perhaps the closest work to ours is that of Degiovanni et al., who proposed μBert [15]. μBert generates mutations by masking the code element to be mutated and then predicting its replacement with BERT [20].
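The sketch below contrasts the two styles on a hypothetical Java statement; it is a simplification, and μBert's actual tokenization, masking granularity, and candidate ranking differ:

```java
class MaskingIllustration {

    // Original statement.
    static int total(int price, int quantity) {
        return price * quantity;
    }

    // Rule-based mutant: arithmetic operator replacement (as in PIT or Major)
    // rewrites '*' into '+'.
    static int totalRuleBased(int price, int quantity) {
        return price + quantity;
    }

    // Mask-and-predict mutant in the style of μBert: the token to mutate is
    // masked ("return <mask> * quantity;") and a pre-trained masked language
    // model proposes replacement tokens; the prediction 'quantity' yields:
    static int totalMaskPredicted(int price, int quantity) {
        return quantity * quantity;
    }
}
```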

Several recent studies have also explored the use of LLMs for mutation generation [14, 71]. However, these approaches have not been thoroughly evaluated: they only count the number of mutations introduced and compute the mutation scores achieved by some test suites. In contrast, we thoroughly assess the utility of the generated mutations (through both quantitative and qualitative analyses), compare them against the key previous baselines, and address the experimental parameters and concerns related to the use of LLMs, i.e., data leakage, cost, and added value.

LLMs for Code. Recently, Large Language Models (LLMs) have shown remarkable potential in a wide range of natural language processing (NLP) [8] and coding tasks [22, 78]. Existing LLMs are mostly based on the Transformer architecture [74] and are trained on large corpora of data, including text and source code. LLMs are driven by natural-language instructions, i.e., prompts, which guide the models to perform the target tasks; the design of prompts plays a crucial role in obtaining accurate and useful responses from these models.
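As a hedged illustration of what such a prompt might look like (a hypothetical sketch, not the exact prompt used in this study), the following Java snippet assembles an instruction asking a model to mutate a given method:

```java
class MutationPromptSketch {

    // Builds a hypothetical mutation-generation prompt. The prompts actually
    // evaluated in this study may differ in wording, few-shot examples, and
    // output format.
    static String buildPrompt(String methodSource) {
        return String.join("\n",
                "You are a mutation testing assistant for Java.",
                "Generate 3 mutants of the method below.",
                "Each mutant must change exactly one statement, must compile,",
                "and must not be semantically equivalent to the original.",
                "Return each mutant as a complete method.",
                "",
                "Method:",
                methodSource);
    }
}
```

Requirements such as compilability and a single changed statement illustrate how prompt wording can steer a model toward usable, non-equivalent mutants.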

Closed-source LLMs are proprietary models developed and maintained by companies or organizations, often offering advanced features and requiring a subscription or payment for access, while open-source LLMs are freely available models that users can inspect and modify. Among the models used in our study, GPT-3.5-Turbo and GPT-4-Turbo are closed-source LLMs released by OpenAI [4], while CodeLlama and StarChat are open-source. An open-source LLM is usually fine-tuned from a general-purpose LLM for specific tasks. For example, CodeLlama [64] is derived from Llama [72] by fine-tuning on extensive code datasets, while StarChat is built on StarCoder [45] to enhance conversational abilities.

This paper is available on arXiv under a CC BY 4.0 DEED (Attribution 4.0 International) license.