Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

3 STUDY DESIGN

3.1 Overview and Research Questions

Our paper thoroughly investigates the capability of existing LLMs in mutation generation. We design research questions covering the evaluation under the default setting of our study, the impact of different prompt engineering strategies and models, and a root cause analysis of underperforming aspects. Specifically, we aim to answer the following five research questions.

• RQ1: How do the LLMs perform in generating mutations in terms of cost and usability?

• RQ2: How do the LLMs perform in generating mutations in terms of behavior similarity with real bugs?

• RQ3: How do different prompts affect the performance in generating mutations?

• RQ4: How do different LLMs affect the performance in generating mutations?

• RQ5: What are the root causes and compilation error types for the generation of non-compilable mutations?

3.2 Datasets

We intend to evaluate the mutation generation approaches with real bugs, and thus we need bug datasets with the following properties:

• The datasets should comprise Java programs, as existing methods are primarily based on Java, and we need to compare with them.

• The bugs in the datasets should be real-world bugs so that we can compare mutations against real bugs.

• Every bug in the datasets has a correct fixed version provided by developers, so that we can mutate the fixed version and compare the mutations with the corresponding real bug.

• Every bug is accompanied by at least one bug-triggering test, because we need to measure whether the mutations affect the execution of the bug-triggering tests (see the illustrative sketch after this list).
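As a hypothetical illustration of the last two properties (the example below is ours and is not drawn from either dataset), the sketch shows a developer-fixed method, a mutation of that fixed version that happens to reintroduce the original fault, and a bug-triggering test that passes on the fixed version but fails on both the real bug and the mutation.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical example (not taken from Defects4J or ConDefects), used only to
// illustrate the dataset properties listed above.
public class SumExample {

    // Developer-provided fixed version.
    static int sum(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) { // the real bug used "i <= a.length"
            total += a[i];
        }
        return total;
    }

    // A mutation generated on the fixed version; this particular mutant
    // reintroduces the off-by-one fault, i.e., it behaves like the real bug.
    static int sumMutant(int[] a) {
        int total = 0;
        for (int i = 0; i <= a.length; i++) { // mutated: "<" replaced by "<="
            total += a[i];
        }
        return total;
    }

    // Bug-triggering test: passes on the fixed version, but fails with an
    // ArrayIndexOutOfBoundsException on the real bug and on the mutant above.
    @Test
    public void bugTriggeringTest() {
        assertEquals(6, sum(new int[]{1, 2, 3}));
    }
}
```

Intuitively, a mutation of the fixed version that causes the bug-triggering test to fail again, as the mutant above does, is the kind of behavior we later compare against real bugs.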

To this end, we employ Defects4J v1.2.0 [37] and ConDefects [82] to evaluate the mutation generation approaches, as shown in Table 1. In total, we conduct the experiments on 440 bugs.

Defects4J is a widely used benchmark in the field of mutation testing [15, 38, 40, 43, 55, 70]. It contains historical bugs from 6 open-source projects of diverse domains, ensuring a broad representation of real-world bugs; in total, these 6 projects contain 395 real bugs. However, from Table 2 and Table 1, we observe that the time spans of the Defects4J bugs are earlier than the LLMs' training cutoff, which may introduce data leakage. Therefore, we supplement it with another dataset, ConDefects [82], designed to address data leakage concerns. ConDefects consists of tasks from the AtCoder [2] programming contest. To prevent data leakage, we exclusively use bugs reported after the LLMs' release dates, specifically those identified on or after August 31, 2023; in total, we collect 45 Java programs from ConDefects.

This paper is available on arXiv under a CC BY 4.0 DEED (Attribution 4.0 International) license.