Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

5 DISCUSSION

5.1 Sensitivity to Chosen Experiment Settings

5.1.1 Experiment Setup of the Context Length. To compare the similarity to real defects, we deliberately select the context around the bug locations when generating mutations for all the approaches. To check the impact of context selection, we conduct experiments on 60 bugs from Defects4J and compare how different context lengths affect performance. In our previous experiments, the context length is set to three lines around the bug location, so we additionally evaluate two-line and one-line contexts, as shown in Table 9. To measure the similarity of their performance, we use the Spearman and Pearson correlation coefficients, as shown in Table 10. Their mutual correlations all exceed 0.95, which is above the typical threshold of 0.85 [40, 51, 69], and the 𝑝-values are all below 0.05. Different context lengths therefore perform similarly, which validates our experiment setup.
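The agreement check above reduces to computing rank and linear correlations between the per-setting metric values and comparing them against the 0.85 threshold. The sketch below illustrates this check with SciPy; the metric vectors are hypothetical placeholders, not data from the study.

```python
# A minimal sketch (not the authors' script) of the similarity check between
# two context-length settings. The metric values below are hypothetical.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-bug metric values under three-line and one-line contexts.
metrics_three_lines = [0.62, 0.71, 0.55, 0.80, 0.68]
metrics_one_line = [0.60, 0.73, 0.52, 0.79, 0.70]

rho, rho_p = spearmanr(metrics_three_lines, metrics_one_line)
r, r_p = pearsonr(metrics_three_lines, metrics_one_line)

THRESHOLD = 0.85  # typical agreement threshold cited in the text
print(f"Spearman rho={rho:.3f} (p={rho_p:.3g}), Pearson r={r:.3f} (p={r_p:.3g})")
print("settings agree" if min(rho, r) >= THRESHOLD and max(rho_p, r_p) < 0.05
      else "settings differ")
```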

5.1.2 Impact of Different Few-Shot Examples. To measure the impact of different few-shot examples on the results, we conduct comparative experiments with various example sets. First, we randomly select 3, 6, and 9 additional examples from QuixBugs, ensuring no overlap with the default 6 examples (see the sketch after this paragraph). Additionally, despite the risk of data leakage, we select 6 examples from the Defects4J dataset. The results are presented in Table 11.
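A minimal sketch of drawing additional few-shot examples without overlapping the default set is shown below. The example identifiers and pool are hypothetical and do not correspond to the actual QuixBugs programs used in the study.

```python
# Hypothetical illustration of sampling extra few-shot examples while
# excluding the default six; names are placeholders, not the study's set.
import random

default_examples = {"example_a", "example_b", "example_c",
                    "example_d", "example_e", "example_f"}
candidate_pool = {f"quixbugs_{i}" for i in range(40)} | default_examples

def sample_examples(k, seed=0):
    """Randomly pick k examples from the pool, excluding the default six."""
    rng = random.Random(seed)
    available = sorted(candidate_pool - default_examples)
    return rng.sample(available, k)

for k in (3, 6, 9):
    print(k, sample_examples(k))
```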

We further compute their similarity, shown in Table 12. Their correlations all exceed 0.95, well above the typical threshold of 0.85, and the 𝑝-values are all below 0.05. Therefore, we conclude that the different prompts perform very similarly.

5.1.3 Performance of Using the Same Number of Mutations. In Section 4, we did not restrict the number of mutations generated by each approach. To study the performance of each approach under a fixed quantity, we follow the settings of existing studies [28, 70] and limit the number of mutations generated by all methods to the minimum produced by 𝜇Bert [15], i.e., 16,785. For approaches exceeding this number, we randomly select 16,785 mutations for analysis. We conduct 10 rounds of random sampling, and the results are presented in Table 13. They show that, under the fixed-quantity condition, the mutations generated by GPT-3.5-Turbo remain the closest to real bugs in behavior.
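The fixed-budget comparison amounts to repeatedly downsampling each approach's mutation set to the smallest set size before recomputing the behavioral metrics. The sketch below illustrates this procedure under stated assumptions; `evaluate` is a hypothetical placeholder for the study's actual metric computation.

```python
# A minimal sketch of the fixed-budget comparison: downsample each mutation
# set to the smallest set size (16,785 in the study) over several rounds and
# average the resulting scores. `evaluate` is a hypothetical placeholder.
import random

BUDGET = 16_785   # minimum number of mutations across approaches (muBert)
ROUNDS = 10       # rounds of random sampling used in the comparison

def evaluate(mutations):
    """Placeholder for computing a behavior-similarity metric on a mutation set."""
    return len(mutations)  # stand-in value, not a real metric

def fixed_budget_score(mutations, budget=BUDGET, rounds=ROUNDS, seed=0):
    """Average the metric over repeated random samples of size `budget`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        sample = mutations if len(mutations) <= budget else rng.sample(mutations, budget)
        scores.append(evaluate(sample))
    return sum(scores) / len(scores)
```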

This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.