Authors:

(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);

(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);

(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);

(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);

(5) Jie M. Zhang, King’s College London, London, UK ([email protected]).

Abstract and 1 Introduction

2 Background and Related Work

3 Study Design

3.1 Overview and Research Questions

3.2 Datasets

3.3 Mutation Generation via LLMs

3.4 Evaluation Metrics

3.5 Experiment Settings

4 Evaluation Results

4.1 RQ1: Performance on Cost and Usability

4.2 RQ2: Behavior Similarity

4.3 RQ3: Impacts of Different Prompts

4.4 RQ4: Impacts of Different LLMs

4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations

5 Discussion

5.1 Sensitivity to Chosen Experiment Settings

5.2 Implications

5.3 Threats to Validity

6 Conclusion and References

4.2 RQ2: Behavior Similarity

The bottom three rows of Table 4 compare the mutation generation approaches on the Behavior Metrics.

4.2.1 Real Bug Detectability. GPT-3.5 detects 382 of the 395 Defects4J bugs and 39 of the 45 ConDefects bugs, i.e., 96.7% of Defects4J bugs and 86.7% of ConDefects bugs can be revealed by its mutations, achieving the best performance. Major achieves the second-best performance, detecting 362 Defects4J bugs (91.6%) and 31 ConDefects bugs (68.9%). CodeLlama-13b detects 358 Defects4J bugs (90.6%) and 30 ConDefects bugs (66.7%).
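The detectability percentages follow directly from the detected/total counts reported above; a quick sketch of the computation:

```python
# Real Bug Detectability: fraction of real bugs revealed by at least one
# generated mutation. Counts are taken from the results reported above.
detected = {
    ("GPT-3.5", "Defects4J"): (382, 395),
    ("GPT-3.5", "ConDefects"): (39, 45),
    ("Major", "Defects4J"): (362, 395),
    ("Major", "ConDefects"): (31, 45),
    ("CodeLlama-13b", "Defects4J"): (358, 395),
    ("CodeLlama-13b", "ConDefects"): (30, 45),
}

for (tool, dataset), (found, total) in detected.items():
    rate = 100.0 * found / total
    print(f"{tool} on {dataset}: {rate:.1f}%")
```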

4.2.2 Coupling Rate. The Coupling Rate measures the degree of coupling between the generated mutations and their corresponding real bugs. GPT-3.5 exhibits the best coupling rates on both datasets, 0.416 on Defects4J and 0.625 on ConDefects, while CodeLlama-13b achieves 0.398 and 0.612, respectively.
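The paper's exact coupling criterion is given in its metrics section (not shown in this excerpt). As an illustrative sketch only, one common formulation treats a mutation as coupled to a real bug when some test that kills the mutation also fails on the buggy program; the function names and example data below are hypothetical:

```python
def is_coupled(killing_tests: set, bug_failing_tests: set) -> bool:
    """Illustrative coupling check (assumed definition, not necessarily
    the paper's exact criterion): a mutation is coupled to the real bug
    if some test that kills the mutation also fails on the buggy program."""
    return bool(killing_tests & bug_failing_tests)


def coupling_rate(mutations: dict, bug_failing_tests: set) -> float:
    """Fraction of mutations coupled to their corresponding real bug.
    `mutations` maps a mutation id to the set of tests that kill it."""
    if not mutations:
        return 0.0
    coupled = sum(
        is_coupled(kills, bug_failing_tests) for kills in mutations.values()
    )
    return coupled / len(mutations)


# Hypothetical example: of three mutations, two share a failing test
# with the bug, so 2/3 of them are coupled.
muts = {"m1": {"t1", "t2"}, "m2": {"t3"}, "m3": {"t2"}}
print(coupling_rate(muts, bug_failing_tests={"t2"}))
```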

4.2.3 Ochiai Coefficient. The Ochiai Coefficient measures the semantic similarity between mutations and real bugs. GPT-3.5 leads with coefficients of 0.638 on Defects4J and 0.689 on ConDefects; Major ranks second with 0.519 and 0.600, respectively; CodeLlama-13b scores 0.390 and 0.378. Despite the notable gap between GPT-3.5 and CodeLlama-13b, their results are consistent across the two datasets.
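The Ochiai coefficient between a mutation and a real bug can be computed from their failing-test sets. A minimal sketch, assuming the standard set formulation Ochiai(A, B) = |A ∩ B| / sqrt(|A|·|B|), where A is the set of tests failing on the mutant and B the set failing on the buggy program:

```python
import math


def ochiai(failing_on_mutant: set, failing_on_bug: set) -> float:
    """Standard Ochiai similarity over two failing-test sets:
    |A ∩ B| / sqrt(|A| * |B|). Returns 0.0 if either set is empty."""
    if not failing_on_mutant or not failing_on_bug:
        return 0.0
    overlap = len(failing_on_mutant & failing_on_bug)
    return overlap / math.sqrt(len(failing_on_mutant) * len(failing_on_bug))


# Hypothetical example: mutant and bug each fail two tests and share one,
# giving 1 / sqrt(2 * 2) = 0.5.
print(ochiai({"t1", "t2"}, {"t2", "t3"}))
```

A coefficient of 1.0 means the mutant fails exactly the same tests as the real bug; values near 0 indicate little behavioral overlap.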

4.3 RQ3: Impacts of Different Prompts

The left half of Table 7 presents the comparative results of GPT-3.5 with the different prompts listed in Section 3.5.3. Prompts P1 through P3 are progressively simplified, each containing less information than its predecessor, while P4 is the most complex, augmenting P1 with the test suite code.

Overall, P1, the default prompt, excels in Compilability Rate and all Behavior Metrics. P2, created by removing the few-shot examples from P1, leads in Average Generation Time, Useless Mutation Ratio, and Equivalent Mutation Rate, suggesting improved quality among its compilable mutations. P3, which provides only the code element to be mutated, achieves the lowest cost because it uses the fewest tokens. Conversely, P4, which extends P1 with the test suite code, shows the lowest performance across all metrics, suggesting that GPT-3.5 cannot effectively use test suite information to improve mutation quality.

This paper is available on arxiv under CC BY 4.0 DEED (Attribution 4.0 International) license.