Abstract and 1. Introduction

2. Methodology and 2.1. Research Questions

2.2. Data Collection

2.3. Data Labelling

2.4. Data Extraction

2.5. Data Analysis

3. Results and Interpretation and 3.1. Type of Problems (RQ1)

3.2. Type of Causes (RQ2)

3.3. Type of Solutions (RQ3)

4. Implications

4.1. Implications for the Copilot Users

4.2. Implications for the Copilot Team

4.3. Implications for Researchers

5. Threats to Validity

6. Related Work

6.1. Evaluating the Quality of Code Generated by Copilot

6.2. Copilot’s Impact on Practical Development and 6.3. Conclusive Summary

7. Conclusions, Data availability, Acknowledgments, CRediT authorship contribution statement and References

6.1. Evaluating the Quality of Code Generated by Copilot

Several studies have focused on various aspects of the quality of code generated by Copilot. Siddiq et al. (2022) analyzed the prevalence of code smells in the datasets used to train code generation tools like Copilot, and observed 18 types of code smells in the suggestions provided by Copilot. Yetistiren et al. (2022) evaluated the validity, correctness, and efficiency of Copilot-generated code, and their results indicate that Copilot is capable of generating valid code with a success rate of 91.5%. Pearce et al. (2022) prompted Copilot to generate code related to high-risk network security vulnerabilities in order to investigate the conditions under which Copilot may suggest insecure code, and found that vulnerabilities were present in 40% of the cases. Nguyen and Nadi (2022) used 33 LeetCode problems to evaluate the correctness and comprehensibility of Copilot's suggestions in four different programming languages. They found that the suggestions had low cyclomatic and cognitive complexity and did not differ significantly across programming languages. Moradi Dakhel et al. (2023) investigated the capabilities of Copilot on two different programming tasks and found that it was able to provide solutions for almost all basic algorithmic problems, although some of the solutions were buggy and not replicable. The study by Asare et al. (2023) suggests that, overall, Copilot does not generate code containing the same vulnerabilities that human developers had previously introduced. Sobania et al. (2022) evaluated the performance of Copilot on standard program synthesis benchmark problems, compared it with results from the genetic programming literature, and found that Copilot demonstrated more mature performance. Mastropaolo et al. (2023) investigated whether different but semantically equivalent natural language descriptions would lead to the generation of the same recommended functions; their results indicate that differences between semantically equivalent descriptions can affect the correctness of the generated code. Al Madi (2023) investigated the readability and visual inspection of code generated by Copilot, highlighting the importance of developers being cautious and vigilant when working with code generation tools such as Copilot. Gustavo et al. (2023) investigated the impact of programming with the LLMs that power Copilot and found that users assisted by LLMs produce critical security bugs at a rate no more than 10% higher than those without assistance.

6.2. Copilot’s Impact on Practical Development

Several studies have investigated the performance of Copilot in actual software development, as well as software practitioners' opinions of it. Wang et al. (2023a) interviewed 15 practitioners and then surveyed 599 practitioners from 18 IT companies regarding their expectations of code completion, and found that 13% of the participants had used Copilot as their code completion tool. Jaworski and Piotrkowski (2023) designed a survey questionnaire of 18 questions to investigate developers' attitudes toward Copilot. Their findings indicate that most respondents have a positive attitude towards the tool, though a few participants expressed concerns about security issues associated with using Copilot. Imai (2022) conducted experiments with 21 participants to compare pair programming with Copilot against pair programming with a human partner in terms of productivity and code quality. The results indicate that while Copilot can increase productivity by adding more lines of code, the generated code is of lower quality, as more lines of code had to be removed during the testing phase. Barke et al. (2023) observed 20 participants who collaborated with Copilot to complete programming tasks in four languages, and found that interaction with the programming assistant is bimodal, with participants alternating between distinct collaboration modes. Bird et al. (2023) conducted three studies aimed at understanding how developers utilize Copilot. Their findings suggest that developers spend considerable time assessing the suggestions generated by Copilot rather than completing their coding tasks. Peng et al. (2023) presented the results of a controlled experiment using Copilot as an AI pair programmer and found that the experimental group with access to Copilot completed tasks 55.8% faster than the control group. Zhang et al. (2023) investigated the programming languages, IDEs, and associated technologies used with Copilot, the functionalities it implements, and its advantages, limitations, and challenges. Vaithilingam et al. (2022) conducted a user study with 24 participants to assess the usability of Copilot and its integration into the programming workflow. They found that while Copilot may not directly enhance the efficiency of completing programming tasks, it serves as a valuable starting point for programmers and saves time spent searching online. Liang et al. (2024) surveyed software developers and found that their primary motivations for using AI programming assistants are to reduce keystrokes, complete programming tasks quickly, and recall syntax, whereas using these tools to help generate potential solutions was a less significant motivation. Gustavo et al. (2023) analyzed the use of Copilot for programming, compared it with earlier forms of programmer assistance, and explored potential challenges that could arise when applying LLMs to programming. Ziegler et al. (2024) assessed the impact of Copilot on user productivity through a case study, aiming to align user perceptions with empirical data. Their research highlights the aspects in which Copilot has enhanced users' coding productivity and how it achieves these improvements.

6.3. Conclusive Summary

Most prior studies used controlled experiments or surveys to evaluate the effectiveness of Copilot. Our research is grounded in the perspective of software developers, focusing on the real-world problems they encounter when using Copilot, as well as the underlying causes and viable solutions. By analyzing the study results, we aimed to provide insights for Copilot users, the Copilot team, and researchers. In addition, we collected data from three popular software development platforms and forums, i.e., GitHub Issues, GitHub Discussions, and Stack Overflow (SO), to ensure the comprehensiveness of our dataset.

Authors:

(1) Xiyu Zhou, School of Computer Science, Wuhan University, Wuhan, China ([email protected]);

(2) Peng Liang (Corresponding Author), School of Computer Science, Wuhan University, Wuhan, China ([email protected]);

(3) Beiqi Zhang, School of Computer Science, Wuhan University, Wuhan, China ([email protected]);

(4) Zengyang Li, School of Computer Science, Central China Normal University, Wuhan, China ([email protected]);

(5) Aakash Ahmad, School of Computing and Communications, Lancaster University Leipzig, Leipzig, Germany ([email protected]);

(6) Mojtaba Shahin, School of Computing Technologies, RMIT University, Melbourne, Australia ([email protected]);

(7) Muhammad Waseem, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.