Abstract and 1. Introduction

2. Methodology and 2.1. Research Questions

2.2. Data Collection

2.3. Data Labelling

2.4. Data Extraction

2.5. Data Analysis

3. Results and Interpretation and 3.1. Type of Problems (RQ1)

3.2. Type of Causes (RQ2)

3.3. Type of Solutions (RQ3)

4. Implications

4.1. Implications for the Copilot Users

4.2. Implications for the Copilot Team

4.3. Implications for Researchers

5. Threats to Validity

6. Related Work

6.1. Evaluating the Quality of Code Generated by Copilot

6.2. Copilot’s Impact on Practical Development and 6.3. Conclusive Summary

7. Conclusions, Data availability, Acknowledgments, CRediT authorship contribution statement and References

2. Methodology

The research goal of this study is to systematically identify the problems that developers report when using GitHub Copilot in development, as well as their underlying causes and potential solutions. We formulated three research questions (RQs), which are detailed in Section 2.1. Fig. 1 provides an overview of the research process.

2.1. Research Questions

RQ1: What are the problems faced by users while using Copilot in software development practice?

Rationale: GitHub Copilot is one of the most popular AI-assisted coding tools and has been widely used in software development, with 1.3 million paid users as of February 2024 (Wilkinson, 2024). It is therefore important to understand the specific challenges and problems users face while using this tool in software development practice.

RQ2: What are the underlying causes of these problems?

Rationale: Understanding the causes of the problems identified in RQ1 is essential to developing effective solutions to address them. By identifying these causes, the study can provide insights into how to improve the design and functionality of Copilot.

RQ3: What are the potential solutions to address these problems?

Rationale: Exploring solutions to the problems identified in RQ1, in light of the causes identified in RQ2, is essential to improving the user experience of Copilot in practical development. Identifying these solutions can show how to enhance the functionality and usability of Copilot.

2.2. Data Collection

We collected data from three sources: GitHub Issues[1], GitHub Discussions[2], and Stack Overflow (SO) posts[3]. GitHub Issues is a commonly used GitHub feature for tracking bugs, feature requests, and other issues related to software development projects, which allows us to capture the specific problems that users have encountered when coding with Copilot. GitHub Discussions is a GitHub feature for open-ended discussions among project contributors and community members, and it offers a central hub for project-related discussions and knowledge sharing. The topics in GitHub Discussions range from technical questions and suggestions to usage issues associated with Copilot. Stack Overflow is a popular public Q&A platform covering a broad spectrum of topics related to programming, development, and technology, including questions about using Copilot.

Because Copilot was announced and started its technical preview on June 29, 2021, we collected only data created after that date. The data collection was conducted on June 18, 2023. To answer RQ3, i.e., the solutions for addressing Copilot problems, we collected closed GitHub issues, as well as answered GitHub discussions and SO posts. Specifically, for GitHub issues, we used “Copilot” as the keyword to search closed Copilot-related issues across all of GitHub, and a total of 4,057 issues were retrieved. We also used “Copilot” as the keyword to search answered posts on SO, which retrieved 679 posts. Note that we did not use the “Copilot” tag for retrieval because the keyword-based method allows us to obtain a more exhaustive dataset. Unlike GitHub issues and SO posts, GitHub discussions are organized into specific subcategories, with “Copilot” included as a subcategory under the overarching “Product” category. Given the high relevance of these discussions to Copilot, we collected all 925 answered discussions under the “Copilot” subcategory.
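As an illustration of the issue search described above, the following minimal sketch builds a keyword query for closed issues in the stated date range. The text does not say which interface was used for retrieval, so targeting GitHub's public REST search endpoint is an assumption, as are the function name and the `per_page` value. The block only constructs the URL and makes no network request.

```python
from datetime import date
from urllib.parse import urlencode

# Dates taken from the text: start of the technical preview and the collection day.
PREVIEW_START = date(2021, 6, 29)
COLLECTION_DAY = date(2023, 6, 18)


def build_issue_search_url(keyword: str, start: date, end: date) -> str:
    """Build a search URL for closed issues that mention `keyword`.

    Assumption: this uses the public /search/issues endpoint for illustration
    only; the study may have used a different retrieval method.
    """
    # Search qualifiers: issues only, closed only, created within the date range.
    query = (
        f"{keyword} is:issue is:closed "
        f"created:{start.isoformat()}..{end.isoformat()}"
    )
    params = urlencode({"q": query, "per_page": 100})
    return "https://api.github.com/search/issues?" + params


url = build_issue_search_url("Copilot", PREVIEW_START, COLLECTION_DAY)
print(url)
```

The search endpoint caps the number of results returned for a single query, so a collection of several thousand issues would likely need the date range split into smaller windows, with one query per window.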

2.3. Data Labelling

We labelled the collected data to filter out items that cannot be used for this study. The filtering criterion is as follows: the issue, discussion, or post should contain specific information related to the use of GitHub Copilot.

2.3.1. Pilot Data Labelling

To minimize personal bias in the formal labelling process, the first and third authors conducted a pilot data labelling. We randomly selected 100 GitHub issues and 25 GitHub discussions, roughly 2.5% of each source. Because there were fewer SO posts, we randomly selected 35 of them, about 5% of the total. Sampling from each platform separately allowed us to verify that the two authors applied the criterion consistently across data sources. The inter-rater reliability between the two authors was measured with Cohen’s Kappa coefficient (Cohen, 1960), resulting in values of 0.824, 0.834, and 0.806, which indicate a reasonable level of agreement between the two authors. The two authors discussed any discrepancies in the results with the second author to reach a consensus. The results of the pilot data labelling were compiled and recorded in MS Excel (Zhou et al., 2024).
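The pilot procedure above can be sketched as follows: draw a random sample from each source, have two raters label it, and compute Cohen's Kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The retrieved counts and sample sizes come from the text. The function names, the seed, and the toy labels are hypothetical and are not the study's data or tooling.

```python
import random
from collections import Counter

# Retrieved counts and pilot sample sizes as reported in the text.
RETRIEVED = {"issues": 4057, "discussions": 925, "so_posts": 679}
PILOT_SIZE = {"issues": 100, "discussions": 25, "so_posts": 35}


def draw_pilot_sample(item_ids, size, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    return random.Random(seed).sample(list(item_ids), size)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treated as
        # perfect agreement by convention to avoid dividing by zero.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels (1 = relevant, 0 = not relevant); NOT the study's data.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)

# Hypothetical item ids 0..678 stand in for the retrieved SO posts.
sample = draw_pilot_sample(range(RETRIEVED["so_posts"]), PILOT_SIZE["so_posts"])
print(f"kappa on toy labels: {kappa:.3f}, pilot sample size: {len(sample)}")
```

On the toy labels the raters agree on 8 of 10 items while chance agreement is 0.5, so Kappa is about 0.6, which is lower than the values above 0.8 reported for the actual pilot.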

2.3.2. Formal Data Labelling

The first and third authors then conducted the formal data labelling. During this process, we excluded a large amount of data unrelated to our research. For instance, “Copilot” sometimes has another meaning, such as the “co-pilot” of an aircraft. In other cases, Copilot is mentioned only in passing without further information, as in a post stating, “You can try using Copilot, which is amazing”. We excluded such data because they provide no useful information about the usage of Copilot. During the labelling process, any result on which the two authors disagreed was discussed with the second author until an agreement was reached. Ultimately, the two authors retained 476 GitHub issues, 706 GitHub discussions, and 142 SO posts. The data labelling results were compiled and recorded in MS Excel (Zhou et al., 2024).
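To make the effect of the filtering concrete, the short sketch below computes the share of retrieved items that remained after labelling for each source. All counts are taken from Sections 2.2 and 2.3.2. The variable names are illustrative, and the retention rates are derived here rather than reported by the authors.

```python
# Items retrieved per source (Section 2.2) and retained after labelling (Section 2.3.2).
RETRIEVED = {"GitHub issues": 4057, "GitHub discussions": 925, "SO posts": 679}
RETAINED = {"GitHub issues": 476, "GitHub discussions": 706, "SO posts": 142}

# Share of each source that satisfied the filtering criterion.
retention = {src: RETAINED[src] / RETRIEVED[src] for src in RETRIEVED}

total_retrieved = sum(RETRIEVED.values())
total_retained = sum(RETAINED.values())

for src, rate in retention.items():
    print(f"{src}: {RETAINED[src]} of {RETRIEVED[src]} retained ({rate:.1%})")
print(f"Overall: {total_retained} of {total_retrieved} retained")
```

The much higher retention for GitHub discussions is consistent with those items coming from a dedicated “Copilot” subcategory, whereas the issues and SO posts were retrieved by keyword search.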

Authors:

(1) Xiyu Zhou, School of Computer Science, Wuhan University, Wuhan, China;

(2) Peng Liang (Corresponding Author), School of Computer Science, Wuhan University, Wuhan, China;

(3) Beiqi Zhang, School of Computer Science, Wuhan University, Wuhan, China;

(4) Zengyang Li, School of Computer Science, Central China Normal University, Wuhan, China;

(5) Aakash Ahmad, School of Computing and Communications, Lancaster University Leipzig, Leipzig, Germany;

(6) Mojtaba Shahin, School of Computing Technologies, RMIT University, Melbourne, Australia;

(7) Muhammad Waseem, Faculty of Information Technology, University of Jyväskylä, Jyväskylä, Finland.


This paper is available on arXiv under the CC BY 4.0 DEED license.