Table of Links
2. Contexts, Methods, and Tasks
3.1. Quality and 3.2. Productivity
5. Discussion and Future Work
5.1. LLM, Your pAIr Programmer?
5.2. LLM, A Better pAIr Programmer?
5.3. LLM, Students’ pAIr Programmer?
6. Conclusion, Acknowledgments, and References
3 MIXED OUTCOMES
Literature reviews of human-human pair programming have reported various benefits as well as mixed effects. In the industry context, according to Alves De Lima Salge and Berente [5], pair programming improves code quality, increases productivity, and enhances learning outcomes. However, according to Hannay et al. [31], while pair programming improves quality and shortens duration, it increases effort: higher quality comes at the expense of considerably greater effort, and reduced completion time comes at the cost of lower quality. In the education context, pair programming brings benefits including higher-quality software, greater student confidence in solutions, higher assignment grades and exam scores, improved success/passing rates in introductory courses, and better retention [29, 52, 83]. All reviews of human-human pair programming acknowledge that even though a meta-analysis can show an overall trend with a significant effect size, individual studies may report contradictory outcomes (see examples in Table 1).
For human-AI pair programming, existing work focuses mainly on quality, productivity, and satisfaction, and has already demonstrated mixed results in quality and productivity [8, 35, 84] (see examples in Table 1). Additionally, there is not yet enough research for a comprehensive review, so no conclusion can be reached on the effectiveness of human-AI pair programming. It is also hard to compare the human-human and human-AI pair programming literature, as the two differ in the outcomes and measurements they adopt.
Therefore, in the top rows of Table 1, we list the most common outcome variables in both bodies of literature (quality, productivity, satisfaction, learning, and cost), together with sample works that demonstrate the mixed outcomes and varied measures. We elaborate on the variety of ways to measure some of the listed outcomes as follows.
3.1 Quality
In the human-human pair programming literature, quality can be measured using defect density, perceptual effort measures, readability, functionality, the number of test cases passed, code complexity, scores, expert opinions, etc. [5, 70, 79].
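As an illustrative sketch (our own formulation, not drawn from the cited reviews), one widely used quality measure, defect density, is typically computed as the number of defects per thousand lines of code:

\[
\text{defect density} = \frac{\#\,\text{defects found}}{\text{KLOC}}
\]

For example, a program with 12 defects in 4,000 lines of code would have a defect density of 3 defects/KLOC.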
3.2 Productivity
In the human-human pair programming literature, duration, effort, and productivity are all types of “efficiency” outcomes that involve time and accomplishment. Productivity can be measured as the number of tasks completed in a fixed unit of time; duration can be measured as the elapsed or total time used to complete a fixed number of tasks to a certain standard; and effort can be measured as twice the duration for a pair, the person-hours required, etc. [5]. We use productivity as an aggregated outcome variable spanning these different measures, for consistency with the human-AI literature.
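To make the distinction concrete, a minimal sketch of the three efficiency measures under the definitions above (exact operationalizations vary across studies):

\[
\text{productivity} = \frac{\text{tasks completed}}{\text{unit time}}, \qquad
\text{duration} = \text{elapsed time for a fixed task set}, \qquad
\text{effort} \approx 2 \times \text{duration}
\]

where the factor of 2 reflects the two programmers in a pair, i.e., effort counts person-hours rather than elapsed hours.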
In current human-AI works, some measures are arguably too simplistic to serve as evaluation metrics. For example, Imai [35] used the number of lines of added code as the measure of productivity; however, the nature of interacting with Copilot (pressing Tab to accept suggestions) is likely to inflate the number of added lines in the human-Copilot condition, so how well this measure represents the notion of productivity is questionable.
Note that some researchers have examined programmers’ perceived productivity when working with Copilot and found that it correlates most strongly with the overall acceptance rate of AI-generated code [90]. This result is not included in Table 1, to stay consistent with the human-human pair programming literature, as perceived productivity is a different measure from actual productivity.
3.3 Learning
In the human-human pair programming literature, learning can be assessed by quantitative measures such as assignment grades, exam scores, passing rates, and retention rates, or by qualitative measures of higher-order thinking skills [29, 52, 83].
3.4 Cost
In terms of cost, participants have been observed to face challenges in understanding and debugging Copilot-generated code, which leads to the hypothesis that human-AI pair programming could incur additional effort and hinder programmers’ task-solving effectiveness [12, 84]. However, Dakhel et al. [21] show that although Copilot’s code can be less correct than human-written code, its bugs are easier to debug than human errors. There is currently no work that experimentally characterizes the costs of human-AI pair programming.
Summary: The literature on human-human pair programming has shown mixed results across many outcome variables, including quality, productivity, satisfaction, learning, and cost. For human-AI pair programming (mostly human-Copilot in this paper), there are still only a few works, with less comprehensive measures, but mixed outcomes are also observed. We further review the potential causes of the mixed outcomes of both modes of pair programming in Section 4.
Authors:
(1) Qianou Ma (Corresponding author), Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(2) Tongshuang Wu, Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(3) Kenneth Koedinger, Carnegie Mellon University, Pittsburgh, USA ([email protected]).