Table of Links
2. Contexts, Methods, and Tasks
3.1. Quality and 3.2 Productivity
5. Discussion and Future Work
5.1. LLM, Your pAIr Programmer?
5.2. LLM, A Better pAIr Programmer?
5.3. LLM, Students’ pAIr Programmer?
6. Conclusion, Acknowledgments, and References
5.2 LLM, A Better pAIr Programmer?
As reviewed in Section 3, previous literature has explored a variety of measures to evaluate different aspects of human-human pair programming, while current explorations of human-AI pair programming remain quite limited. Murillo and D’Angelo [54] have proposed evaluation metrics for LLM-based creative code-writing assistants in software engineering. Future work could adopt more of the validated measures from the human-human pair programming literature to explore how to best help humans and LLM-based AI programming assistants collaborate. It would also be interesting to set up a study with three conditions – human-human, human-AI, and human solo – working on the same task.
Previous literature suggests some key factors in the success of human-human pair programming, as summarized in Table 1. The moderators that cause challenges for human-human pair programming may also present opportunities to explore in human-AI pair programming (Table 2). For example, self-efficacy can lead to differences in satisfaction [81] and gender can lead to differences in learning [47]; do these compatibility moderators influence pAIr too? Can we improve pAIr outcomes using insights derived from the human-human literature (e.g., simulating an AI partner with a similar self-efficacy level and the same gender)? In general, then, future work can ask: Could these factors be implemented for human-AI pair programming? Would they make human-AI pair programming more effective, less effective, or have no influence, and why?
Task Types & Complexity. As we know from the human-human pair programming literature, a good collaborative task of the right complexity is important, but creating or choosing such tasks can be difficult. Meanwhile, LLMs help educators efficiently generate instructional materials such as questions [85], question-answer pairs [40], feedback [20], and hints [61], which can be of quality comparable to human-authored content. There is also work suggesting preliminary success in using LLMs to break problems down into sub-questions [78]. Therefore, based on these insights from the human-human pair programming literature and the known capabilities of LLMs, an open question for pAIr programming is: can an LLM be configured to generate task types with collaborative learning goals and customize task complexity for a given programmer?
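As a concrete illustration of what such configuration might look like, here is a minimal sketch (our illustration, not from the paper) that prompts a chat-style LLM API to produce a task at a target complexity with sub-questions the pair could divide; the model name, prompt wording, and rubric fields are assumptions.

```python
# Hypothetical sketch: generating a pair-programming task at a target complexity.
# Assumes an OpenAI-style chat API; any comparable LLM API would work similarly.
from openai import OpenAI

client = OpenAI()

def generate_pair_task(topic: str, complexity: str, learner_profile: str) -> str:
    """Ask the model for a problem statement plus sub-questions a pair can split."""
    prompt = (
        f"Create a short programming exercise on {topic} for pair programming.\n"
        f"Target complexity: {complexity}.\n"
        f"Learner profile: {learner_profile}.\n"
        "Include: (1) a problem statement, (2) 3-4 sub-questions that a driver and "
        "a navigator can discuss and divide, and (3) one collaborative learning goal."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_pair_task("list comprehensions", "CS1 novice",
                         "has completed loops and conditionals"))
```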
Compatibility - Expertise. In terms of the compatibility factor expertise, the pair programming literature suggests that matching partners with a similar level of expertise may be best for promoting productivity and learning [5, 16, 31]. Evaluation studies show that GPT-3-based models can perform like an above-average student in a CS1 classroom [22, 68], with performance degrading as the code becomes more complicated [89]. GPT-4 does even better at solving introductory and basic programming problems (although its correctness is still not comparable to that of a developer in practice) [14]. We can also purposefully generate bugs and let the models make mistakes [38], so we may potentially create an AI partner with a skill level similar to that of novice students. Future work can examine how to configure AI to adapt to students’ skill levels and whether doing so is effective.
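One way to prototype such an expertise-matched partner is through a system prompt; the sketch below is our illustration under that assumption (the persona wording, mistake rate, and model name are placeholders, not the paper's method).

```python
# Hypothetical sketch: a system-prompt-configured "novice peer" AI partner.
from openai import OpenAI

client = OpenAI()

NOVICE_PEER_SYSTEM_PROMPT = """\
You are a pair-programming partner with the skill level of a first-semester CS student.
- Only use constructs covered in a typical CS1 course (variables, loops, functions, lists).
- Think aloud, propose ideas tentatively, and ask your partner to check your reasoning.
- Occasionally (about 1 in 5 suggestions) make a realistic novice mistake, such as an
  off-by-one error, so your partner has opportunities to catch and explain it.
- Never present a complete polished solution on your own.
"""

def novice_partner_reply(conversation: list[dict]) -> str:
    """conversation: prior turns as [{'role': 'user'|'assistant', 'content': ...}]."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "system", "content": NOVICE_PEER_SYSTEM_PROMPT}, *conversation],
    )
    return response.choices[0].message.content
```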
Other Compatibility Factors. Researchers have explored how to let LLMs generate interactions based on a designed persona and reasonably replicate human behavior [1, 34]; in education, Cao [15] let LLMs interact with students while role-playing different fictional characters to help reduce students’ anxiety and increase motivation. It is thus possible to personalize an AI partner with different personality traits or with the other pair compatibility factors that Salleh et al. [70] proposed, such as gender, ethnicity, and self-esteem. Such personalization could potentially increase programmers’ motivation and engagement, but how useful it is for human-AI pair programming has yet to be examined.
Communication. For communication, we know that the social aspects of a conversation matter [17] and that some types of discourse are more effective at facilitating debugging [55] in human-human pair programming. Since LLM-based tools such as ChatGPT are able to simulate social interaction, it would be interesting to explore whether LLMs can support different types of communication, whether the different components of communication can be replicated in an LLM-based programming assistant, and whether doing so is effective.
Collaboration. In terms of collaboration, it is frequently reported that achieving smooth collaboration is challenging in both industry [11] and educational contexts [57, 87]. Given that the free-rider problem reduces pair programming’s effectiveness [57] and that regular role-switching potentially alleviates the driver’s cognitive load and ensures balanced learning outcomes [5, 83], it would be interesting to explore whether LLM-based AI can be configured to avoid over-helping and to support role-switching, and how to best support collaboration within the human-AI pair.
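A minimal sketch of how role-switching and over-help avoidance might be enforced around an LLM partner is shown below; the switch interval, role instructions, and model name are assumptions a study would need to tune, not a system described in the paper.

```python
# Hypothetical sketch: alternating the AI between navigator and driver roles.
from openai import OpenAI

client = OpenAI()

ROLE_PROMPTS = {
    # As navigator, the AI must not write code: it reviews, questions, and hints.
    "navigator": "You are the navigator. Do NOT write code. Review your partner's code, "
                 "ask guiding questions, and give at most one hint per turn.",
    # As driver, the AI writes only small increments and pauses for feedback.
    "driver": "You are the driver. Write only the next small increment of code (a few "
              "lines), explain it briefly, and pause for your partner's feedback.",
}

def ai_turn(ai_role: str, history: list[dict]) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "system", "content": ROLE_PROMPTS[ai_role]}, *history],
    )
    return response.choices[0].message.content

def pair_session(turns_per_role: int = 3, total_turns: int = 12):
    """Alternate the AI between navigator and driver every few turns."""
    history: list[dict] = []
    ai_role = "navigator"  # start with the human driving, to discourage over-help
    for turn in range(total_turns):
        if turn > 0 and turn % turns_per_role == 0:
            ai_role = "driver" if ai_role == "navigator" else "navigator"
        human_role = "driver" if ai_role == "navigator" else "navigator"
        human_msg = input(f"[you, {human_role}] > ")
        history.append({"role": "user", "content": human_msg})
        reply = ai_turn(ai_role, history)
        history.append({"role": "assistant", "content": reply})
        print(f"[AI, {ai_role}] {reply}")
```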
Logistics. Logistics-wise, using Copilot as a programming partner has the particular advantage of avoiding scheduling logistics, but there are also accountability concerns that need to be addressed [12, 22]. In general, there will be ethical risks and social implications of using AI for pair programming in the workplace and in educational contexts, which need deeper examination in future work.
Authors:
(1) Qianou Ma (Corresponding author), Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(2) Tongshuang Wu, Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(3) Kenneth Koedinger, Carnegie Mellon University, Pittsburgh, USA ([email protected]).