Abstract and 1. Introduction

  2. Methods
  3. Quantitative Results and Creativity Support Index
  4. Qualitative Results from Focus Group Discussions
  5. Discussion
  6. Mitigations, Conclusion, and Acknowledgments
  7. Ethical Guidance and References

A. Related Work on Computational Humour, AI and Comedy

B. Participant Questionnaire

C. Focus

4 QUALITATIVE RESULTS FROM FOCUS GROUP DISCUSSIONS

In this section, we summarize the major themes that emerged from the focus group discussions. Each is presented alongside supporting quotes from various participants (anonymised as p1, p2, etc.).

4.1 Use-cases of LLMs in comedy writing and quality of generated outputs

Participants described a diverse array of use cases for LLMs in their writing practice, including as a conversational brainstorming partner (p19), critic (p6), choreographic assistant (p3), translator (p1, p11, p12, p19, p20), and historical guru (p13). They generally reflected positively on the potential of LLMs to assist with some tasks within the comedy writing process. However, many participants commented on the overall poor quality of generated outputs, and the amount of human effort required to arrive at a satisfying result.

4.1.1 LLMs can be an effective first step for quickly generating content and structure. Participants described the utility of LLMs for generating content much faster than human writers. They described success using LLMs to generate first drafts, which then required significant edits from human writers: “AI allows you to kind of get that s*** first draft immediately” (p6). Participant p14 called their initial output “a vomit draft that I know that I’m gonna have to iterate on and improve.” Many participants also described using LLMs to generate a structure for a sketch or other performance, whose details they could then fill in—the LLM “spat out a scene which provided a lot of structure” (p17).

4.1.2 Generated outputs are generally of poor comedic quality. Many participants noted that they used LLMs only for setup and structure generation because they could not elicit humorous outputs from the models: “the most bland, boring thing—I stopped reading it. It was so bad” (p6), “just consistently bad [...] didn’t really improve on the jokes” (p10). Some participants identified aspects critical to comedy that LLMs seemed incapable of capturing: “AI generated material has a lack of agency. [...] lacking that little bit of urgency that shows it can be emotional” (p11).

4.1.3 Participants had difficulty steering LLMs away from bland and generic outputs. Six participants described LLM-generated outputs as “bland” or “generic,” qualities that make the models poor producers of comedic or artistic material: “the words seem very generic. They lack that incisiveness that I often find with human written language” (p11); “if you zoom out on the story that it told, it wasn’t really a good story or a creative story” (p20). Participants also described trying various prompting approaches, with little success at eliciting more specific or interesting responses: “no matter how much I prompt [...] it’s a very straightlaced, sort of linear approach to comedy” (p11).

4.1.4 The human writer still produces the humorous elements in co-written text. The importance of human writers in providing the comedic aspects of material written with LLMs was a common theme. Many participants commented that while LLMs could provide effective setup or structure, they often could not provide the humor: “usually it can serve in a setup capacity. I more often than not provide the punchline” (p17). When participants had success using LLMs in the writing process, they still attributed the best parts of produced output to the human in the loop: “the only thing again that is funny in what I gave you is the joke I put into the prompting” (p15).

4.1.5 Lack of concern over ownership of generated content. In response to questions about feelings of ownership over content that was co-written with LLMs, most participants felt little concern. For some, this was due to the poor quality of generated outputs: “most of the jokes I was writing [are] the level of, I will go on stage and experiment with it, but they’re not at the level of, I’d be worried if anyone took one of these jokes” (p14). For others, it was due to the amount of human effort required in improving generated outputs: “I don’t feel a lot of ownership, because there’s no finished product. If I could polish it, then it would feel more like it is mine” (p19).

4.2 Limitations introduced by moderation and safety filtering

Participants commented on the moderation and safety filtering applied to widely-available language models. They remarked that this moderation limited the creative agency of human writers using LLMs, by serving as an initial editor of the text and removing writers’ ability to self-moderate. They also expressed frustration at being unable to use LLMs to write about many themes common in comedy writing, including sexually-suggestive material (p3, p10, p13, p16), dark humor (p8), and offensive jokes (p10, p15, p20).

4.2.1 Moderation and safety filtering limits writers’ creative agency. Participants explained that self-moderation is a critical part of the writing process, and expressed frustration that moderation tools on LLMs interfered with that process: “the creative process is about going through stages of ‘this material isn’t good enough, it’s not right, or it’s offensive, it’s marginalizing people, I need to make it more acceptable.’ And I think AI models are beginning to do that before you have a chance to explore” (p10). “It probably would be more interesting for a writer if there would be less moderation, because you can do the moderation in your own prompts. A writer is going to moderate themselves. If you’re writing with an AI, if you don’t like the bad stuff that it writes, you won’t use it” (p19). Participants described how this external source of moderation limited the creative control of the human writer: “it’s interesting if there’s less moderation, because... the end result is moderated in the way that the author wants it to be” (p19). Some explained that if filters were necessary, the user should still have some degree of control over them: “I feel like the opportunity to set the filters should still be at the performers’ end” (p12).

4.2.2 Moderation limits writers’ ability to use AI with their preferred subject matter. In addition to affecting the creative writing process, participants commented that moderation tools limited their ability to write freely on subject matter of their choice. “Comedy’s about pushing the boundaries or pointing out how ridiculous something is, on the fringe of what’s acceptable. And so when a lot of your inputs are limited that way, it’s gonna make it harder to be what I consider funny” (p14). Multiple participants expressed difficulty using LLMs to write potentially-offensive humor: “I was a little bit disappointed that it wasn’t a little bit offensive. It could have been a fun scene” (p20). Participant p8 described challenges using LLMs for dark humor: “a lot of my stuff can have dark bits in it. And then it wouldn’t write me any dark stuff, because it sort of thought I was going to commit suicide. So it just stopped giving me anything.”

4.3 Marginalization of minority identities

Many participants commented on the challenges of using LLMs to write content which reflected perspectives and identities outside of the “Western” (p11, p18), “white” (p14), “heteronormative” (p10), “male” (p5, p15) mainstream. They attributed these difficulties to the moderation applied to model outputs; the data used to train the models; and prompting or other instruction-tuning techniques that aimed to “generalize” model outputs for a broad audience.

4.3.1 LLM-generated outputs reflect a particular set of ethics, values and norms. Participants expressed concern over the values reflected in the outputs of LLMs, and found the models less useful when those values did not reflect their own cultures. Speaking as a member of the “majority,” participant p20 observed that “we have a set of views of what we think is good, and our norms, and it just repeats, it behaves within these norms.” P11 questioned “whose ethics [and norms] are being enforced on these large language models?”, suggesting these were Western ones.

4.3.2 When prompted to reflect non-dominant identities, LLMs made only shallow adjustments. Many participants described their attempts to steer LLM outputs away from dominant narratives and stereotypical characters, and their dissatisfaction with the results. They explained that models’ adjustments in response to these prompts were surface-level, failing to truly reflect other identities, and described issues with the names of characters introduced by the LLM: “when I switched the whole conversation to Indian languages, it didn’t automatically change the names. It still was Maria, Evan, Lexi” (p18). “I specified that the scene was set in Sweden, but the names were not typically Swedish” (p20). P18 described their attempts to “Indianize” the model’s outputs by introducing Indian languages into the prompt: “it seemed very artificial from the perspective of just using languages, but it was not truly embedding itself into the culture” (p18).

4.3.3 Moderation makes LLMs less useful to minorities by suppressing content by and about marginalized identities. Many participants expressed frustration that their prompts would be rejected when they asked the model to generate content from the perspective of someone of their identity. To them, the model not only seemed less capable of generating outputs which felt authentic to people from non-majority groups, but explicitly “othered” them by implying that any content produced by someone of their background was potentially dangerous or non-inclusive. Participant p6 expressed frustration at the models’ delineation of what is acceptable and what needs to be sanitized: “it’s taking out the gay language of it to make it more appealing or more palpable. This is the whole premise of my show, who decides what is PC in the first place?” Similarly, participant p1 found that the model would not generate outputs from her point of view: “it’s all so politically correct–I wrote a comedic monologue about Asian women, and it says, ‘As an AI language model, I am committed to fostering a respectful and inclusive environment’.” Participant p5 highlighted the unevenness of this treatment of identity, remarking that the model was “uncomfortable writing a monologue about an Asian woman, but I just asked it to write a comedy monologue from the perspective of a white man, and it did it” (p5).

4.3.4 Moderation makes LLMs less useful to minorities by suppressing topics important to people from marginalized identities. Participants described that not only were the models unlikely to generate content from marginalized perspectives, but also refused to engage with topics that might be important to people from those backgrounds. P14 was frustrated with “having to use the language of the oppressor... I couldn’t say ‘white supremacy’ or I couldn’t say ‘terrorist.’ I had to find another way to say the same thing, because it couldn’t work around those limitations” (p14). They posited that because these controversial topics were more likely to be important to people of color, this moderation introduced “just an extra hurdle, and I think people of color, and, I think, people coming from outside of a UN-type lens, they’re gonna run into those problems.”

4.4 Fundamental limitations of AI in contrast to human writers

While most participants felt the difficulties introduced by the moderation could be alleviated by different approaches to safety filtering or instruction tuning, they also commented on more fundamental limitations of LLMs. They posited that LLMs would never be able to create human-level comedy, due to their inability to draw on personal experience, their lack of perspective, and their lack of context and situational awareness—features that are critical to good comedy.

4.4.1 AI’s inability to draw on personal experience is a fundamental limitation. Many participants described the centrality of personal experience in good comedy, which enables comedians to draw upon their memories, acquaintances, and beliefs to construct an authentic and engaging narrative: “very much related to who I am and my lived experience, as well as the place I am in” (p11). “I always draw from my experience, or my memories, or something someone said that stayed with me for many, many years – and I think that’s what makes literature interesting and unique” (p20). This experience, some participants said, enables them to effectively calibrate their writing: “I have an intuitive sense of what’s gonna work and what’s gonna not work based on so much lived experience and studying of comedy, but it is very individualized and I don’t know that AI is ever gonna be able to approach that” (p14). By contrast, LLMs could not perform such calibration: “it really had no idea how to punch up or punch down. It had no perspective, so it couldn’t take any risks in terms of jokes” (p6). Participants emphasized that perspective and point of view were uniquely human traits, saying that “human comedians... add much more nuance and emotion and subtlety” due to their lived experience and relationship to the material (p16).

4.4.2 AI’s lack of context (understanding of its audience and location) is a fundamental limitation. In addition to their lack of personal experience, participants described LLMs’ lack of awareness of the context in which their comedic material would be delivered as another fundamental limitation. Multiple participants commented on the importance of understanding the effects of culture and geography on what material would land with an audience: “the kind of comedy that I could do in India would be very different from the kind of comedy that I could do in the UK, because my social context would change” (p11); “what works in LA isn’t gonna work in Raleigh, or what’s working in Chicago is not going to work in Albuquerque” (p17). This poses a fundamental challenge for LLMs, they argued, because the models lack any context beyond what is provided to them in the prompt: “comedy is all about subtext, and a lot of that subtext can be unspoken, about who’s on stage, what environment they’re in” (p14). To participant p11, this made the LLM unable to adapt its material effectively, because it is “everywhere and nowhere all at once” (p11).

4.4.3 As a text-only medium, (current) LLMs are missing critical aspects of comedy: delivery and surprise. Many participants commented on the importance of delivery in a quality comedy routine: “any written text could be an okay text, but a great actor could probably make this very enjoyable” (p19). Given that current widely-available LLMs are primarily accessible through a text-based chat interface, they felt that the utility of these tools was limited to only a subset of the domains needed for producing a full comedic product. This, too, some participants argued, illustrates the fundamental need for humans in the comedy generation process: “AI is just generating content” (p18). A few participants further attributed their lack of success at generating humorous outputs with LLMs to the statistical methods by which the models were trained. By simply learning to predict the most likely next token of text, they hypothesized, models will be unable to produce the surprising and unique moments that are hallmarks of comedy: “the whole idea of humor is that it is surprising, and it is so human, and AI is only adept at regurgitating tropes” (p15). LLMs cannot produce truly original content, participant p14 argued, “because the context has already been written by other people.”

4.5 Concerns around data sources used to train LLMs

Participants also expressed various concerns pertaining to the data sources used to train current widely-available LLMs. They discussed the ethical issues with training models on copyrighted works; the possibilities of unintentionally plagiarizing works on which the models were trained; and the lack of diversity represented in the training data. However, they also acknowledged the importance of training data in model performance, and many expressed uncertainty around how to balance their ethical concerns with their desire for more effective and equitable models.

4.5.1 Participants criticized the training of models on copyrighted data, but acknowledged its positive impact on performance. Participants were acutely aware of the litigation over the training of models on copyrighted data that was pending at the time of the focus groups. Some participants, including participant p15, expressed sympathy for those whose work was included in the training data: “Sarah Silverman... spent years honing her voice and then an AI just scraped her content, and now you can tell an AI to write in the style of Sarah Silverman [...] I don’t think it’s ethical” (p15). Other participants took a more balanced view, echoing these concerns about training on copyrighted works while acknowledging the benefits: “I think they are overtrained on copyrighted work... but on the other hand, if we didn’t put all that stuff in there, it wouldn’t work as well” (p19). Some participants feared unintentionally plagiarizing authors whose works were included in the training data: “I cannot tell if someone has written something like this before. We know that it’s using statistics from previous texts to recreate this, that it can in principle only be based on what already exists” (p20). Some participants had suggestions on licensing: “for music, licensing is something tracked and recognized” (p10) and “not against the tools, but I think there needs to be licensing agreements for the work that should be compensated” (p4).

4.5.2 The lack of diversity in the training data perpetuates majority viewpoints, to the detriment of people from underrepresented identities. Multiple participants hypothesized that they struggled to get the LLMs to produce authentic-sounding content because the models were trained heavily on data that did not represent people of their identity. A few participants described unsuccessful attempts to replicate content in the style of famous non-white comedians and writers (p11, p14). “If you’re only getting biased inputs, you’re only getting the writing from a really biased lens. So there’s not enough black voices in there to make an accurate black sounding voice” (p14). “As someone who lives in the global south, I am not looking to make a play in the form of Shakespeare or other Western literature. I am looking to write about an Indian author. I find all these language models really lacking in references or authorship styles from this part of the world” (p11). However, participants also questioned whether, given the other ethical concerns about training data, more training on these underrepresented voices was indeed a good thing: “should this AI be able to completely replicate a black comedic voice? There’s a line between being able to replicate a voice and then immediately going into cultural appropriation. I don’t know how a large language model could ever effectively walk that line” (p17).

Authors:

(1) Piotr W. Mirowski∗, Google DeepMind London, UK (piotrmirowski@deepmind.com);

(2) Juliette Love∗, Google DeepMind London, UK (juliettelove@deepmind.com);

(3) Kory Mathewson, Google DeepMind Montréal, QC, Canada (korymath@deepmind.com);

(4) Shakir Mohamed, Google DeepMind London, UK (shakir@deepmind.com).


This paper is available on arXiv under a CC BY 4.0 license.