Is GPT-4 Good Enough to Evaluate Jokes?
In this paper, we investigate the ability of large language models (LLMs), specifically GPT-4, to assess the funniness of jokes in comparison to human ratings. We use a dataset of jokes annotated with human ratings and explore different system descriptions in GPT-4 to imitate human judges with various types of humour. We propose a novel method to create a system description using many-shot prompting, providing numerous examples of jokes and their evaluation scores. Additionally, we examine the performance of different system descriptions when given varying amounts of instructions and examples on how to evaluate jokes. Our main contributions include a new method for creating a system description in LLMs to evaluate jokes and a comprehensive methodology to assess LLMs' ability to evaluate jokes using rankings rather than individual scores.
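The abstract does not give implementation details, but the many-shot approach it names can be illustrated with a short sketch. Below is a minimal, hypothetical Python example assuming the OpenAI Chat Completions API; the `rate_joke` helper, the placeholder jokes, the scores, and the 1-5 scale are illustrative assumptions, not taken from the paper's dataset or prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical (joke, human rating) pairs; the paper uses a human-annotated
# dataset, but these placeholders only illustrate the prompt structure.
examples = [
    ("Why did the scarecrow win an award? He was outstanding in his field.", 4),
    ("I used to be a banker, but I lost interest.", 3),
    ("Parallel lines have so much in common. A shame they'll never meet.", 4),
]

# Build a "system description" by packing many scored examples into the
# system message, so the model imitates a human judge's scoring behaviour.
example_lines = "\n".join(f"Joke: {joke}\nScore: {score}" for joke, score in examples)
system_description = (
    "You are a human judge rating jokes on a 1-5 funniness scale. "
    "Here are jokes previously rated by judges like you:\n\n"
    f"{example_lines}\n\n"
    "Rate each new joke with a single integer score."
)

def rate_joke(joke: str) -> str:
    """Ask GPT-4 for a funniness score, conditioned on the many-shot system description."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_description},
            {"role": "user", "content": f"Joke: {joke}\nScore:"},
        ],
        temperature=0,  # keep scoring as deterministic as possible
    )
    return response.choices[0].message.content.strip()

print(rate_joke("I told my computer a joke about UDP. I'm not sure it got it."))
```

In practice the paper scales this idea up to numerous examples per system description and compares configurations with varying amounts of instructions and examples; the sketch shows only the prompt-construction pattern.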
Author affiliation
School of Computing and Mathematical Sciences, University of Leicester
Source
14th International Conference on Computational Creativity 2023
Version
- AM (Accepted Manuscript)