Posted on 2023-11-06, 11:34, authored by Luis Fabricio Góes, Piotr Sawicki, Marek Grzes, Dan Brown, Marco Volpe
<p>In this paper, we investigate the ability of large language models (LLMs), specifically GPT-4, to assess the funniness of jokes in comparison to human ratings. We use a dataset of jokes annotated with human ratings and explore different system descriptions in GPT-4 to imitate human judges with various types of humour. We propose a novel method to create a system description using many-shot prompting, providing numerous examples of jokes and their evaluation scores. Additionally, we examine the performance of different system descriptions when given varying amounts of instructions and examples on how to evaluate jokes. Our main contributions include a new method for creating a system description in LLMs to evaluate jokes and a comprehensive methodology to assess LLMs' ability to evaluate jokes using rankings rather than individual scores.</p>
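The many-shot prompting idea described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration only: the helper function, the instruction text, and the example jokes and scores are invented for demonstration and are not the paper's actual data or prompts. The sketch assumes the system description is a plain text block of an instruction followed by many scored joke examples.

```python
# Hypothetical sketch: build a system description for an LLM joke judge
# by concatenating an instruction with many (joke, human score) examples.
# All jokes, scores, and wording below are placeholders, not the paper's data.

def build_system_description(examples, instruction):
    """Assemble a system prompt: instruction first, then scored examples."""
    lines = [instruction, ""]
    for joke, score in examples:
        lines.append(f"Joke: {joke}")
        lines.append(f"Score: {score}")
        lines.append("")  # blank line between examples
    return "\n".join(lines).rstrip()

# Placeholder examples standing in for a large set of human-rated jokes.
examples = [
    ("I told my computer a joke. It crashed.", 3),
    ("Why did the scarecrow win an award? He was outstanding in his field.", 4),
]
instruction = (
    "You are a judge of humour. Rate each joke from 1 (not funny) "
    "to 5 (hilarious), consistent with the scored examples below."
)

system_description = build_system_description(examples, instruction)
print(system_description)
```

In a real evaluation setup, the resulting string would be passed as the system message to the model, and the model's scores for held-out jokes would then be compared to human rankings rather than to individual scores, as the abstract describes.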