Addressing comparability and retrieval issues in conversation corpora: A case study on the spoken British National Corpora (1994 and 2014), using the past perfect
This paper addresses issues in the comparison and analysis of conversation corpora. We focus on the demographically sampled spoken portions of the British National Corpora (BNC), representing British English in 1994 and 2014, for the purpose of studying recent language change and sociolinguistic variation. Issues of comparability and representativeness of the two BNCs have been raised before (see Love 2020), with several measures taken to ensure backwards compatibility of the Spoken BNC2014 with its 1994 counterpart. However, we believe that further considerations and solutions relating to sampling, transcription, annotation, and corpus querying merit attention. The BNClab subcorpus (Brezina et al. 2018a), a sociolinguistic judgment sample derived from the parent BNCs, provides a very promising basis for analysis, although its mixed geographical representativeness arguably affects cross-time comparability. To address this, we propose modifications to the BNClab subcorpus that improve comparability. We then use the modified sample to address issues in the retrieval and quantification of grammatical constructions in the spoken BNCs, namely a) determining an appropriate frequency metric, b) retrieving a comprehensive but manageable set of examples from ‘messy’ spoken data, and c) handling transcription inaccuracies. Finally, we discuss the case study findings and wider methodological implications for users of these corpora.
History
Author affiliation: College of Social Sci Arts and Humanities/Education
Version: VoR (Version of Record)
Published in: Research in Corpus Linguistics
Publisher: Asociación Española de Lingüística de Corpus
eISSN: 2243-4712
Copyright date: 2024
Available date: 2024-06-19
Publisher DOI:
Language: en
Publisher version:
Deposited by: Dr Nicholas Smith
Deposit date: 2024-04-23
Rights Retention Statement: No