University of Leicester

Addressing comparability and retrieval issues in conversation corpora: A case study on the spoken British National Corpora (1994 and 2014), using the past perfect

Version 2 2024-06-19, 10:58
Version 1 2024-04-25, 13:38
journal contribution
posted on 2024-06-19, 10:58 authored by Nicholas Smith, Cristiano Broccias, Cathleen Waters

This paper addresses issues in comparison and analysis of conversation corpora. We focus on the demographically-sampled spoken portions of the British National Corpora (BNC), representing British English in 1994 and 2014, for the purposes of studying recent language change and sociolinguistic variation. Issues of comparability and representativeness of the two BNCs have been raised before (see Love 2020), with several measures taken to ensure backwards compatibility of the Spoken BNC2014 with its 1994 counterpart. However, we believe further considerations and solutions merit attention, relating to sampling, transcription, annotation, and corpus querying. The BNClab subcorpus (Brezina et al. 2018a), a sociolinguistic judgment sample derived from the parent BNCs, provides a very promising basis for analysis, although arguably its mixed geographical representativeness affects cross-time comparability. To address this, we make some proposals for modifying the BNClab subcorpus to improve comparability. Then, we use the modified sample to address issues in retrieval and quantification of grammatical constructions in the spoken BNCs, namely a) determining an appropriate frequency metric, b) retrieving a comprehensive but manageable set of examples from ‘messy’ spoken data, and c) handling transcription inaccuracies. Finally, we discuss the case study findings and wider methodological implications for users of these corpora.

History

Author affiliation

College of Social Sciences, Arts and Humanities/Education

Version

  • VoR (Version of Record)

Published in

Research in Corpus Linguistics

Publisher

Asociación Española de Lingüística de Corpus

eISSN

2243-4712

Copyright date

2024

Available date

2024-06-19

Language

en

Deposited by

Dr Nicholas Smith

Deposit date

2024-04-23

Rights Retention Statement

  • No
