Markov Chain Monte Carlo for generating ranked textual data

Cerqueti, R; Ficcadenti, V; Dhesi, G; Ausloos, M

File(s) under permanent embargo

Reason: 12-month publisher embargo on AAM - requested from author

journal contribution

posted on 2022-10-11, 13:39 authored by R Cerqueti, V Ficcadenti, G Dhesi, M Ausloos

This paper addresses a central theme in applied statistics and information science: the assessment of the stochastic structure of rank-size laws in text analysis. We rank the words in a corpus by frequency, in descending order. The starting point is that ranked data generated in linguistic contexts can be viewed as realisations of a discrete-state Markov chain whose stationary distribution follows a discretisation of the best-fitted rank-size law. The methodological toolkit is Markov Chain Monte Carlo, specifically the Metropolis–Hastings algorithm. The theoretical framework is applied to the rank-size analysis of the hapax legomena occurring in the speeches of the US Presidents. We report a large number of statistical tests that support the consistency of our methodological proposal. In pursuing these aims, we also offer arguments supporting the view that hapaxes are rare (“extreme”) events resulting from memoryless-like processes. Moreover, we show that the considered sample has the stochastic structure of a Markov chain of order one. Importantly, we discuss the versatility of the method, which we consider suitable for deriving similar results in other applied science contexts.
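
To make the idea in the abstract concrete, the following is a minimal sketch of a Metropolis–Hastings sampler on a finite set of ranks, and not the authors' implementation. It assumes a Zipf-Mandelbrot-type law as the target (the abstract does not name the fitted law), a uniform proposal over all ranks, and illustrative parameter values; the names `rank_size_weight`, `metropolis_hastings_ranks`, `alpha` and `beta` are hypothetical.

```python
import random
from collections import Counter


def rank_size_weight(rank, alpha=1.0, beta=0.0):
    """Unnormalised weight of a rank under an assumed Zipf-Mandelbrot-type law."""
    return 1.0 / (rank + beta) ** alpha


def metropolis_hastings_ranks(n_ranks, n_steps, alpha=1.0, beta=0.0, seed=0):
    """Generate a first-order Markov chain on ranks 1..n_ranks whose
    stationary distribution is proportional to rank_size_weight."""
    rng = random.Random(seed)
    state = rng.randint(1, n_ranks)  # arbitrary starting rank
    chain = []
    for _ in range(n_steps):
        # Uniform proposal over all ranks: symmetric, so the Hastings
        # correction cancels and only the ratio of target weights remains.
        proposal = rng.randint(1, n_ranks)
        ratio = (rank_size_weight(proposal, alpha, beta)
                 / rank_size_weight(state, alpha, beta))
        if rng.random() < min(1.0, ratio):
            state = proposal  # accept the move; otherwise stay put
        chain.append(state)
    return chain


# Usage: simulate ranks and compare empirical with target frequencies.
chain = metropolis_hastings_ranks(n_ranks=10, n_steps=20000)
counts = Counter(chain)
norm = sum(rank_size_weight(r) for r in range(1, 11))
for r in range(1, 11):
    print(r, counts[r] / len(chain), rank_size_weight(r) / norm)
```

Because each new state depends only on the current one, the simulated sequence is a Markov chain of order one by construction. In practice an initial burn-in segment would usually be discarded before comparing the sample with the target law.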

History

Author affiliation

School of Business, University of Leicester

Version

VoR (Version of Record)

Published in

Information Sciences

Volume

610

Pagination

425 - 439

Publisher

Elsevier BV

issn

0020-0255

Copyright date

2022

Publisher DOI

https://doi.org/10.1016/j.ins.2022.07.137

Language

en

Publisher version

https://www.sciencedirect.com/science/article/pii/S0020025522008271?via=ihub
