Authorship attribution, constructed languages, and the psycholinguistics of individual variation

Abstract: ‘Authorship attribution’, the problem of determining the author (or the author's attributes, such as gender, age, native language, or other characteristics) by examining the writing style of an unknown work, is an important problem in applied linguistics. The theory of authorship attribution is relatively straightforward: language is an underspecified system, and people can pick and choose among several different ways to describe the same thing.

A method for content analysis applied to newspaper coverage of Japanese personalities in Brazil and Portugal

Abstract: This article reports a study that compared how Portuguese and Brazilian newspapers covered Japan in the 1990s. The research was based on 9,152 texts related to Japan published in one Portuguese and one Brazilian newspaper from that era. This is a much larger sample than those used in existing text content analysis studies for Portuguese. To handle this large sample, selected concordances and distributions obtained from the corpora were semi-automatically analyzed. Results revealed that the most frequently mentioned Japanese personalities were politicians.

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

Springmann and Lüdeling describe the results of a case study that applies Neural
Network-based Optical Character Recognition (OCR) to scanned images of books
printed between 1487 and 1870 by training the OCR engine OCRopus on the RIDGES
herbal text corpus.

CorpusTracer: A CIDOC database for tracing knowledge networks

Abstract: In our research, we study mechanisms of knowledge dissemination based on the structural and social networks surrounding the edition history of a single text: the Tractatus de sphaera by Johannes de Sacrobosco. By applying methods from network analysis, we investigate how specific commentaries on the text circulated, which actors were responsible for them and what factors supported or hindered the spread of specific kinds of knowledge.

Distributed language representation for authorship attribution

Abstract: Distributed language representation (deep learning) has been applied successfully in a range of natural language processing applications. Using this model, we propose and implement two new authorship attribution classifiers. In this approach, a vector-space representation is generated for each author or disputed text from words and their nearby context. To determine the authorship of a disputed text, the cosine similarity between its vector representation and each author's can be compared.
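The final step described in this abstract, comparing vectors by cosine similarity, can be illustrated with a minimal sketch. It is not the authors' classifier: the function names `cosine_similarity` and `attribute`, the author labels, and the three-dimensional toy vectors are all assumptions made for illustration, and the sketch takes the vectors as given rather than learning them from text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def attribute(disputed_vec, author_vecs):
    """Return the candidate author whose vector is most similar to the disputed text."""
    return max(
        author_vecs,
        key=lambda name: cosine_similarity(disputed_vec, author_vecs[name]),
    )


# Hypothetical toy vectors; in practice these would come from a trained
# distributed representation, which this sketch does not implement.
author_vecs = {
    "author_a": [0.9, 0.1, 0.0],
    "author_b": [0.1, 0.8, 0.3],
}
disputed = [0.8, 0.2, 0.1]

print(attribute(disputed, author_vecs))
```

With these toy values the disputed vector points in nearly the same direction as the vector for `author_a`, so that label is returned. Cosine similarity depends only on direction, not on vector length, which is one reason it is a common choice when texts differ in size.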

Smoke and mirrors: Tracing ambiguity in texts

Abstract: ‘The corruption of man is followed by the corruption of language’ (Emerson, Nature and Selected Essays. Ziff, L. (ed.). London: Penguin Classics, 2003, p. 51). Corruption of language is our target. Sounding ambiguity in crisis writings versus fictions, we use tools that signal shades of meaning that allow for alternative reading. From Empson (Seven Types of Ambiguity.


