Distributed language representation for authorship attribution

AbstractDistributed language representation (deep learning) has been applied successfully in different applications in natural language processing. Using this model, we propose and implement two new authorship attribution classifiers. In this perspective, a vector-space representation can be generated for each author or disputed text according to words and their nearby context. To determine the authorship of a disputed text, the cosine similarity between vector representations can be applied. The proposed strategies can be adapted without any difficulty to different languages (such as English and Italian) or genres (essays, political speeches, and newspaper articles). Evaluations using the k-nearest neighbors (k-NNs))and based on four test collections (the Federalist Papers, the State of the Union addresses, the Glasgow Herald, and La Stampa newspapers) indicate that the distributed language representation preforms well, providing sometimes better effectiveness than state-of-the-art methods such as k-NN, nearest shrunken centroids, chi-square, Delta, latent Dirichlet allocation, or multi-layer perceptron classifier.