Measuring syntactical variation in Germanic texts

Abstract

We present two new measures of syntactic distance between languages. First, we present the ‘movement measure’, which measures the average number of words that have moved in sentences of one language compared to the corresponding sentences in another language. Second, we introduce the ‘indel measure’, which measures the average number of words inserted or deleted in sentences of one language compared to the corresponding sentences in another language. We compared the two measures to the ‘trigram measure’ introduced by Nerbonne & Wiersma (2006, A Measure of Aggregate Syntactic Distance. In Nerbonne, J. and Hinrichs, E. (eds.) Linguistic Distances Workshop at the joint conference of International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, July, 2006, pp. 82–90). We correlated the results of the three measures and found a low correlation between the movement and indel measures, indicating that the two measures capture different kinds of linguistic variation. We found a high correlation between the movement measure and the trigram measure. The results of all three measures suggest that English is syntactically a Scandinavian language. Because of our unique database design, we were able to detect asymmetric relationships between the languages. All three measures suggest that asymmetric syntactic distances could partly explain why native speakers of Dutch understand German texts more easily than native speakers of German understand Dutch texts (Swarte 2016).
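
To make the two new measures concrete, the following Python sketch shows one way they could be computed. It is an illustration under stated assumptions, not the procedure used in the study. It assumes that each sentence pair comes with a word alignment given as (source index, target index) pairs. It counts a word as moved when it falls outside the longest run of aligned words that keep their relative order, which is an assumption made here and not a definition taken from the text. The function names and the two example sentence pairs are hypothetical.

```python
from bisect import bisect_left


def indel_count(src_len, tgt_len, alignment):
    """Words without a counterpart: unaligned source words plus unaligned target words."""
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    return (src_len - len(aligned_src)) + (tgt_len - len(aligned_tgt))


def moved_count(alignment):
    """Aligned words that must move to restore the source order (assumed definition).

    Computed as the number of aligned words minus the length of the longest
    strictly increasing subsequence of target positions.
    """
    tgt_positions = [j for _, j in sorted(alignment)]
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for j in tgt_positions:
        k = bisect_left(tails, j)
        if k == len(tails):
            tails.append(j)
        else:
            tails[k] = j
    return len(tgt_positions) - len(tails)


def average(values):
    """Mean over sentence pairs; 0.0 for an empty corpus."""
    return sum(values) / len(values) if values else 0.0


# Illustrative sentence pairs (not taken from the study's database).
# Each entry: (source words, target words, alignment as index pairs).
PAIRS = [
    (
        "ik heb het boek gelezen".split(),
        "I have read the book".split(),
        [(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)],
    ),
    (
        "hij is aan het lezen".split(),
        "he is reading".split(),
        [(0, 0), (1, 1), (4, 2)],
    ),
]

movement_measure = average([moved_count(a) for _, _, a in PAIRS])
indel_measure = average([indel_count(len(s), len(t), a) for s, t, a in PAIRS])
```

In the first pair only ‘gelezen’ changes position relative to the other aligned words, and in the second pair ‘aan’ and ‘het’ have no counterpart, so the two counts respond to different phenomena. The counts in this sketch are symmetric by construction. Detecting the asymmetric relationships mentioned above would require direction-specific data, which this sketch does not model.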