Attempting to recognise the writing foibles and habits of individuals is an old pursuit in forensics, one which grew up in tandem with the now-abandoned study of typewriter characteristics. In investigations, identifying an otherwise anonymous author by their writing style represents soft evidence that can lead to harder proof of provenance.
Twitter, one of the most investigated sources of social communication (for investigating authorities and scientific research groups alike), presents unusual problems in author identification due to the artificial restriction of messages to 140 characters – apparently leaving the author little room for the literary flourishes which might distinguish their work.
Though Twitter is considering abandoning the legacy limit, and has already eased it slightly, author signatures remain difficult to ascertain in such a cramped authoring environment.
However, researchers at Canada’s McGill University have developed a new approach which uses neural network processing to help individuate authors within these ascetic limits – limits which Twitter originally inherited from SMS text-messaging. Their paper, Learning Stylometric Representations for Authorship Analysis [PDF], surveys authorship analysis, a field of research dating back to the 19th century, and proposes the use of neural networks to identify patterns which can actually be reinforced by the artificial restrictions at play.
The innovation in the research lies in abandoning dataset-specific and manual processes in favour of automated analyses, and in leveraging the lexical n-gram model, which has been shown to be particularly effective in identifying author gender, among other characteristics.
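The paper builds neural representations on top of such lexical features; as a rough illustration of the underlying n-gram idea only (a sketch, not the authors’ actual pipeline), character n-grams can be extracted and profiled like this:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return a frequency count of overlapping character n-grams.

    Character n-grams are a staple lexical feature in stylometry:
    they capture habitual spellings, abbreviations and punctuation
    without any language-specific tokenisation.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_profile(texts, n=3, top_k=50):
    """Build a simple author profile: the top_k most frequent n-grams
    across all of an author's texts (e.g. tweets)."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    return dict(counts.most_common(top_k))
```

In a classical stylometric setting, two such profiles would be compared with a distance measure; the McGill work instead feeds features of this kind into a neural network that learns the representation itself.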
Signifiers at play include everyday colloquialisms, such as describing good weather as ‘nice’ in preference to any other possible adjective, as well as the kind of Twitter-specific abbreviations and linguistic abuses which can prove characteristic of a user. In the latter case, machine learning faces an interesting task, since repeated letters – as in ‘niceeeee’ – can be confused with more valid and regular uses, such as ‘IEEE’.
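The ‘niceeeee’ versus ‘IEEE’ confusion can be illustrated with a simple heuristic (again a sketch, not the authors’ method): treat a token as stylistic elongation only when it is not an all-caps acronym, then collapse the repeated run:

```python
import re

def is_elongated(token):
    """Heuristic: a token counts as stylistic elongation if it contains a
    letter repeated three or more times and is not written entirely in
    upper case. All-caps tokens such as 'IEEE' are assumed to be acronyms
    and left alone."""
    if token.isupper():
        return False
    return re.search(r'([a-z])\1{2,}', token.lower()) is not None

def normalise(token):
    """Collapse runs of 3+ repeated letters to one: 'niceeeee' -> 'nice'."""
    if is_elongated(token):
        return re.sub(r'([a-z])\1{2,}', r'\1', token.lower())
    return token
```

A real system would need more signal than case alone (a user typing ‘ieee’ in lower case would defeat this check), which is precisely the kind of ambiguity the learned representations are meant to resolve.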
Critically, the researchers’ current study is limited to the English language, and the team acknowledges that the approaches it is developing will need to cover other tongues in order to achieve validity as a forensic tool.