A new paper from researchers in Europe and the U.S. seeks to identify trends and indicators of political cohesion in a formidable dataset of speeches and debates from the UK’s House of Commons between 1975 and 2015. Complex Politics: A Quantitative Semantic and Topological Analysis of UK House of Commons Debates [PDF] addresses the shortfall of empirical research in political science by applying quantitative semantic techniques and topological analysis to a dataset obtained via the transparency website theyworkforyou.com.

The subset qualifies as Big Data; though it excludes available material between 1935 and 1974, it comprises transcriptions or submissions of 3.7 million individual speeches over forty years of UK Parliament debate, processed by Python using Parliamentary sessions as unit points.
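To make the idea of using Parliamentary sessions as the unit of analysis concrete, here is a minimal Python sketch. It is not the authors' pipeline: the record layout, the session labels and the sample texts are all invented for illustration, and the helper name `group_by_session` is hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (session label, speaker, speech text).
# None of these values come from the paper or from theyworkforyou.com.
speeches = [
    ("1979-80", "Speaker A", "The health service needs funding"),
    ("1979-80", "Speaker B", "Education funding must rise"),
    ("1980-81", "Speaker A", "Welfare reform is overdue"),
]

def group_by_session(records):
    """Merge all speeches from one session into a single lower-cased document."""
    sessions = defaultdict(list)
    for session, _speaker, text in records:
        sessions[session].append(text.lower())
    # One document per session, in the order sessions were first seen.
    return {session: " ".join(texts) for session, texts in sessions.items()}

session_docs = group_by_session(speeches)
```

Each session then becomes one time-stamped document that later steps can compare against its neighbours.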

The paper – arguably a study of gas – declares itself an exploratory attempt to analyse political theory assumptions didactically, but even this initial foray reveals shifts and patterns over four decades of discussion in the Commons. One is that both Labour and Conservative speakers have a historical tendency to promote topics which, though evidently of interest to the speaker, are not widely debated in Parliament itself. Additionally, the research found that topics formed fewer clusters during periods of political stability:

‘The first thing one should notice is that the number of clusters in the network varies significantly over time. This is due to the clustering algorithm. To better study the results, we identified the political era which each session belongs to. Fewer clusters were detected during periods of political stability mainly in the years in which Margaret Thatcher (1979-1990), and Tony Blair (1997-2007) held office.’

'Complex Politics: A Quantitative Semantic and Topological Analysis of UK House of Commons Debates'


Analysis via a topological Mapper reveals the rise and subsidence of several core topics. The topic ‘health care’, for instance, declines dramatically throughout the 1980s and early 1990s, whilst the topic ‘welfare’ rises to take its place in the same period, and the topic ‘education’ embarks on a fluctuating but nonetheless relatively consistent rise throughout the lifetime of the data set. Additionally, for reasons unspecified, the topic ‘entertainment and media’ falls quite dramatically in importance in the late 1980s – though one could guess that this reflects the unusual prominence it had gained earlier in that decade, whether from the advent of the ‘video nasty’ phenomenon or even from the rise in foreign travel that UK citizens enjoyed in the first years of the ‘yuppie era’.

Within the ‘entertainment’ topic, subsets of word trends are apparent throughout the period covered by the data. Early keywords for ‘entertainment’ included ‘author’, ‘local’ and ‘land’, whilst the latest keyword indicators are ‘pub’, ‘sport’, ‘dog’ and ‘beer’ (actually a less ‘digital’ result than many might have expected).

One topic which goes into clear and inexorable decline from 1975 is ‘regional affairs’, reflecting Britain’s increasing interest in participating in the European sphere. It remains for further analysis to determine why this Eurocentric trend declined in the late 1990s and early 2000s, just as Britain decided not to participate in Europe’s currency unification.

However, the researchers make clear that this initial work represents an attempt to establish new quantitative frameworks and new definition protocols for a field which has hitherto been relegated to the social sciences.

The research group eschewed tag clouds and conclusions derived from simple word counts as falling into the purview of ‘qualitative’ analysis, and instead used a machine learning approach identified as Dynamic Topic Modelling (DTM), which groups co-occurring words into topics and tracks how the word distribution of each topic changes over time.
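The raw signal that topic models build on, words appearing together in the same documents, can be illustrated with a short Python sketch. This is emphatically not DTM itself, which is a probabilistic model; it only counts word pairs per time slice. The slice labels, sample documents and the helper `cooccurrence_by_slice` are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_by_slice(slices):
    """For each time slice, count how many documents contain both words of a pair."""
    result = {}
    for label, docs in slices.items():
        pairs = Counter()
        for doc in docs:
            # Deduplicate and sort so each pair has one canonical ordering.
            words = sorted(set(doc.lower().split()))
            pairs.update(combinations(words, 2))
        result[label] = pairs
    return result

# Invented toy data: two documents per time slice.
slices = {
    "1980s": ["hospital nurse funding", "hospital nurse pay"],
    "2000s": ["school teacher funding", "school teacher pay"],
}

counts = cooccurrence_by_slice(slices)
top_1980s = counts["1980s"].most_common(1)[0]
```

In this toy data the strongest pair shifts from ‘hospital’ and ‘nurse’ in the first slice to ‘school’ and ‘teacher’ in the second, which is the kind of drift in vocabulary that a dynamic topic model is designed to capture in a more principled way.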

The group anticipates developing sentiment analysis techniques from this work, and will attempt to ascertain whether topical and sentiment information derived from the unstructured data can be successfully used as a predictor of political success.