A Data-Driven Dive into UK Party Conference Leaders' Speeches


Party conferences are a mainstay of British politics, in which politicians, party members and affiliated people descend on a chosen city to set the party agenda, raise funds and attempt to get a soundbite into the mainstream media. The hallmark of each conference is the Leader’s speech, in which the current head of the party aims to appeal to the party base or even attract some new voters through media coverage.

## Data Background

This analysis would not have been possible without the transcripts provided by British Political Speech. They describe themselves as “an online archive of British political speech and a place for the discussion, analysis, and critical appreciation of political rhetoric” and provide speeches dating back to 1895.

For my study, I aim to observe the nuances of party conference leadership speeches from 2010 to 2018. These dates were chosen as they coincide with a change in the British political landscape, following the 2010 election whilst still providing us with enough data to conduct meaningful analysis. For this study, I will only observe the three ‘mainstay’ political parties: the Conservatives, the Labour Party and the Liberal Democrats.

Upon importing and tidying the data, we can observe the 5 most used words within the speeches.

| Word | Count |
|------|-------|
| the  | 8777  |
| to   | 5279  |
| and  | 5033  |
| of   | 3898  |
| a    | 3470  |

There are no surprises here. In fact, all five are among the six most used words in the English language according to the Oxford English Corpus, a text corpus comprising over 2 billion words.

Carrying on with these common words would make for a dull analysis, so we will temporarily remove them. We do this using the tidytext package in R, which contains a comprehensive list of stop words. These are common words in the English language that would add nothing to certain parts of our analysis if they were included. A separate dataframe was created to store the non-stopwords, which totalled 56,989 words, meaning that 105,883 words were removed.
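As a rough illustration of this filtering step, here is a minimal Python sketch. It is not the R/tidytext code used for this post: the stop word list, the `tokenize` and `top_words` helpers and the example sentence are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop word list. Real lists (such as the
# one shipped with tidytext) contain hundreds of entries.
STOP_WORDS = frozenset({"the", "to", "and", "of", "a", "in", "is", "we", "our"})


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_words(text, n=5, stop_words=frozenset()):
    """Return the n most common tokens, skipping any in stop_words."""
    counts = Counter(t for t in tokenize(text) if t not in stop_words)
    return counts.most_common(n)


# Made-up sentence, not taken from any of the speeches.
speech = "The people of our country and the government of the people"

print(top_words(speech, n=1))                         # [('the', 3)]
print(top_words(speech, n=1, stop_words=STOP_WORDS))  # [('people', 2)]
```

Even on this toy sentence, the most common word switches from ‘the’ to ‘people’ once the stop words are dropped, which mirrors what happens with the full set of speeches.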

It is worth noting that there is no definitive list of stopwords. Instead, different words would be considered stopwords depending on the context, although we have used a generic list for simplicity. Later in the post, we will see a more nuanced way of handling uninformative words called TF-IDF.

We can now observe the 5 most used non-stopwords.

| Word       | Count |
|------------|-------|
| people     | 1207  |
| country    | 684   |
| government | 598   |
| party      | 577   |
| britain    | 568   |

This is more like what we would have expected the vocabulary of a Leader’s speech to look like.

We can go further and visualise each party’s 100 most commonly used non-stopwords through wordclouds. This is done below in each party’s traditional colour (blue for the Conservatives, red for Labour and yellow for the Liberal Democrats).

We can see that the words identified as most common earlier also appear most often in these wordclouds (denoted by their large size). The only visible differences are the parties’ names, which are particularly prominent for both Labour and the Liberal Democrats. Here we see the obvious flaws of word clouds: they barely allow us to observe any differences between the parties and provide no numerical insight. We will aim to address this weakness later with different methods.

Before we dive into some more detailed text analysis, we could have a quick exploration of the word count for each Leader’s speech.

From this, we can see that Ed Miliband can be quite the rambler at times.

## Text Analysis

### Analysing Sentiment

Basic counts and summaries are great, but with modern data science techniques, we can go much further. For example, we can infer the sentiment of each speech (loosely, how positive or negative it is in tone).

We do this by referencing the contents of each speech against the AFINN lexicon, created by Finn Årup Nielsen. This lexicon assigns an integer value from -5 to 5 to a vast number of English words, with negative numbers indicating negative sentiment and positive numbers indicating positive sentiment. Here we list a random word for each sentiment value.

| Word           | Value |
|----------------|-------|
| son-of-a-bitch | -5    |
| fraudulence    | -4    |
| lunatics       | -3    |
| lethargy       | -2    |
| manipulation   | -1    |
| some kind      | 0     |
| cool           | 1     |
| courtesy       | 2     |
| cheery         | 3     |
| winner         | 4     |
| superb         | 5     |

We can then take an average of the sentiments over all words in each speech and visualise it to see the trend of speech sentiment over time. This is conducted on the dataset with stopwords included, as removing them could distort the sentiment (though note that most stopwords have a neutral sentiment).

As we can see, the speeches are overwhelmingly positive, with the only negative score being Jeremy Corbyn’s 2018 speech to the Labour Party conference. Other notable values include David Cameron’s consistency between 2010 and 2015 for the Conservative Party and the significant jump in positivity when Theresa May took over the Conservative Leadership in 2016.
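The lexicon lookup and averaging can be sketched in a few lines of Python. This is an illustration only, not the code behind the figures above. The toy lexicon reuses the single-word examples from the table earlier in this section, the example sentence is made up, and the sketch assumes that words missing from the lexicon are skipped rather than scored as zero (the post does not say which denominator was used).

```python
import re

# Toy lexicon built from the single-word AFINN examples listed earlier.
# The real AFINN lexicon covers thousands of words.
TOY_LEXICON = {
    "fraudulence": -4, "lunatics": -3, "lethargy": -2, "manipulation": -1,
    "cool": 1, "courtesy": 2, "cheery": 3, "winner": 4, "superb": 5,
}


def mean_sentiment(text, lexicon):
    """Average the scores of the words that appear in the lexicon.

    Assumption: words missing from the lexicon are skipped, not counted
    as zero. Returns None if no word in the text has a score.
    """
    tokens = re.findall(r"[a-z'-]+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else None


# Made-up sentence: scores 5, -2 and -1 average to 2/3.
print(mean_sentiment("A superb speech, free of lethargy and manipulation", TOY_LEXICON))
```

Averaging over every word instead (treating unscored words as zero) would shrink all values towards zero without changing their sign.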

### Term Frequency and Zipf’s Law

We have seen that raw word counts on their own aren’t particularly useful. One flaw of many is that longer texts will naturally have higher word counts for all words. Instead, a more useful metric is how often a certain word (also called a term) appears as a proportion of all words. This is known as term frequency and defined as

$\text{Term Frequency} = \frac{\#\{\text{Occurrences of Term}\}}{\#\{\text{Occurrences of All Words}\}}$
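As a small worked instance of this definition, the Python sketch below computes term frequencies for a made-up list of tokens. The function name `term_frequencies` is purely illustrative and the analysis in this post was not produced with this code.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to its share of all tokens in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


# Made-up example: 'the' accounts for 3 of the 8 tokens.
tokens = "the people and the party and the country".split()
tf = term_frequencies(tokens)
print(tf["the"])  # 0.375
```

Because each value is a share of the total, the frequencies of a document always sum to one, which is what makes speeches of different lengths comparable.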

So that we can compare across parties, we will calculate each term’s frequency as a proportion of all words in every speech given by that party. We start by looking at the distribution of term frequencies for each party.

It should be noted that these graphs have longer tails than are shown. We have truncated the plots to exclude the really popular words such as ‘the’, ‘and’ and ‘to’, making it easier to see the main body of each plot.

The plots all display a similar distribution for each party with many ‘rare’ words and fewer popular words.

It turns out that these long-tailed distributions are common in almost every occurrence of natural language. In fact, George Zipf, a 20th-century American linguist, formalised this observation in what is now known as Zipf’s law, which states that the frequency with which a word appears in a text is inversely proportional to its rank.

$\text{Term Frequency} \propto \frac{1}{\text{Rank}}$

Put simply, the most frequent word will appear at twice the rate of the second most frequent word and at three times that of the third most frequent word.

Zipf’s law is largely accurate for many natural languages, including English (though as always, there are exceptions). For example, in the Brown Corpus of American English text, which contains slightly over 1 million words: ‘the’ appears the most times at ~70000 times, ‘of’ the second most at ~36000 times and ‘and’ the third most at ~29000, as would be roughly expected according to Zipf’s law.

We can attempt to visualise this law for our own text by plotting rank on the x-axis and term frequency on the y-axis, both on log scales.

Why the logs?
By definition, if two values $x$ and $y$ are inversely proportional, then we can find a constant $a$ such that $y = \frac{a}{x}$. Taking logarithms gives $\log(y) = \log(a) - \log(x)$. In other words, $x$ and $y$ are inversely proportional if and only if their logarithms lie on a straight line with a slope of $-1$.

We can see that all three parties have similar text structures that largely obey Zipf’s law. That said, the curve deviates from a straight line at the low-rank end, suggesting that the most popular words in the speeches are being used more often than they would be in natural language generally. Additionally, we would expect a slope of approximately $-1$; by fitting a linear model (shown in grey), we obtain a coefficient which is close to this value.
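To make the slope check concrete, here is a minimal Python sketch that fits a least-squares line to log frequency against log rank. It is an illustration only, not the linear model fitted for the plot above. The function name `zipf_slope` and the idealised counts are assumptions chosen so that the expected slope is known in advance.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two strictly positive word counts, in any order.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count / total) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealised counts of 60 / rank for ranks 1 to 6, so Zipf's law holds exactly.
ideal_counts = [60, 30, 20, 15, 12, 10]
print(round(zipf_slope(ideal_counts), 6))  # -1.0
```

Real word counts will not sit exactly on the line, so a fitted slope merely close to $-1$ is the most we should expect.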

### TF-IDF Analysis

We’ve seen that we can use a list of stop words to filter our data to leave only meaningful words. However, this list is fixed and not linked to our data in any way. We’ve already seen that ‘people’ is used very commonly in our speeches and so doesn’t provide much insight. Could we construct a value that captures the relative frequency of a term among our speeches, in order to see how important a word is to a specific speech compared to the others?

We can indeed. In fact, the work has already been done for us in the form of a value called TF-IDF. It is calculated by multiplying the term frequency (TF) from earlier by a new value called the inverse document frequency (IDF).

$TF\cdot{IDF} = \left(\frac{\#\{\text{Occurrences of Term}\}}{\#\{\text{Occurrences of All Words}\}}\right) \cdot \log \left(\frac{\#\{\text{Documents}\}}{\#\{\text{Documents Containing Term}\}}\right)$

Loosely speaking, TF-IDF asks two questions:

• Is the specific term used more than expected in a given speech?
• Is it rare for a speech to contain the specific term?

If the answer to both of these questions is “yes”, then TF-IDF is large, and the term is considered to be relatively important.
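The formula can be applied directly to a handful of toy documents. The following Python sketch is an illustration under stated assumptions, not the code used for the results in this post: the three tiny ‘speeches’ are made up, the function name `tf_idf` is hypothetical, and the logarithm is taken to be the natural log.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return {doc_name: {term: tf-idf score}} for tokenised documents."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            # Term frequency times inverse document frequency (natural log).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Made-up token lists standing in for three speeches.
speeches = {
    "a": "people brexit brexit country".split(),
    "b": "people country party party".split(),
    "c": "people government country".split(),
}
scores = tf_idf(speeches)
print(max(scores["a"], key=scores["a"].get))  # brexit
print(scores["a"]["people"])                  # 0.0
```

A word such as ‘people’ that appears in every document has an IDF of $\log(1) = 0$, so it scores zero however often it is used. This is how TF-IDF suppresses uninformative words without needing a fixed stop word list.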

We can calculate the TF-IDF score for each word in each speech before using these to find the most ‘important’ word in each speech.

| Year | Conservative | Labour      | Liberal Democrat |
|------|--------------|-------------|------------------|
| 2010 | harry        | recognises  | plural           |
| 2011 | euro         | bargain     | barons           |
| 2012 | rise         | succeeded   | maurice          |
| 2013 | finish       | race        | liberal          |
| 2014 | 40p          | ethic       | dems             |
| 2015 | extremism    | kinder      | liberal          |
| 2016 | plays        | migrants    | brexit           |
| 2017 | dream        | grenfell    | brexit           |
| 2018 | proposal     | palestinian | brexit           |

We obtain some interesting results here. For example, it’s clear to see the Liberal Democrats’ sharp pivot to an anti-Brexit strategy following the referendum of 2016. We can also see that, in 2014, the Conservatives announced their plan to raise the threshold for the 40% income tax rate (known as the 40p rate). Jeremy Corbyn’s plan for a ‘kinder’ politics emerges in his first conference speech as leader in 2015, and the Grenfell Tower disaster features in his 2017 speech.

Names such as ‘Harry’ and ‘Maurice’ that crop up here were intriguing at first glance. These were in reference to ‘Harry Beckough’ and ‘Maurice Reeves’, who were, respectively, a longstanding Conservative member and a furniture shop owner whose premises were burned to the ground during the London riots.

### Complexity Consideration

There are a number of ways that we can measure the complexity of a text, or in this case a speech. For this piece we choose the average number of syllables per word. The syllable counts were obtained using the quanteda package, and we can visualise the results as follows.

We can see profound variations between different leaders in this plot. Ed Miliband and David Cameron, who gave speeches as leaders of Labour from 2010 to 2014 and of the Conservatives from 2010 to 2015 respectively, had much lower complexity than more recent leaders such as Jeremy Corbyn of Labour and Vince Cable of the Liberal Democrats, who together account for the 6 most complex speeches.

We used mean syllable count in this piece as a metric for speech complexity as it is simple for a layperson to understand. That said, there are many more subtle and interesting complexity measures available through quanteda, such as the Flesch–Kincaid readability score.
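To make the mean-syllable metric concrete, here is a rough Python sketch. It does not reproduce quanteda’s syllable counts. Instead it swaps in a simple heuristic that counts runs of consecutive vowels, and both function names are hypothetical.

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count runs of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def mean_syllables(words):
    """Mean estimated syllables per word for a non-empty list of words."""
    return sum(estimate_syllables(w) for w in words) / len(words)


print(estimate_syllables("kinder"))               # 2
print(estimate_syllables("extremism"))            # 3
print(mean_syllables(["kinder", "extremism"]))    # 2.5
```

The heuristic overcounts words with a silent ‘e’ (it gives 2 for ‘race’), which is one reason to prefer a dictionary-backed counter for real analysis.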

### A Different Way of Deciding Elections?

The First Past the Post system is often bemoaned in the UK as being unsuitable for modern-day politics. Now, it is not my place to comment on this system, but if pushed to suggest another, the aforementioned quanteda package does give us an option…

We can calculate the mean Scrabble score per word of each leader’s party conference speech each year! First let us observe the most impressive efforts that the politicians managed:

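| Party            | Year | Leader        | Word            | Scrabble Score |
|------------------|------|---------------|-----------------|----------------|
| Conservative     | 2018 | Theresa May   | czechoslovakia  | 37             |
| Liberal Democrat | 2013 | Nick Clegg    | unequivocally   | 30             |
| Labour           | 2017 | Jeremy Corbyn | overwhelmingly  | 29             |
| Labour           | 2017 | Jeremy Corbyn | democratization | 29             |
| Labour           | 2015 | Jeremy Corbyn | fizzing         | 29             |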

Theresa May managed an incredible score of 37 in 2018 with ‘Czechoslovakia’ but this would of course be disqualified for being a proper noun. As a result, Nick Clegg holds the record with 30 points scored for ‘unequivocally’! We can also visualise the mean score per word as follows.

As we can see, the Conservatives, who have been in power since 2010, would not win a single year should it be decided by Scrabble. In fact, the Liberal Democrats would win 6 of the 9 years we have studied, with Labour, under Jeremy Corbyn, taking the other 3. I’m sure both parties would be happy with that in hindsight!

Just in case anyone was under any illusion, of course mean Scrabble score is a poor way of deciding elections and I am not endorsing its use—at the very least, a game of Pictionary would be more appropriate…
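For the curious, the word scores quoted above can be checked with a few lines of Python. This sketch is not the quanteda routine used in the post. It simply sums the standard English-language Scrabble tile values, ignoring board multipliers and tile limits, and the function names are hypothetical.

```python
# Standard English-language Scrabble tile values.
TILE_VALUES = {
    **dict.fromkeys("aeilnorstu", 1),
    **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3),
    **dict.fromkeys("fhvwy", 4),
    "k": 5,
    **dict.fromkeys("jx", 8),
    **dict.fromkeys("qz", 10),
}


def scrabble_score(word):
    """Sum the tile values of the letters in word, ignoring other characters."""
    return sum(TILE_VALUES.get(ch, 0) for ch in word.lower())


def mean_scrabble_score(words):
    """Mean Scrabble score per word for a non-empty list of words."""
    return sum(scrabble_score(w) for w in words) / len(words)


print(scrabble_score("czechoslovakia"))  # 37
print(scrabble_score("unequivocally"))   # 30
print(scrabble_score("fizzing"))         # 29
```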

## Takeaways

With that, I end my brief foray into British political speeches. While I have barely scratched the surface of Natural Language Processing (NLP) methods, I hope that I have shown how powerful these techniques can be for summarising large pieces of text through sentiment, TF-IDF and syllable complexity.

I had minimal experience with NLP methods upon embarking on this project and would like to thank WDSS (in particular, Janique Krasnowska) for supporting me until completion. I feel like I’ve learned a lot and certainly furthered my knowledge and experience. I would suggest that anyone who would like to conduct some data science studies outside their degree looks out for research opportunities with WDSS and seizes them with both hands. I will certainly be looking out for more chances!

## Appendix: Summary Table of All Speech Metrics

| Party            | Year | Leader        | Number of Words | Mean Sentiment | Top TF-IDF Word | Mean Word Syllables | Mean Scrabble Score |
|------------------|------|---------------|-----------------|----------------|-----------------|---------------------|---------------------|
| Conservative     | 2010 | David Cameron | 6247            | 0.38866397     | harry           | 1.439020            | 7.435761            |
| Conservative     | 2011 | David Cameron | 6132            | 0.40983607     | euro            | 1.450808            | 7.538034            |
| Conservative     | 2012 | David Cameron | 6070            | 0.47016706     | rise            | 1.403032            | 7.367087            |
| Conservative     | 2013 | David Cameron | 5917            | 0.45477387     | finish          | 1.400913            | 7.401018            |
| Conservative     | 2014 | David Cameron | 6104            | 0.49638554     | 40p             | 1.378480            | 7.304463            |
| Conservative     | 2015 | David Cameron | 6676            | 0.45546559     | extremism       | 1.443745            | 7.463356            |
| Conservative     | 2016 | Theresa May   | 7187            | 0.88888889     | plays           | 1.476137            | 7.568521            |
| Conservative     | 2017 | Theresa May   | 7113            | 0.63288719     | dream           | 1.485799            | 7.432207            |
| Conservative     | 2018 | Theresa May   | 7120            | 0.54709419     | proposal        | 1.483296            | 7.571831            |
| Labour           | 2010 | Ed Miliband   | 6168            | 0.40265487     | recognises      | 1.480791            | 7.348780            |
| Labour           | 2011 | Ed Miliband   | 5891            | 0.56207675     | bargain         | 1.393482            | 7.288363            |
| Labour           | 2012 | Ed Miliband   | 7390            | 0.40898345     | succeeded       | 1.388167            | 7.134647            |
| Labour           | 2013 | Ed Miliband   | 7954            | 0.74186992     | race            | 1.370724            | 7.068717            |
| Labour           | 2014 | Ed Miliband   | 5697            | 0.79874214     | ethic           | 1.437664            | 7.322393            |
| Labour           | 2015 | Jeremy Corbyn | 7178            | 0.47313692     | kinder          | 1.530257            | 7.609025            |
| Labour           | 2016 | Jeremy Corbyn | 5894            | 0.43455497     | migrants        | 1.569805            | 7.809402            |
| Labour           | 2017 | Jeremy Corbyn | 5965            | 0.08439898     | grenfell        | 1.585235            | 7.864050            |
| Labour           | 2018 | Jeremy Corbyn | 5703            | -0.04136253    | palestinian     | 1.564188            | 7.706432            |
| Liberal Democrat | 2010 | Nick Clegg    | 4354            | 0.16060606     | plural          | 1.457425            | 7.571065            |
| Liberal Democrat | 2011 | Nick Clegg    | 4257            | 0.06571429     | barons          | 1.471583            | 7.604947            |
| Liberal Democrat | 2012 | Nick Clegg    | 4328            | 0.33333333     | maurice         | 1.468837            | 7.439471            |
| Liberal Democrat | 2013 | Nick Clegg    | 5927            | 0.49206349     | liberal         | 1.475379            | 7.538670            |
| Liberal Democrat | 2014 | Nick Clegg    | 6241            | 0.33682008     | dems            | 1.494074            | 7.640180            |
| Liberal Democrat | 2015 | Tim Farron    | 5804            | 0.12616822     | liberal         | 1.462427            | 7.327490            |
| Liberal Democrat | 2016 | Tim Farron    | 6178            | 0.21428571     | brexit          | 1.454840            | 7.391953            |
| Liberal Democrat | 2017 | Vince Cable   | 5139            | 0.20505618     | brexit          | 1.574762            | 7.818679            |
| Liberal Democrat | 2018 | Vince Cable   | 4364            | 0.08626198     | brexit          | 1.544255            | 7.863584            |

Author: Ewan Yeaxlee
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise specified.
