Tidy Text Analysis

Introduction

Tidy Tools

One of the many great new packages to grace the R ecosystem in the last few years has been tidytext by Julia Silge & David Robinson. It has allowed R users familiar with the tidyverse suite of packages to apply these ‘tidy tools’ to text data.

If you’re not currently an R tidyverse user, fear not! This tutorial has been designed with beginners in mind and will introduce you to the fundamentals of a ‘tidy’ data analysis pipeline and how to integrate natural language data into this workflow using the tools provided by tidytext. It will also touch on using the ggplot2 package to visualise the results of your analysis.

So let’s get cracking…

Key concept: the pipe %>%

In R, we can use the %>% operator to ‘pipe’ data from one function to another. Most functions in the tidyverse take a data frame as the first input and send out a modified version of that data frame as the output. The pipe does the job of taking the output of one function and sending it into the first input of the next function.

This means we can write code like the example below, which starts with a data frame and ‘pipes’ it along a series of functions, each one receiving the output of the function above it, until we get the data into the shape we need.

It also allows you to write very readable code that reads like a sentence, with the pipe linking the pieces of the story together with an ‘and then…’ conjunction.

data_set %>% 
  do_this_thing() %>% 
  now_this_thing() %>% 
  one_last_thing()
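Here’s a tiny runnable illustration with some toy numbers (nothing to do with the text analysis below): the nested and piped versions produce exactly the same result, but the piped version reads top-to-bottom.

library(magrittr) # the pipe is also loaded as part of the tidyverse

# nested: read from the inside out
round(mean(sqrt(c(1, 4, 9, 16))), 1)
## [1] 2.5

# piped: read top-to-bottom, "and then..."
c(1, 4, 9, 16) %>% 
  sqrt() %>% 
  mean() %>% 
  round(1)
## [1] 2.5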

Now to some data…

Our Test Data

We have a data frame named hp_text loaded in our environment with the entire text of all Harry Potter books!

In R, simply running the name of a data frame will print out a sample of its contents.

So run the code chunk below to have a look at what we’ve got…

hp_text

Because this is a special type of data frame known as a ‘tibble’, we get a useful print out of the data telling us how many rows and columns we have, the type of data in each column (<fct>, <int>, <chr>), and a sample of the first 10 rows of data.

We can see our data has book, chapter, and text columns. Our first task is to tidy the data to prepare it for analysis…
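Before we do, a quick aside: if you’d like to build a similar data frame yourself outside this tutorial, one possible route is Bradley Boehmke’s harrypotter package. The sketch below assumes that package, which stores each book as a character vector with one element per chapter (hp_pstone is an illustrative name, not the tutorial’s hp_text).

# remotes::install_github("bradleyboehmke/harrypotter")
library(tidyverse)
library(harrypotter)

# one book as an example; the same pattern works for the others
hp_pstone <- tibble(
  book = factor("The Philosopher's Stone"),
  chapter = seq_along(philosophers_stone), # chapter number
  text = philosophers_stone                # full chapter text
)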

Step 1: Tidying Up

unnest_tokens()

Each row of our data contains the entire text of each chapter of each book. But to analyse this data using tidyverse tools, we need to reduce each row down to a more meaningful unit of text, known as a token.

When we talk about ‘tidy text data’ we are referring to a table with one-token-per-row. In this case, we are going to define a token as a single word. We can then perform various forms of text analysis on a row-by-row basis and derive some insight from each token of text.

Luckily, there is a single function from the tidytext package, unnest_tokens(), that will perform this laborious task for us. Fill in the blanks below to see how it transforms the Harry Potter data set.

The output argument is the name of the new column that is going to be created. Since we’re unnesting down to a single word, let’s call that word.

The input argument is the name of the column in the current data set that contains the text. In our case that is text.

hp_text %>% 
  ___(output = ___, input = ___) 
hp_text %>% 
  unnest_tokens(output = word, input = text)

What did unnest_tokens() do?

  • Other columns have been retained
  • Punctuation has been stripped
  • Words have been converted to lower-case
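You can see all three of those behaviours in isolation on a tiny made-up data set (the toy tibble below is invented purely for illustration):

library(tidyverse)
library(tidytext)

toy <- tibble(
  line = 1:2,
  text = c("The Boy Who Lived!", "Mr and Mrs Dursley were perfectly normal.")
)

toy %>% 
  unnest_tokens(output = word, input = text)
# one row per word, lower-cased, punctuation stripped,
# with the line column retained alongside each token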

Step 2: Filtering Out ‘Stop Words’

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr.

Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English.

The tidytext package provides a stop_words data set that looks like this…

stop_words
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows

We can remove stop words with an anti_join(). This is known as a ‘filtering join’: it works by removing any rows in our data set whose word also appears in the stop words data set.

animation by Garrick Aden-Buie @grrrck

Fill in the blanks with the name of the function and the name of the data set containing stop words to filter out any stop words in the Harry Potter data. The by = "word" argument tells the join function that we would like it to anti-join on the word column in each data set.

hp_text %>% 
  unnest_tokens(output = word, input = text) %>% 
  ___(___, by = "word")
hp_text %>% 
  unnest_tokens(output = word, input = text) %>% 
  anti_join(stop_words, by = "word")
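A side note: if other uninformative words clutter your results later on, you can extend stop_words with a custom list before the anti_join. A sketch, with made-up extra words:

# add your own words to the stop words data set
my_stop_words <- stop_words %>% 
  bind_rows(tibble(word = c("mr", "mrs"), lexicon = "custom"))

hp_text %>% 
  unnest_tokens(output = word, input = text) %>% 
  anti_join(my_stop_words, by = "word")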

Step 3: Word Counts

We now have our tokenised data frame with all stop words removed, saved as hp_tidy as shown below. We’re now ready to conduct some analysis on the text!

hp_tidy
## # A tibble: 409,338 x 3
##    book                    chapter word     
##    <fct>                     <int> <chr>    
##  1 The Philosopher's Stone       1 boy      
##  2 The Philosopher's Stone       1 lived    
##  3 The Philosopher's Stone       1 dursley  
##  4 The Philosopher's Stone       1 privet   
##  5 The Philosopher's Stone       1 drive    
##  6 The Philosopher's Stone       1 proud    
##  7 The Philosopher's Stone       1 perfectly
##  8 The Philosopher's Stone       1 normal   
##  9 The Philosopher's Stone       1 people   
## 10 The Philosopher's Stone       1 expect   
## # ... with 409,328 more rows

Word Counts

Use the count() function passing the word column as the first argument to count the occurrences of each unique word in the data. Add a sort = TRUE argument to bring the most common words to the top.

hp_tidy %>% 
  count(___, sort = ___)
hp_tidy %>% 
  count(word, sort = TRUE)

Word Count by Book

To get word counts on a by-book basis, we can simply add the book column name to the count() function…

hp_tidy %>% 
  count(___, word, sort = TRUE)
hp_tidy %>% 
  count(book, word, sort = TRUE)

Counts minus names

In this case, character names dominate the top word counts. We may want to take a look at the top words that are not the names of the most popular characters in the books.

To do this, you would create a character vector of the names you would like to remove, like the one below.

hp_characters
##  [1] "harry"       "potter"      "dumbledore"  "voldemort"   "snape"      
##  [6] "sirius"      "hermione"    "ron"         "weasley"     "draco"      
## [11] "malfoy"      "hagrid"      "neville"     "dobby"       "moody"      
## [16] "lupin"       "bellatrix"   "mcgonagall"  "grindelwald" "tina"       
## [21] "queenie"     "jacob"       "harry's"     "ginny"       "george"

Then use the filter() function to filter out words from the data that are in the hp_characters vector.

Hint: use a ! before the column name to keep only the words that are not %in% the vector you are filtering by.

hp_tidy %>% 
  ___(___ %in% ___) %>% 
  count(word, sort = TRUE)
hp_tidy %>% 
  filter(!word %in% hp_characters) %>% 
  count(word, sort = TRUE)
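As a preview of the charting we’ll do later, here’s one possible way to turn these counts into a bar chart with ggplot2 (a sketch; the cut-off of 15 words is arbitrary):

hp_tidy %>% 
  filter(!word %in% hp_characters) %>% 
  count(word, sort = TRUE) %>% 
  top_n(15, n) %>%                           # keep the 15 most common words
  ggplot(aes(x = reorder(word, n), y = n)) + # order bars by count
  geom_col() +
  coord_flip() +
  labs(title = "Most common words (names removed)", x = NULL, y = NULL)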

Step 4: Sentiment Analysis

Sentiment Lexicons

The tidytext package comes with three built-in sentiment lexicons: AFINN, Bing and NRC.

How does each lexicon differ in its measurement of sentiment?

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
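One quick way to compare them is to look up the same word in each lexicon (a sketch; note that a word can be missing from a given lexicon entirely, in which case you get zero rows back):

get_sentiments("afinn") %>% filter(word == "abandon") # a numeric score
get_sentiments("bing") %>% filter(word == "abandon")  # positive/negative, or absent
get_sentiments("nrc") %>% filter(word == "abandon")   # one row per emotion category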

Sentiment analysis using inner_join()

With data in a tidy format, sentiment analysis can be done as an inner join. This is another of the great successes of viewing text mining as a tidy data analysis task; much as removing stop words is an anti-join operation, performing sentiment analysis is an inner-join operation.

Inner-joining a lexicon to our data will reduce the data to only the words that have a match in the lexicon, and then join the sentiment column onto the data set.

animation by Garrick Aden-Buie @grrrck

Use inner_join() in the code chunk below to join the Bing sentiment lexicon to our data.

hp_tidy %>% 
  ___(get_sentiments("___"))
hp_tidy %>% 
  inner_join(get_sentiments("bing"))

Calculate proportion of positive/negative words in each book

Once we have joined the positive/negative sentiments from the bing lexicon to our data, we can use group_by(), summarise() and mutate() to calculate the proportions in each book.

  • First, group the data by the book and sentiment columns
  • Then, use the n() function inside summarise() to count the occurrences of each group
  • Finally, use n / sum(n) with mutate() to get the proportions, and then ungroup()
hp_tidy %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(___, ___) %>% 
  summarise(n = ___) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup()
hp_tidy %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(book, sentiment) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup()
hp_props <- hp_tidy %>% 
  inner_join(get_sentiments("bing")) %>% 
  group_by(book, sentiment) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n / sum(n)) %>% 
  ungroup()

Visualise the results with ggplot2

illustration by Allison Horst @allison_horst

Once we’ve done some analysis, it’s a good idea to visualise the results in a chart. We can use the ggplot2 package from the tidyverse to do this.

We won’t have time to cover ggplot2 at great length in this tutorial, but it’s one of the most widely used charting libraries available, with near-endless visualisation options made possible by iteratively adding layers to a plot.

Combining visualisation and your data wrangling/analysis into a single scripted process is also a very powerful concept that can improve the quality and reproducibility of your data reporting.

Below is a basic ggplot set-up. Note that each line is linked with a + rather than a %>%.

The initial ggplot() function takes a data argument as well as an aesthetics function aes() in which you map columns of your data to coordinates, shapes or colours on the chart.

In this instance, we then add a geom_col with position set to stack (a stacked bar chart), flip the axes to make the book names more readable, then add a title to the chart.

Fill in the blanks to put the book column on the x-axis and the prop column on the y-axis, and map the sentiment column to the colour fill. Finally, give your chart an appropriate title and run the code to see what we get!

ggplot(data = hp_props, aes(x = ___, y = ___, fill = ___)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Give your chart a title here!", x = NULL, y = NULL)
ggplot(data = hp_props, aes(x = book, y = prop, fill = sentiment)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Proportion of sentiment in Harry Potter books", x = NULL, y = NULL)

Challenge!

Starting with the hp_tidy data set, can you:

  1. inner_join() the AFINN sentiment lexicon (get_sentiments("afinn"))
  2. Calculate the sentiment score of each chapter in each book
hp_tidy %>% 
hp_tidy %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(book, chapter) %>% 
  summarise(score = sum(score)) %>% 
  ungroup()
hp_scores <- hp_tidy %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(book, chapter) %>% 
  summarise(score = sum(score)) %>% 
  ungroup()

Visualise the results

The sentiment score by chapter data frame you (hopefully) just built is now saved in the environment as hp_scores. Let’s chart it!

For this chart, let’s show the computed scores over time (chapters) and create a separate panel for each book by supplying the book column to the facet_wrap() function. We can also colour the bars differently depending on whether the score is greater than 0 or not.

ggplot(data = hp_scores, aes(x = ___, y = ___, fill = ___ > 0)) +
  facet_wrap(~___, ncol = 1) +
  geom_col(show.legend = FALSE) +
  labs(title = "Add your title here...", x = "___", y = "___")
ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 2) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, colour = "black") +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score")

After you have a chart, try modifying some of the elements…

  • What does changing the ncol = argument in the facet_wrap function do?
  • Add a scales = "free_x" argument inside facet_wrap(). What has changed?
  • Try adding another line (remember to use the +) to use a different chart theme… theme_light() perhaps?
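For reference, here’s one possible combination of those tweaks (a sketch, not the only answer):

ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 3, scales = "free_x") + # 3 columns; each panel keeps its own x range
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score") +
  theme_light()                                    # a lighter look than the default theme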

Summary

So that was a brief introduction to tackling text analysis the tidy way using the tidyverse and tidytext packages.

When applying these techniques yourself you will start to find many analyses follow a similar pattern…

# load the packages
library(tidyverse)
library(tidytext)

# your data set with a text column
sent_analysis <- text_data %>% 
  
  # tidy, remove stop words, join sentiments
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("...")) %>% 
  
  # do some analysis
  group_by(...) %>% 
  summarise(...) %>% 
  mutate(...) %>% 
  ungroup()

# visualise your results
ggplot(sent_analysis, aes(...)) +
  geom_*() +
  labs(title = "...")

Further Resources

But of course, we have only scratched the surface here. To learn more I would highly recommend diving head first into the following 2 free textbooks:

  • Text Mining with R: A Tidy Approach by Julia Silge & David Robinson (https://www.tidytextmining.com)
  • R for Data Science by Hadley Wickham & Garrett Grolemund (https://r4ds.had.co.nz)

Training

And if you’re in the London area and looking to learn some R, Culture of Insight is currently running one-day R training workshops and in-house training days for companies and organisations.

Head over to the training page on our website to find out more and get in touch if you’re interested.

fin.

Paul Campbell

Culture of Insight