One of the many great new packages to grace the R ecosystem in the last few years has been
tidytext by Julia Silge & David Robinson. It has allowed R users familiar with the
tidyverse suite of packages to apply these ‘tidy tools’ to text data.
If you’re not currently an R
tidyverse user, fear not! This tutorial has been designed with beginners in mind and will introduce you to the fundamentals of a ‘tidy’ data analysis pipeline and show you how to integrate natural language data into this workflow using the tools provided by
tidytext. It will also touch on using the
ggplot2 package to visualise the results of your analysis.
So let’s get cracking…
In R, we can use the
%>% operator to ‘pipe’ data from one function to another. Most functions in the
tidyverse take a data frame as their first input and return a modified version of that data frame as their output. The pipe does the job of taking the output of one function and sending it into the first input of the next function.
This means we can write code like the example below, which starts with a data frame and ‘pipes’ it through a series of functions, each one receiving the output of the function above it, until we get the data into the shape we need.
It also lets you write very readable code that reads like a sentence, with the pipe linking the pieces of the story together with an “and then…” conjunction.
data_set %>% do_this_thing() %>% now_this_thing() %>% one_last_thing()
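To make that concrete, here is a minimal sketch of a pipe in action using dplyr and R’s built-in mtcars data set (nothing to do with Harry Potter yet, just the pipe at work):

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%              # keep only 4-cylinder cars, and then...
  group_by(gear) %>%                # group them by number of gears, and then...
  summarise(avg_mpg = mean(mpg))    # compute the average miles per gallon per group

Each function receives the data frame produced by the step before it, so the whole pipeline reads top to bottom like a sentence.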
Now to some data…
We have a data frame named
hp_text loaded in our environment with the entire text of all Harry Potter books!
In R, simply running code with the name of a data frame will print out a sample of its contents.
So run the code chunk below to have a look at what we’ve got…
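The chunk is nothing more than the name of the data frame:

hp_text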
Because this is a special type of data frame known as a ‘tibble’, we get a useful print out of the data telling us how many rows and columns we have, the type of data in each column (
<chr>), and a sample of the first 10 rows of data.
We can see our data has book, chapter and text columns. Our first task is to tidy the data to prepare it for analysis…
Each row of our data contains the entire text of one chapter of one book. But to analyse this data using tidyverse tools, we need to break each row down into a more meaningful unit of text, known as a token.
When we talk about ‘tidy text data’ we are referring to a table with one-token-per-row. In this case, we are going to define a token as a single word. We can then perform various forms of text analysis on a row-by-row basis and derive some insight from each token of text.
Luckily, there is a single function from the
tidytext package that will perform this laborious task for us called
unnest_tokens(). Fill in the blank below with the function name to see how it transforms the Harry Potter data set.
The output argument is the name of the new column that is going to be created. Since we’re unnesting down to a single word, let’s call that column word.
The input argument is the name of the column in the current data set that contains the text. In our case that is text.
hp_text %>% ___(output = ___, input = ___)
hp_text %>% unnest_tokens(output = word, input = text)
Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr.
Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English.
The tidytext package provides a
stop_words data set that looks like this…
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## # ... with 1,139 more rows
We can remove stop words with an
anti_join(). This is known as a filtering join operation that works by removing any words in our data set that also appear in the stop words data set.
Fill out the blanks with the name of the function and the name of the data set containing stop words to filter out any stop words in the Harry Potter data. The
by = "word" argument tells the join function that we would like it to anti-join on the
word column in each data set.
hp_text %>%
  unnest_tokens(output = word, input = text) %>%
  ___(___, by = "word")

hp_text %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")
We now have our tokenized data frame with all stop-words removed saved as
hp_tidy, as shown below. We’re now ready to conduct some analysis on the text!
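If you’re recreating this yourself, the assignment is just the pipeline from the previous exercise saved to a new name (note that unnest_tokens() carries the book and chapter columns along with the new word column):

hp_tidy <- hp_text %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")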
## # A tibble: 409,338 x 3
##    book                    chapter word
##    <fct>                     <int> <chr>
##  1 The Philosopher's Stone       1 boy
##  2 The Philosopher's Stone       1 lived
##  3 The Philosopher's Stone       1 dursley
##  4 The Philosopher's Stone       1 privet
##  5 The Philosopher's Stone       1 drive
##  6 The Philosopher's Stone       1 proud
##  7 The Philosopher's Stone       1 perfectly
##  8 The Philosopher's Stone       1 normal
##  9 The Philosopher's Stone       1 people
## 10 The Philosopher's Stone       1 expect
## # ... with 409,328 more rows
Use the count() function, passing the word column as the first argument, to count the occurrences of each unique word in the data. Add a sort = TRUE argument to bring the most common words to the top.
hp_tidy %>% count(___, sort = ___)
hp_tidy %>% count(word, sort = TRUE)
To get word counts on a by-book basis, we can simply add the book column name to the count() call.
hp_tidy %>% count(___, word, sort = TRUE)
hp_tidy %>% count(book, word, sort = TRUE)
In this case, character names dominate the top word counts. We may want to take a look at the top words that are not some of the most popular names in the books.
To do this, you would create a character vector of the names you would like to remove, like the one below.
##  "harry" "potter" "dumbledore" "voldemort" "snape" ##  "sirius" "hermione" "ron" "weasley" "draco" ##  "malfoy" "hagrid" "neville" "dobby" "moody" ##  "lupin" "bellatrix" "mcgonagall" "grindelwald" "tina" ##  "queenie" "jacob" "harry's" "ginny" "george"
Then use the filter() function to filter out any words from the data that are in the hp_characters vector.
Hint: use a ! before the column name you are filtering, so that you keep only the words that are not %in% the vector you are filtering by.
hp_tidy %>% ___(___ %in% ___) %>% count(word, sort = TRUE)
hp_tidy %>% filter(!word %in% hp_characters) %>% count(word, sort = TRUE)
The tidytext package comes with 3 built-in sentiment lexicons: AFINN, Bing and NRC.
How does each lexicon differ in their measurement of sentiment?
AFINN:

## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
Bing:

## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faced     negative
##  2 2-faces     negative
##  3 a+          positive
##  4 abnormal    negative
##  5 abolish     negative
##  6 abominable  negative
##  7 abominably  negative
##  8 abominate   negative
##  9 abomination negative
## 10 abort       negative
## # ... with 6,778 more rows
NRC:

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
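You can pull any of these lexicons into your session with get_sentiments() and inspect them yourself:

get_sentiments("afinn")   # a numeric score per word
get_sentiments("bing")    # a binary positive/negative label per word
get_sentiments("nrc")     # one or more emotion categories per word

Notice that the NRC lexicon can assign several sentiments to the same word (see “abandon” above), which is why it has far more rows than the others.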
With data in a tidy format, sentiment analysis can be done as an inner join. This is another of the great successes of viewing text mining as a tidy data analysis task; much as removing stop words is an anti-join operation, performing sentiment analysis is an inner-join operation.
Inner-joining a lexicon to our data will reduce the data to only words that have a match in the lexicon, and then join the sentiment column onto the data set.
Use inner_join() in the code chunk below to join the
bing sentiment lexicon to our data.
hp_tidy %>% ___(get_sentiments("___"))
hp_tidy %>% inner_join(get_sentiments("bing"))
Once we have joined the positive/negative sentiments from the bing lexicon to our data, we can use
mutate() to calculate the proportions in each book.
Hints: use the n() function with summarise() to count the occurrences of each group, then n / sum(n) with mutate() to get the proportions, and finish with ungroup().
hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(___, ___) %>%
  summarise(n = ___) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book, sentiment) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

hp_props <- hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book, sentiment) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
Once we’ve done some analysis, it’s a good idea to visualise the results in a chart. We can use the
ggplot2 package from the tidyverse to do this.
We won’t have time to cover ggplot2 at great length in this tutorial, but it’s widely regarded as one of the best charting libraries available, with near-endless visualisation options made possible by its layered approach of iteratively adding layers of data to a plot.
Combining visualisation and your data wrangling/analysis into a single scripted process is also a very powerful concept that can improve the quality and reproducibility of your data reporting.
Below is a basic ggplot set-up. Note that each line is linked with a
+ rather than a %>%. The
ggplot() function takes a data argument as well as an aesthetics function
aes() in which you map columns of your data to coordinates, shapes or colours on the chart.
In this instance, we then add a
geom_col with position set to "stack" (a stacked bar chart), flip the axes to make the book names more readable, then add a title to the chart.
Fill in the blanks to put the
book column on the x-axis,
prop column on the y-axis and map the
sentiment column to the colour
fill. Finally, give your chart an appropriate title and run the code to see what we get!
ggplot(data = hp_props, aes(x = ___, y = ___, fill = ___)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Give your chart a title here!", x = NULL, y = NULL)
ggplot(data = hp_props, aes(x = book, y = prop, fill = sentiment)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Proportion of sentiment in Harry Potter books", x = NULL, y = NULL)
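If you want to keep a copy of the chart, ggplot2’s ggsave() will write the most recently displayed plot to disk (the file name and dimensions here are just examples):

ggsave("hp_sentiment_props.png", width = 8, height = 5)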
Starting with the hp_tidy data set, can you:

- inner_join() the AFINN sentiment lexicon (get_sentiments("afinn"))
- sum the score of each chapter in each book?
hp_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, chapter) %>%
  summarise(score = sum(score)) %>%
  ungroup()

hp_scores <- hp_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, chapter) %>%
  summarise(score = sum(score)) %>%
  ungroup()
The sentiment score by chapter data frame you (hopefully) just built is now saved in the environment as
hp_scores. Let’s chart it!
For this chart, let’s show the computed
scores over time (
chapters) and create a separate chart for each book by supplying the
book column to the
facet_wrap() function. We can also colour the bars differently depending on whether the
score is greater than 0 or not.
ggplot(data = hp_scores, aes(x = ___, y = ___, fill = ___ > 0)) +
  facet_wrap(~___, ncol = 1) +
  geom_col(show.legend = FALSE) +
  labs(title = "Add your title here...", x = "___", y = "___")

ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 2) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, colour = "black") +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score")
After you have a chart, try modifying some of the elements…
- change the ncol = argument in the facet_wrap() function
- add a scales = "free_x" argument inside facet_wrap(). What has changed?
- add a theme function to your plot (linked with a +) to use a different chart theme…
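Putting a few of those tweaks together, one possible variation might look like the below (theme_minimal() is just one of several built-in ggplot2 themes to try):

ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 4, scales = "free_x") +   # more columns, independent x-axes
  geom_col(show.legend = FALSE) +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score") +
  theme_minimal()                                    # a different chart theme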
So that was a brief introduction to tackling text analysis the tidy way using the tidytext package.
When applying these techniques yourself you will start to find many analyses follow a similar pattern…
# load the packages
library(tidyverse)
library(tidytext)

# your data set with a text column
sent_analysis <- text_data %>%
  # tidy, remove stop words, join sentiments
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("...")) %>%
  # do some analysis
  group_by(...) %>%
  summarise(...) %>%
  mutate(...) %>%
  ungroup()

# visualise your results
ggplot(sent_analysis, aes(...)) +
  geom_*() +
  labs(title = "...")
But of course, we have only scratched the surface here. To learn more I would highly recommend diving head first into the following 2 free textbooks: Text Mining with R: A Tidy Approach by Julia Silge & David Robinson, and R for Data Science by Garrett Grolemund & Hadley Wickham.
And if you’re in the London area and are looking to learn some R - Culture of Insight is currently running one-day R training workshops or in-house training days for companies and organisations.
Head over to the training page on our website to find out more and get in touch if you’re interested.