We have a data frame loaded in our environment with the entire text of all Harry Potter books.
Run the code chunk below to inspect it…
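A minimal inspection chunk might look like this (a sketch, assuming the data frame is called hp_text, as it is in the chunks that follow):

library(dplyr)

# One row per chapter: book, chapter and the full chapter text
glimpse(hp_text)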
Each row of our data contains the entire text for each chapter of each book. But to analyse this data using the tools we have learned so far, we need to reduce each row down to a more meaningful unit of text, known as a token.
When we talk about ‘tidy text data’ we are referring to a table with one-token-per-row. In this case, we are going to define a token as a single word. We can then perform various forms of text analysis on a row-by-row basis and derive some insight from each token of text.
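As a toy illustration (not part of the tutorial data) of what one-token-per-row looks like, here is a single sentence being tidied with the function introduced just below:

library(dplyr)     # for %>% and tibble()
library(tidytext)  # for unnest_tokens()

# One row of raw text...
toy <- tibble(text = "The Boy Who Lived")

# ...becomes one lowercase word per row: "the", "boy", "who", "lived"
toy %>% unnest_tokens(output = word, input = text)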
Luckily, there is a single function from the tidytext package that will perform this laborious task for us, called unnest_tokens(). Fill in the blanks below, starting with the function name, to see how it transforms the Harry Potter data set.
- The output argument is the name of the new column that is going to be created. Since we’re unnesting down to a single word, let’s call that word.
- The input argument is the name of the column in the current data set that contains the text. In our case that is text.
hp_text %>% ___(output = ___, input = ___)
hp_text %>% unnest_tokens(output = word, input = text)
Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr.
Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English.
We can remove stop words (kept in the tidytext data set stop_words) with an anti_join(). The stop_words data set looks like this…
## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## # ... with 1,139 more rows
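As a quick aside (not needed for the exercises), you can tally where these 1,149 words come from:

# stop_words bundles several stop-word lexicons; count the words in each
stop_words %>% count(lexicon)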
Fill out the blanks with the name of the function and the name of the data set containing stop words to filter out any stop words in the Harry Potter data.
hp_text %>% unnest_tokens(output = word, input = text) %>% ___(___, by = "word")
hp_text %>% unnest_tokens(output = word, input = text) %>% anti_join(stop_words, by = "word")
We now have our tokenized data frame with all stop words removed, saved as hp_tidy as shown below. We’re now ready to conduct some analysis on the text!
## # A tibble: 409,338 x 3
##    book                    chapter word
##    <fct>                     <int> <chr>
##  1 The Philosopher's Stone       1 boy
##  2 The Philosopher's Stone       1 lived
##  3 The Philosopher's Stone       1 dursley
##  4 The Philosopher's Stone       1 privet
##  5 The Philosopher's Stone       1 drive
##  6 The Philosopher's Stone       1 proud
##  7 The Philosopher's Stone       1 perfectly
##  8 The Philosopher's Stone       1 normal
##  9 The Philosopher's Stone       1 people
## 10 The Philosopher's Stone       1 expect
## # ... with 409,328 more rows
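For reference, hp_tidy combines the two steps we just practised; a sketch of the assignment:

# Tokenize to one word per row, then anti-join away the stop words
hp_tidy <- hp_text %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")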
To count the occurrences of each unique word in the data, we can use the count() function, passing the word column as the first argument. Add a sort = TRUE argument to bring the most common words to the top.
hp_tidy %>% count(___, sort = ___)
hp_tidy %>% count(word, sort = TRUE)
To get word counts on a by-book basis, we can simply add the book column name to the count() call.
hp_tidy %>% count(___, word, sort = TRUE)
hp_tidy %>% count(book, word, sort = TRUE)
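An optional extension (a sketch, not part of the exercises): pull out only the single most common word in each book with group_by() and slice_max().

# Requires dplyr 1.0.0+ for slice_max()
hp_tidy %>%
  count(book, word) %>%
  group_by(book) %>%
  slice_max(n, n = 1) %>%
  ungroup()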
In this case, character names dominate the top word counts. If we wanted to look at the top words that are not the names of the most popular characters in the books, we would need to filter those names out. To do this, you would create a character vector of the names you would like to remove, like the one below.
##  "harry" "potter" "dumbledore" "voldemort" ##  "snape" "sirius" "hermione" "ron" ##  "weasley" "draco" "malfoy" "hagrid" ##  "neville" "dobby" "moody" "lupin" ##  "bellatrix" "mcgonagall" "newt scamander" "grindelwald" ##  "tina" "queenie" "jacob" "harry's" ##  "ginny" "george"
Then use the filter() function to filter out words from the data that are in the hp_characters vector. Hint: put a ! before the column name you are filtering so that you keep only the words that are not %in% the vector you are filtering by.
hp_tidy %>% ___(___ %in% ___) %>% count(word, sort = TRUE)
hp_tidy %>% filter(!word %in% hp_characters) %>% count(word, sort = TRUE)
The tidytext package comes with 3 built-in sentiment lexicons: AFINN, Bing and NRC.
How does each lexicon differ in their measurement of sentiment?
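Each lexicon can be pulled into a tidy data frame with get_sentiments(); the three calls below produce the outputs that follow. (Note: newer versions of tidytext may ask you to download the AFINN and NRC lexicons via the textdata package the first time.)

get_sentiments("afinn")  # numeric scores from -5 (negative) to 5 (positive)
get_sentiments("bing")   # binary positive/negative labels
get_sentiments("nrc")    # words tagged with emotions such as fear or trust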
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faced     negative
##  2 2-faces     negative
##  3 a+          positive
##  4 abnormal    negative
##  5 abolish     negative
##  6 abominable  negative
##  7 abominably  negative
##  8 abominate   negative
##  9 abomination negative
## 10 abort       negative
## # ... with 6,778 more rows
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 abacus      trust
##  2 abandon     fear
##  3 abandon     negative
##  4 abandon     sadness
##  5 abandoned   anger
##  6 abandoned   fear
##  7 abandoned   negative
##  8 abandoned   sadness
##  9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
With data in a tidy format, sentiment analysis can be done as an inner join. This is another of the great successes of viewing text mining as a tidy data analysis task; much as removing stop words is an anti-join operation, performing sentiment analysis is an inner-join operation.
Inner-joining a lexicon to our data will reduce the data to only the words that have a match in the lexicon, and will then join the sentiment column onto the data set.
Use an inner_join in the code chunk below to join the bing sentiment lexicon to our data.
hp_tidy %>% ___(get_sentiments("___"))
hp_tidy %>% inner_join(get_sentiments("bing"))
Once we have joined the positive/negative sentiments from the bing lexicon to our data, we can calculate the proportion of each sentiment in each book: group_by() book and sentiment, use the n() function with summarise() to count the occurrences of each group, use n / sum(n) with mutate() to get the proportions, and then ungroup().
hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(___, ___) %>%
  summarise(n = ___) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book, sentiment) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
hp_props <- hp_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(book, sentiment) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
Once we’ve done some analysis, it’s a good idea to visualise the results in a chart. We can use the ggplot2 package from the tidyverse to do this.
We won’t have time to cover ggplot2 at great length today, but it’s widely regarded as one of the best charting libraries available, with near-endless visualisation options made possible by iteratively adding layers of data to a chart.
Combining visualisation and your data wrangling/analysis into a single scripted process is also a very powerful concept that can save you a lot of time by eliminating the need to export your data into another visualisation tool.
Below is a basic ggplot setup. Note that each line is linked with a + rather than a %>%. The ggplot() function takes a data argument as well as an aesthetics function, aes(), in which you map columns of your data to coordinates, shapes or colours on the chart.
In this instance, we then add a geom_col with position set to "stack" (a stacked-bar chart), flip the axes to make the book names more readable, then add a title to the chart.
Fill in the blanks to put the book column on the x-axis and the prop column on the y-axis, and map the sentiment column to the colour fill. Finally, give your chart an appropriate title and run the code to see what we get!
ggplot(data = hp_props, aes(x = ___, y = ___, fill = ___)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Give your chart a title here!", x = NULL, y = NULL)
ggplot(data = hp_props, aes(x = book, y = prop, fill = sentiment)) +
  geom_col(position = "stack") +
  coord_flip() +
  labs(title = "Proportion of sentiment in Harry Potter books", x = NULL, y = NULL)
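If you would rather compare the two sentiments side by side than stacked, one variation (a sketch, not part of the exercise) is to change the position argument:

# position = "dodge" draws the positive and negative bars next to each other
ggplot(data = hp_props, aes(x = book, y = prop, fill = sentiment)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Proportion of sentiment in Harry Potter books", x = NULL, y = NULL)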
Starting with the hp_tidy data set, can you inner_join() the AFINN sentiment lexicon (get_sentiments("afinn")) and then summarise the total sentiment score of each chapter in each book?
hp_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, chapter) %>%
  summarise(score = sum(score)) %>%
  ungroup()
hp_scores <- hp_tidy %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(book, chapter) %>%
  summarise(score = sum(score)) %>%
  ungroup()
For this chart, let’s show the computed scores over time (chapters) and create a separate chart for each book by supplying the book column to the facet_wrap() function.
ggplot(data = hp_scores, aes(x = ___, y = ___, fill = ___ > 0)) +
  facet_wrap(~___, ncol = 1) +
  geom_col(show.legend = FALSE) +
  labs(title = "Add your title here...", x = "___", y = "___")
ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 2) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0, colour = "black") +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score")
After you have a chart, try modifying some of the elements…
- Change the ncol = argument in the facet_wrap() call. What has changed?
- Add a scales = "free_x" argument inside facet_wrap(). What has changed?
- Add a theme layer (linked with a +) to use a different chart theme… (one possible combination is sketched below)
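For instance, a variant combining those tweaks might look like this (a sketch; theme_minimal() is just one of ggplot2’s built-in themes):

# Four facet columns, free x scales per facet, and a built-in theme
ggplot(data = hp_scores, aes(x = chapter, y = score, fill = score > 0)) +
  facet_wrap(~book, ncol = 4, scales = "free_x") +
  geom_col(show.legend = FALSE) +
  theme_minimal() +
  labs(title = "Sentiment score by chapter", x = "Chapter", y = "Sentiment Score")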
A typical tidy text analysis with R…
# your data set with a text column
sent_analysis <- text_data %>%
  # tidy, remove stop words, join sentiments
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("...")) %>%
  # do some analysis
  group_by(...) %>%
  summarise(...) %>%
  mutate(...) %>%
  ungroup()

# visualise your results
ggplot(sent_analysis, aes(...)) +
  geom_*() +
  labs(title = "...")
Next up we’re going to switch over to RStudio where we can go through all the steps of performing a text analysis of our own…