ParseR combines functionality from the tidytext package and the tidygraph package to enable users to create network visualisations of common terms in a data set.
We'll work through an example by creating a bi-gram network from a sample of the data set included in the ParseR package.
# Generate a sample
library(magrittr)  # provides the %>% pipe used throughout
set.seed(1)
example <- ParseR::sprinklr_export %>%
  dplyr::sample_n(1000)
Each post will be broken down into bi-grams (i.e. pairs of words) and the 25 most frequent bi-grams will be returned.
counts <- example %>%
  ParseR::count_ngram(text_var = Message, n = 2, top_n = 25)
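To make the bi-gram step concrete, here is a minimal sketch of the general technique using tidytext directly. It is not ParseR's internal implementation, which may clean the text or remove stop words first; the `posts` data and the `bigram_counts` name are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical posts standing in for the Message column
posts <- tibble::tibble(
  Message = c(
    "celebrating hispanic heritage month",
    "hispanic heritage month starts today",
    "happy hispanic heritage celebration"
  )
)

bigram_counts <- posts %>%
  # Split each post into overlapping pairs of consecutive words
  tidytext::unnest_tokens(bigram, Message, token = "ngrams", n = 2) %>%
  # Put the two words of each pair into their own columns
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Count how often each pair occurs, most frequent first
  dplyr::count(word1, word2, name = "ngram_freq", sort = TRUE) %>%
  # Keep only the most frequent pairs (the role top_n plays above)
  dplyr::slice_max(ngram_freq, n = 2)

bigram_counts
```

The resulting columns mirror the word1, word2 and ngram_freq layout of the table that count_ngram() returns.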
Note that counts is a list object:
class(counts)
## [1] "list"
It contains two objects. The first, view, is a tibble of the most frequent bi-grams:
counts %>%
  purrr::pluck("view")
## # A tibble: 25 × 3
## word1 word2 ngram_freq
## <chr> <chr> <int>
## 1 hispanic heritage 503
## 2 heritage month 253
## 3 heritage celebration 39
## 4 celebrating hispanic 37
## 5 last day 33
## 6 national hispanic 33
## 7 beto orourke 32
## 8 bobby beto 30
## 9 hispanic caucus 30
## 10 lacks hispanic 30
## # … with 15 more rows
The second, viz, is a tbl_graph object that can be used to produce a network visualisation:
counts %>%
  purrr::pluck("viz")
## # A tbl_graph: 27 nodes and 25 edges
## #
## # A directed acyclic simple graph with 3 components
## #
## # Node Data: 27 × 2 (active)
## word word_freq
## <chr> <int>
## 1 hispanic 638
## 2 heritage 531
## 3 month 275
## 4 day 107
## 5 celebration 96
## 6 us 94
## # … with 21 more rows
## #
## # Edge Data: 25 × 3
## from to ngram_freq
## <int> <int> <int>
## 1 1 2 503
## 2 2 3 253
## 3 2 5 39
## # … with 22 more rows
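The node and edge tables in a tbl_graph can be pulled out as ordinary tibbles with tidygraph's `activate()`. The sketch below builds a small hypothetical graph with the same column names as the output above (the values and object names are illustrative only) and maps the integer `from`/`to` indices back to words.

```r
library(tidygraph)

# Hypothetical miniature graph with the same columns as the printed output
graph <- tbl_graph(
  nodes = tibble::tibble(
    word = c("hispanic", "heritage", "month"),
    word_freq = c(638L, 531L, 275L)
  ),
  edges = tibble::tibble(
    from = c(1L, 2L),
    to = c(2L, 3L),
    ngram_freq = c(503L, 253L)
  )
)

# Extract each component as a plain tibble
node_tbl <- graph %>% activate(nodes) %>% as_tibble()
edge_tbl <- graph %>% activate(edges) %>% as_tibble()

# from and to are row positions in the node table, so index into it
edges_labelled <- edge_tbl %>%
  dplyr::mutate(word1 = node_tbl$word[from], word2 = node_tbl$word[to]) %>%
  dplyr::select(word1, word2, ngram_freq)

edges_labelled
```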
Now we can use the tbl_graph object we generated with count_ngram() to produce a network visualisation.
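The plotting call itself is not reproduced in this section. As a general sketch, independent of any plotting helper ParseR may provide, a tbl_graph can be drawn with the ggraph package. The graph below is a hypothetical miniature with the same node and edge columns as above; the layout and aesthetic choices are assumptions, not the package's defaults.

```r
library(ggplot2)
library(ggraph)
library(tidygraph)

# Hypothetical miniature graph with the same columns as the viz object
graph <- tbl_graph(
  nodes = tibble::tibble(
    word = c("hispanic", "heritage", "month", "celebration"),
    word_freq = c(638L, 531L, 275L, 96L)
  ),
  edges = tibble::tibble(
    from = c(1L, 2L, 2L),
    to = c(2L, 3L, 4L),
    ngram_freq = c(503L, 253L, 39L)
  )
)

set.seed(1)  # the force-directed layout is random, so fix the seed
p <- ggraph(graph, layout = "fr") +
  # Darker edges for more frequent bi-grams
  geom_edge_link(aes(edge_alpha = ngram_freq), show.legend = FALSE) +
  # Larger points for more frequent words
  geom_node_point(aes(size = word_freq), show.legend = FALSE) +
  geom_node_text(aes(label = word), vjust = 1.5) +
  theme_void()

p
```

With the real data, `graph` would be replaced by the viz object plucked from counts.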
We can also use the term_context function to plot the terms that most frequently precede and follow a term of our choice, and so gain a better understanding of how that term is used within the data.
latin_context <- example %>%
  ParseR::term_context(text_var = Message,
                       term = "latin",
                       preceding_n = 2,
                       proceeding_n = 1,
                       top_n = 10)
The output contains two objects, plot and frequencies:
latin_context %>%
  purrr::pluck("plot")
latin_context %>%
  purrr::pluck("frequencies")
## # A tibble: 13 × 5
## `x-2` `x-1` x `x+1` n
## <chr> <chr> <chr> <chr> <int>
## 1 you love latin music 1
## 2 tastes of latin america 1
## 3 space coast latin festival 3
## 4 on any latin american 1
## 5 mexican and latin american 1
## 6 is a latin american 1
## 7 in the latin and 1
## 8 in the latin pride 1
## 9 halloween in latin american 1
## 10 flags of latin american 1
## 11 contemporary female latin american 1
## 12 by the latin ballet 1
## 13 and traditional latin american 1
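A frequencies table of this shape can be approximated with tidytext alone. The sketch below is an illustration of the idea rather than term_context()'s actual implementation: it assumes a fixed window of two preceding words and one following word, uses made-up posts, and ignores occurrences of the term that sit too close to the start or end of a post to fill the window.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical posts standing in for the Message column
posts <- tibble::tibble(
  Message = c(
    "space coast latin festival this weekend",
    "join the space coast latin festival",
    "you love latin music"
  )
)

context <- posts %>%
  # Slide a four-word window over each post
  tidytext::unnest_tokens(window, Message, token = "ngrams", n = 4) %>%
  # Name the positions relative to the term of interest (third word)
  tidyr::separate(window, into = c("x-2", "x-1", "x", "x+1"), sep = " ") %>%
  # Keep only windows where the term sits in the x position
  dplyr::filter(x == "latin") %>%
  # Count identical contexts, most frequent first
  dplyr::count(`x-2`, `x-1`, x, `x+1`, sort = TRUE)

context
```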