ParseR combines functionality from the tidytext package and the tidygraph package to enable users to create network visualisations of common terms in a data set.

We’ll work through an example by creating a bi-gram network from a sample of the data set included in the ParseR package.

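# Load the pipe operator used throughout this example
library(magrittr)

# Generate a reproducible sample of 1,000 posts
set.seed(1)
example <- ParseR::sprinklr_export %>%
  dplyr::sample_n(1000)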

Count the n-grams

Each post will be broken down into bi-grams (i.e. pairs of words) and the 25 most frequent bi-grams will be returned.

counts <- example %>%
  ParseR::count_ngram(text_var = Message, n = 2, top_n = 25)
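
To make the counting step concrete, here is a minimal base R sketch of what "break each post into bi-grams and keep the most frequent" means. It is not ParseR's implementation: the `posts` vector is hypothetical toy data, and the real function may clean the text differently (for example, by removing stop words).

```r
# Hypothetical toy posts standing in for the Message column
posts <- c(
  "Celebrating Hispanic Heritage Month",
  "Hispanic Heritage Month starts today",
  "Happy Hispanic Heritage celebration"
)

# Lower-case each post and split it into words on whitespace
tokens <- strsplit(tolower(posts), "\\s+")

# Pair each word with the word that follows it, within a single post
bigrams <- unlist(lapply(tokens, function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}))

# Count the bi-grams and keep the two most frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
top_bigrams <- head(bigram_counts, 2)
print(top_bigrams)
```

In this toy data "hispanic heritage" appears three times and "heritage month" twice, so those two pairs are the ones kept.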

Note that counts is a list object:

class(counts)
## [1] "list"

It contains two objects:

  1. “view”
  • A human-readable tibble with the most common n-grams.
counts %>%
  purrr::pluck("view")
## # A tibble: 25 × 3
##    word1       word2       ngram_freq
##    <chr>       <chr>            <int>
##  1 hispanic    heritage           503
##  2 heritage    month              253
##  3 heritage    celebration         39
##  4 celebrating hispanic            37
##  5 last        day                 33
##  6 national    hispanic            33
##  7 beto        orourke             32
##  8 bobby       beto                30
##  9 hispanic    caucus              30
## 10 lacks       hispanic            30
## # … with 15 more rows
  2. “viz”
  • A tbl_graph object that can be used to produce a network visualisation.
counts %>%
  purrr::pluck("viz")
## # A tbl_graph: 27 nodes and 25 edges
## #
## # A directed acyclic simple graph with 3 components
## #
## # Node Data: 27 × 2 (active)
##   word        word_freq
##   <chr>           <int>
## 1 hispanic          638
## 2 heritage          531
## 3 month             275
## 4 day               107
## 5 celebration        96
## 6 us                 94
## # … with 21 more rows
## #
## # Edge Data: 25 × 3
##    from    to ngram_freq
##   <int> <int>      <int>
## 1     1     2        503
## 2     2     3        253
## 3     2     5         39
## # … with 22 more rows
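
The `from` and `to` columns in the edge data are row positions in the node data, not words. The base R sketch below illustrates this by rebuilding the first three bi-grams from the rows printed above; the data frames are typed in by hand for illustration rather than extracted from the tbl_graph.

```r
# First five nodes and first three edges, copied from the printed output above
nodes <- data.frame(
  word = c("hispanic", "heritage", "month", "day", "celebration"),
  word_freq = c(638L, 531L, 275L, 107L, 96L),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(1L, 2L, 2L),
  to = c(2L, 3L, 5L),
  ngram_freq = c(503L, 253L, 39L)
)

# Translate the integer indices back into the words they point to
edge_words <- data.frame(
  word1 = nodes$word[edges$from],
  word2 = nodes$word[edges$to],
  ngram_freq = edges$ngram_freq,
  stringsAsFactors = FALSE
)
print(edge_words)
```

The rebuilt rows match the first three rows of the “view” tibble, which shows that both objects describe the same bi-grams in different shapes.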

Visualise the network

Now we can use the tbl_graph object we generated using count_ngram() to produce a network visualisation.

counts %>%
  purrr::pluck("viz") %>%
  ParseR::viz_ngram(emphasis = TRUE)

Term Context

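We can also use the term_context() function to plot the terms that most frequently precede or follow a term of our choice, which gives a better understanding of how that term is used within the data. The preceding_n and proceeding_n arguments set how many words before and after the term are included.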

latin_context <- example %>%
  ParseR::term_context(text_var = Message, 
                       term = "latin",
                       preceding_n = 2,
                       proceeding_n = 1,
                       top_n = 10)

The output contains two objects:

  1. “plot”
  • A graph object which displays the relationships between terms.
latin_context %>%
  purrr::pluck("plot")

  2. “frequencies”
  • A tibble showing how often each n-gram occurs in the data.
latin_context %>%
  purrr::pluck("frequencies")
## # A tibble: 13 × 5
##    `x-2`        `x-1`       x     `x+1`        n
##    <chr>        <chr>       <chr> <chr>    <int>
##  1 you          love        latin music        1
##  2 tastes       of          latin america      1
##  3 space        coast       latin festival     3
##  4 on           any         latin american     1
##  5 mexican      and         latin american     1
##  6 is           a           latin american     1
##  7 in           the         latin and          1
##  8 in           the         latin pride        1
##  9 halloween    in          latin american     1
## 10 flags        of          latin american     1
## 11 contemporary female      latin american     1
## 12 by           the         latin ballet       1
## 13 and          traditional latin american     1
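
The idea of a context window can be sketched in base R as follows. This is an illustration only, not how term_context() is implemented: the `posts` vector is hypothetical toy data, `context_window()` is a helper made up for this example, and dropping occurrences that sit too close to the start or end of a post is an assumption of the sketch.

```r
# Hypothetical toy posts standing in for the Message column
posts <- c(
  "we love latin music",
  "the space coast latin festival returns",
  "visit the space coast latin festival"
)

# Return the words around each occurrence of `term` in one post
context_window <- function(text, term, preceding_n, proceeding_n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  hits <- which(words == term)
  # Keep only occurrences with enough words on both sides (sketch assumption)
  hits <- hits[hits > preceding_n & hits + proceeding_n <= length(words)]
  vapply(hits, function(i) {
    paste(words[(i - preceding_n):(i + proceeding_n)], collapse = " ")
  }, character(1))
}

# Collect the windows from every post and count how often each one occurs
windows <- unlist(lapply(posts, context_window,
                         term = "latin", preceding_n = 2, proceeding_n = 1))
window_counts <- sort(table(windows), decreasing = TRUE)
print(window_counts)
```

With two words before and one word after "latin", the window "space coast latin festival" is found twice and "we love latin music" once, mirroring the shape of the “frequencies” tibble.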