ParseR combines functionality from the widyr and tidygraph packages so that users can create network visualisations of the words most strongly correlated with terms they specify.

We’ll work through an example using a sample of the data set included in the ParseR package.

# Generate a sample
set.seed(1)
example <- ParseR::sprinklr_export %>%
  dplyr::sample_n(1000)

## Calculate the pairwise correlations

Each post is broken down into individual words, and the words whose occurrence correlates with the terms we’re interested in are returned.
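To make that step concrete, here is a minimal sketch of tokenising posts and correlating word occurrence on a toy data set. It assumes the tidytext package for tokenisation and uses `widyr::pairwise_cor()`; the `posts` tibble and its `post_id` column are made up for illustration, and this is not necessarily how `calculate_corr()` works internally.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Hypothetical toy posts (not taken from ParseR::sprinklr_export)
posts <- tibble::tibble(
  post_id = 1:4,
  Message = c("hispanic heritage month events",
              "celebrating hispanic heritage",
              "heritage month begins today",
              "local events today")
)

word_cors <- posts %>%
  # One row per (post, word) pair; text is lower-cased by default
  tidytext::unnest_tokens(word, Message) %>%
  # Correlate words according to which posts they appear in
  widyr::pairwise_cor(word, post_id, sort = TRUE) %>%
  # Keep only the pairs that involve a term of interest
  dplyr::filter(item1 == "hispanic")

word_cors
```

The result has one row per word pair, with the columns `item1`, `item2` and `correlation`.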

• The correlation we’re calculating and using here is called the phi coefficient and is denoted by $\phi$.
• It’s a measure of association for two binary variables.
• For a pair of words we can interpret it as how much more likely it is that both or neither of the words appear in a document than that either one appears alone.
• For more information, check out either tidytextmining or Wikipedia.
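As a small worked illustration of the phi coefficient, the sketch below computes it by hand from the 2 × 2 table of co-occurrence counts for two made-up binary indicators (`x` and `y` are hypothetical: 1 means the word appears in that document). For binary variables the result matches the ordinary Pearson correlation.

```r
# Hypothetical presence/absence of two words across 8 documents
x <- c(1, 1, 1, 0, 0, 0, 1, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)

n11 <- sum(x == 1 & y == 1)  # both words appear
n00 <- sum(x == 0 & y == 0)  # neither word appears
n10 <- sum(x == 1 & y == 0)  # only the first word appears
n01 <- sum(x == 0 & y == 1)  # only the second word appears

# phi = (n11 * n00 - n10 * n01) / sqrt(product of the four marginal totals)
phi <- (n11 * n00 - n10 * n01) /
  sqrt(sum(x == 1) * sum(x == 0) * sum(y == 1) * sum(y == 0))

phi        # 0.5
cor(x, y)  # 0.5, the Pearson correlation of the same indicators
```

Here the two words appear together or are both absent in six of the eight documents, and appear alone in only two, which gives a moderately positive $\phi$.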
corrs <- ParseR::calculate_corr(
  # We must specify the data set we're using
  df = example,
  # We must specify the text variable in our dataset
  text_var = Message,
  # We must specify terms we're interested in
  terms = c("hispanic",                # Can use single words
            "hispanic heritage",       # Can use multi-word phrases (e.g. brands, names)
            "#hispanicheritagemonth"), # Can use hashtags
  # We can specify a minimum term frequency
  min_freq = 25,
  # We can specify correlation limits
  corr_limits = c(0, 1), # E.g. We only want positive correlations
  # We can specify the top_n correlations to include
  n_corr = 50,
  # We can specify whether to include hashtags in the text
  hashtags = TRUE)

Note that corrs is a list object:

class(corrs)
## [1] "list"

It contains two objects:

1. “view”
• A human-readable tibble with the top correlations involving our terms of interest.
corrs %>%
  purrr::pluck("view")
## # A tibble: 52 × 3
##    from              to         correlation
##    <chr>             <chr>            <dbl>
##  1 hispanic          membership       0.502
##  2 hispanic          caucus           0.500
##  3 hispanic          beto             0.484
##  4 hispanic_heritage month            0.482
##  5 hispanic          refuses          0.475
##  6 hispanic          bobby            0.475
##  7 hispanic          lacks            0.475
##  8 hispanic          orourke          0.474
##  9 hispanic          flashback        0.457
## 10 hispanic          via              0.264
## # … with 42 more rows
2. “viz”
• A tbl_graph object that can be used to produce a network visualisation.
corrs %>%
  purrr::pluck("viz")
## # A tbl_graph: 40 nodes and 52 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 40 × 2 (active)
##   word              term_freq
##   <chr>                 <int>
## 1 hispanic                135
## 2 membership               31
## 3 caucus                   35
## 4 beto                     45
## 5 hispanic_heritage       503
## 6 month                   275
## # … with 34 more rows
## #
## # Edge Data: 52 × 3
##    from    to correlation
##   <int> <int>       <dbl>
## 1     1     2       0.502
## 2     1     3       0.500
## 3     1     4       0.484
## # … with 49 more rows
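Because “viz” is a tbl_graph, it can be manipulated with tidygraph and dplyr verbs before plotting. The sketch below rebuilds a small graph by hand from the first few nodes and edges printed above (so it runs without ParseR) and keeps only the stronger edges; the 0.49 cut-off is an arbitrary value chosen for illustration.

```r
library(dplyr)
library(tidygraph)

# First four nodes and first three edges from the printed output above
nodes <- data.frame(
  word      = c("hispanic", "membership", "caucus", "beto"),
  term_freq = c(135L, 31L, 35L, 45L)
)
edges <- data.frame(
  from        = c(1L, 1L, 1L),
  to          = c(2L, 3L, 4L),
  correlation = c(0.502, 0.500, 0.484)
)

graph <- tidygraph::tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

# Switch to the edge table and keep only edges above the cut-off
strong <- graph %>%
  tidygraph::activate(edges) %>%
  dplyr::filter(correlation > 0.49)

strong
```

Filtering edges leaves the node table untouched, so a node whose edges have all been removed stays in the graph as an isolated point unless it is filtered out separately.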

## Visualise the network

Now we can use the tbl_graph object we generated using calculate_corr() to produce a network visualisation.

corrs %>%
  purrr::pluck("viz") %>%
  ParseR::viz_corr()
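For readers who want to customise the plot, a tbl_graph can also be drawn directly. The following is a rough sketch that assumes the ggraph package, which is not mentioned above; it is not the implementation of `viz_corr()`, and the small graph is rebuilt by hand from the printed output so that the example runs on its own.

```r
library(ggraph)  # also attaches ggplot2

# Small graph rebuilt from the first few nodes and edges shown earlier
graph <- tidygraph::tbl_graph(
  nodes = data.frame(
    word      = c("hispanic", "membership", "caucus", "beto"),
    term_freq = c(135L, 31L, 35L, 45L)
  ),
  edges = data.frame(
    from        = c(1L, 1L, 1L),
    to          = c(2L, 3L, 4L),
    correlation = c(0.502, 0.500, 0.484)
  ),
  directed = FALSE
)

set.seed(1)  # the force-directed layout starts from random positions
p <- ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +  # stronger correlation, darker edge
  geom_node_point(aes(size = term_freq)) +         # more frequent term, larger node
  geom_node_text(aes(label = word), vjust = -1) +
  theme_void()

p
```

Mapping correlation to edge transparency and term frequency to node size is one possible design choice; other aesthetics such as edge width or node colour would work equally well.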