vignettes/4_pairwise_correlation.Rmd
ParseR combines functionality from the widyr package and the tidygraph package to enable users to create network visualisations of the pairwise correlations with specified terms.
We’ll work through an example using a sample of the data set included in the ParseR package.
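# Load the pipe operator used throughout this example
library(magrittr)
# Generate a reproducible sample of 1,000 posts
set.seed(1)
example <- ParseR::sprinklr_export %>%
  dplyr::sample_n(1000)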
Each post is broken down into individual words, and the words whose occurrence is correlated with the terms we’re interested in are returned.
corrs <- ParseR::calculate_corr(
  # We must specify the data set we're using
  df = example,
  # We must specify the text variable in our data set
  text_var = Message,
  # We must specify the terms we're interested in
  terms = c("hispanic",                # Can use single words
            "hispanic heritage",       # Can use multi-word phrases (e.g. brands, names)
            "#hispanicheritagemonth"), # Can use hashtags
  # We can specify a minimum term frequency
  min_freq = 25,
  # We can specify correlation limits
  corr_limits = c(0, 1), # E.g. we only want positive correlations
  # We can specify the top_n correlations to include
  n_corr = 50,
  # We can specify whether to include hashtags in the text
  hashtags = TRUE)
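To make the idea of pairwise correlation with a term more concrete, the following is a minimal base R sketch on made-up posts. It assumes that correlation is measured on whether each word is present in each post (the phi coefficient, which is what Pearson correlation reduces to for binary data). The toy posts and the way min_freq, corr_limits and n_corr are applied here are illustrative assumptions, not ParseR's actual implementation.

```r
# Toy posts standing in for the Message column (made up for illustration)
posts <- c(
  "hispanic caucus membership vote",
  "hispanic caucus meeting today",
  "heritage month celebration today",
  "caucus vote today"
)

# Split each post into lower-case words, counting a word once per post
tokens <- lapply(strsplit(tolower(posts), "\\s+"), unique)
vocab <- sort(unique(unlist(tokens)))

# Binary matrix: one row per post, one column per word (1 = word is present)
presence <- t(vapply(
  tokens,
  function(words) as.integer(vocab %in% words),
  integer(length(vocab))
))
colnames(presence) <- vocab

# Assumed analogue of min_freq: keep words found in at least 2 posts
min_freq <- 2
frequent <- presence[, colSums(presence) >= min_freq, drop = FALSE]

# Pearson correlation between presence columns (the phi coefficient)
word_cor <- cor(frequent)

# Correlations with the term of interest, excluding the term itself
term <- "hispanic"
term_cor <- word_cor[term, colnames(word_cor) != term]

# Assumed analogues of corr_limits and n_corr
corr_limits <- c(0, 1)
n_corr <- 2
in_limits <- term_cor[term_cor >= corr_limits[1] & term_cor <= corr_limits[2]]
top_corr <- head(sort(in_limits, decreasing = TRUE), n_corr)
print(top_corr)
```

In this sketch a word that tends to appear in the same posts as the term gets a positive correlation, while a word that tends to appear in other posts gets a negative one and is dropped by the lower limit of 0.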
Note that corrs is a list object:
class(corrs)
## [1] "list"
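Because the result is a plain named list, its elements can also be reached with base R indexing. The sketch below uses a hypothetical stand-in list (corrs_demo, with placeholder contents) that only mimics the element names view and viz used in this vignette; for a named list, purrr::pluck(x, "view") returns the same element as x[["view"]].

```r
# Hypothetical stand-in for the list returned by calculate_corr();
# the contents are placeholders, only the element names matter here
corrs_demo <- list(
  view = data.frame(
    from = "hispanic",
    to = c("membership", "caucus"),
    correlation = c(0.502, 0.500),
    stringsAsFactors = FALSE
  ),
  viz = "placeholder for the tbl_graph object"
)

# List the names of the elements
print(names(corrs_demo))

# Two equivalent base R ways to extract an element by name
by_brackets <- corrs_demo[["view"]]
by_dollar <- corrs_demo$view
print(by_brackets)
```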
It contains two objects. The first, view, is a tibble listing each pair of terms and their correlation:
corrs %>%
  purrr::pluck("view")
## # A tibble: 52 × 3
##    from              to         correlation
##    <chr>             <chr>            <dbl>
##  1 hispanic          membership       0.502
##  2 hispanic          caucus           0.500
##  3 hispanic          beto             0.484
##  4 hispanic_heritage month            0.482
##  5 hispanic          refuses          0.475
##  6 hispanic          bobby            0.475
##  7 hispanic          lacks            0.475
##  8 hispanic          orourke          0.474
##  9 hispanic          flashback        0.457
## 10 hispanic          via              0.264
## # … with 42 more rows
The second, viz, is a tbl_graph object that can be used to produce a network visualisation:
corrs %>%
  purrr::pluck("viz")
## # A tbl_graph: 40 nodes and 52 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 40 × 2 (active)
##   word              term_freq
##   <chr>                 <int>
## 1 hispanic                135
## 2 membership               31
## 3 caucus                   35
## 4 beto                     45
## 5 hispanic_heritage       503
## 6 month                   275
## # … with 34 more rows
## #
## # Edge Data: 52 × 3
##    from    to correlation
##   <int> <int>       <dbl>
## 1     1     2       0.502
## 2     1     3       0.500
## 3     1     4       0.484
## # … with 49 more rows
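In a tbl_graph, the from and to columns of the edge data are integer positions that refer to rows of the node data. The base R sketch below rebuilds just the rows printed above as ordinary data frames (nodes and edges are illustrative names, not objects created by ParseR) and maps the indices back to words.

```r
# The six node rows and three edge rows printed above, as plain data frames
nodes <- data.frame(
  word = c("hispanic", "membership", "caucus", "beto",
           "hispanic_heritage", "month"),
  term_freq = c(135L, 31L, 35L, 45L, 503L, 275L),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(1L, 1L, 1L),
  to = c(2L, 3L, 4L),
  correlation = c(0.502, 0.500, 0.484)
)

# Edge endpoints are row positions in the node table, so indexing the
# word column by from/to recovers the words each edge connects
edges$from_word <- nodes$word[edges$from]
edges$to_word <- nodes$word[edges$to]
print(edges)
```

The recovered pairs match the first three rows of the view tibble, which is the same information in a more readable form.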
Now we can use the tbl_graph object we generated using calculate_corr() to produce a network visualisation.
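As a generic illustration of plotting a tbl_graph, here is a minimal sketch using the tidygraph and ggraph packages directly. It is not necessarily how ParseR draws its networks. The small graph is rebuilt from the rows printed earlier so that the example is self-contained, and the layout and aesthetic choices are assumptions made for illustration.

```r
library(tidygraph)
library(ggraph)  # also attaches ggplot2

# Small stand-in graph built from the node and edge rows printed earlier
nodes <- data.frame(
  word = c("hispanic", "membership", "caucus", "beto",
           "hispanic_heritage", "month"),
  term_freq = c(135L, 31L, 35L, 45L, 503L, 275L),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c(1L, 1L, 1L),
  to = c(2L, 3L, 4L),
  correlation = c(0.502, 0.500, 0.484)
)
graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

# Force-directed layout; the seed makes the node positions reproducible
set.seed(1)
p <- ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +   # stronger correlation, darker edge
  geom_node_point(aes(size = term_freq)) +          # more frequent word, larger node
  geom_node_text(aes(label = word), repel = TRUE) + # label nodes without overlap
  theme_void()

# Printing the object draws the network
p
```

Assuming the viz element has the word, term_freq and correlation columns shown above, it could be passed to ggraph() in place of graph.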