We are often faced with the question of how conversations differ
between groups. One way of answering this question is to look at the
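weighted log odds ratio for terms used in each group. This
score indicates how much more or less likely a word is to be used by that
group than it is on average within the dataset: positive values mark words
that are over-represented in the group, and negative values mark words that
are under-represented. ParseR uses functions from the tidylo package to
calculate these scores.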
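To make the idea of a weighted log odds score more concrete, here is a minimal base R sketch on made-up counts. It is a simplified illustration, not tidylo's exact implementation: the counts are invented, the pseudo-count `alpha` is an assumed uninformative prior, and each group is compared against all other groups combined.

```r
# Toy word counts for two groups (made-up numbers, for illustration only)
counts <- data.frame(
  group = c("F", "F", "F", "M", "M", "M"),
  word  = c("learn", "celebrate", "month", "learn", "celebrate", "month"),
  n     = c(30, 10, 60, 10, 30, 60)
)

alpha <- 1  # assumed pseudo-count added to every word (uninformative prior)

# Totals per group, per word, and overall
group_total <- ave(counts$n, counts$group, FUN = sum)
word_total  <- ave(counts$n, counts$word, FUN = sum)
grand_total <- sum(counts$n)
n_words     <- length(unique(counts$word))

# Counts for the same word, and for all words, in the other groups
n_other     <- word_total - counts$n
other_total <- grand_total - group_total

# Smoothed odds of the word inside the group and outside it
odds_in  <- (counts$n + alpha) / (group_total + alpha * n_words - counts$n - alpha)
odds_out <- (n_other + alpha) / (other_total + alpha * n_words - n_other - alpha)

# Log odds ratio, scaled by its approximate standard deviation
log_odds_ratio <- log(odds_in) - log(odds_out)
variance       <- 1 / (counts$n + alpha) + 1 / (n_other + alpha)
counts$log_odds_weighted <- log_odds_ratio / sqrt(variance)

counts
```

In this toy table, "learn" gets a positive score for F and a negative one for M, while "month", which is used equally often by both groups, scores zero.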
To demonstrate how to use the calculate_wlos function, we will use the example data included within the ParseR package.
# Example data
example <- ParseR::sprinklr_export
Let’s say we want to compare how males and females talk about Hispanic Heritage Month.
# Remove rows with no gender information
example <- example %>%
  dplyr::filter(SenderGender != 'NA')
# Calculate WLOs
wlos <- ParseR::calculate_wlos(example,
                               topic_var = SenderGender,
                               text_var = Message,
                               top_n = 30)
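As a rough idea of the kind of steps a call like this might involve, the sketch below computes weighted log odds directly with tidytext and tidylo on a few invented messages. It is an assumption about the general workflow, not ParseR's actual internals; the `toy` data and the `toy_wlos` name are hypothetical.

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Toy messages, invented for illustration only
toy <- tibble::tibble(
  SenderGender = c("F", "F", "M", "M"),
  Message = c(
    "learning about hispanic heritage month",
    "learning history this month",
    "celebrating hispanic heritage month",
    "celebrating with music this month"
  )
)

toy_wlos <- toy %>%
  unnest_tokens(word, Message) %>%   # one row per word, lower-cased
  count(SenderGender, word) %>%      # word counts within each group
  bind_log_odds(set = SenderGender, feature = word, n = n) %>%
  arrange(desc(log_odds_weighted))   # highest scores first

toy_wlos
```

The `bind_log_odds()` step adds a `log_odds_weighted` column, the same column name that appears in the ParseR output.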
The returned object, wlos, is a list that contains two items:
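view: a human-readable tibble that contains the term count (n) and the weighted log-odds (log_odds_weighted) for each term in each group. The output below has 3,961 rows, so the tibble is not limited to the top_n = 30 terms.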
wlos$view
## # A tibble: 3,961 × 4
## SenderGender word n log_odds_weighted
## <chr> <chr> <int> <dbl>
## 1 F hispanic 254 -1.01
## 2 F heritage 201 -1.15
## 3 F twitter 179 -0.797
## 4 M hispanic 171 1.16
## 5 M heritage 147 1.32
## 6 M twitter 119 0.916
## 7 F hispanicheritagemonth 107 -0.157
## 8 F month 94 -0.583
## 9 M month 63 0.669
## 10 M hispanicheritagemonth 57 0.182
## # … with 3,951 more rows
viz: a plot that visualises terms in the view tibble with term frequency on the x-axis, and weighted log-odds on the y-axis.
wlos$viz
The plot suggests that men are more likely to associate Hispanic Heritage Month with its celebratory aspects, whereas women more often discuss what they can learn from it.
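The view tibble can also be explored directly, for example to pull out the highest-scoring terms per group. The sketch below is a minimal illustration that rebuilds only the ten rows printed above (the `view_head` and `top_terms` names are hypothetical), so its result does not reflect the full 3,961-row tibble; in practice the same pipeline could be applied to wlos$view.

```r
library(dplyr)

# First ten rows of wlos$view, copied from the printed output above
view_head <- tibble::tribble(
  ~SenderGender, ~word,                   ~n,  ~log_odds_weighted,
  "F",           "hispanic",              254, -1.01,
  "F",           "heritage",              201, -1.15,
  "F",           "twitter",               179, -0.797,
  "M",           "hispanic",              171,  1.16,
  "M",           "heritage",              147,  1.32,
  "M",           "twitter",               119,  0.916,
  "F",           "hispanicheritagemonth", 107, -0.157,
  "F",           "month",                  94, -0.583,
  "M",           "month",                  63,  0.669,
  "M",           "hispanicheritagemonth",  57,  0.182
)

# Two highest-scoring terms within each group
top_terms <- view_head %>%
  group_by(SenderGender) %>%
  slice_max(log_odds_weighted, n = 2) %>%
  ungroup()

top_terms
```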