We are often faced with the question of how conversations differ
between groups. One way of answering this question is to look at the
weighted log odds ratio (WLOs) for terms used in each group.
This value measures the strength of association between a word and a
target category in a corpus (or dataset). ParseR uses functions from the
tidylo
package to calculate these. This very nice
blog by Sharon Howard outlines the utility of WLOs compared to other
methodologies to indentify word importance, such as tf-idf and this
blog post by Tyler Schnoebelen is also recommended reading.
To demonstrate how to use the calculate_wlos
function,
we will use the example data included within the ParseR package.
# Example data
example <- ParseR::sprinklr_export
Calculate Weighted Log-Odds Ratios
Let’s say we want to compare how males and females talk about Hispanic Heritage Month.
# Remove rows with no gender information
example <- example %>%
dplyr::filter(SenderGender != "NA")
# Calculate WLOs
wlos <- ParseR::calculate_wlos(example,
topic_var = SenderGender,
text_var = Message,
top_n = 30
)
example
is a list object that contains two items:
-
view: a human-readable tibble that contains the weighted log-odds for each of the top_n = 30 terms
wlos$view
## # A tibble: 60 × 4 ## SenderGender word n log_odds_weighted ## <chr> <chr> <int> <dbl> ## 1 M heritage 147 1.38 ## 2 M celebration 43 1.42 ## 3 F power 13 2.12 ## 4 M festival 13 1.23 ## 5 F put 12 2.04 ## 6 F generations 11 1.95 ## 7 F paid 10 1.86 ## 8 F panel 10 1.86 ## 9 F sources 10 1.86 ## 10 F tribute 10 1.86 ## # ℹ 50 more rows
-
viz: a plot that visualises terms in the view tibble with term frequency on the x-axis, and weighted log-odds on the y-axis.
wlos$viz
What we can learn from the plot is that men appear more likely to associate Hispanic Heritage Month with the celebratory aspects, whereas women discuss the learnings they can take from it.
Advice on interpreting Weighted Log-Odds Ratios
As mentioned at the beginning of this document, WLO is a statistical measure used to determine which words are most strongly associated with a particular category. This is calculated by comparing the frequency of each word in the target category to its frequency in all other categories within the corpus (or dataset).
The value can be interpreted as the logarithm of the ratio of the probability of observing a word in the target category to the probability of observing the same word in all other categories combined. A WLO value > 0 indicates the word or phrase is more strongly associated with the target category than with other categories, whilst a value <0 indicates the others.
In other words, it’s important to remember that the magnitude of a WLO value reflects the strength of the association, but it is not directly interpretable as a probability or frequency. Rather, it reflects the logarithmic difference between two probabilities (or odds), and should be treated as a relative measure of association.
Therefore, when reporting WLO to clients, one must refrain from using phrases such as “This term is X times as likely to appear in Category A than Category B and C”, and instead use phrases such as “This term has a stronger association with Category A than Category B and C”.