Calculates the percentage of documents within each group that contain specific terms (words or n-grams). This gives a different view of the groups from Weighted Log-odds.

Usage

calculate_gtc(df, group_var, text_var, ngram_n = 1, top_n = 20)

Arguments

df

A data frame containing the text data

group_var

Name of the grouping variable (quoted or unquoted)

text_var

Name of the text variable (quoted or unquoted)

ngram_n

Length of n-grams to consider (default: 1)

top_n

Number of top terms to return per group (default: 20)

Value

A data frame with group term coverage statistics

Details

GTC should help in checking the names, labels, or descriptions we have assigned to groups. It is primarily an internal tool and is not expected to be useful for communicating results. The function can check n-grams up to `ngram_n = 5`, but 5-grams should be present in only a very low proportion of documents. You will most likely want to set this parameter to 1 (words) or 2 (bigrams). Bigrams will be more informative than words, but their proportions will also be significantly lower.
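
To make the idea of coverage concrete, here is a minimal base R sketch of the quantity described above: the percentage of documents in each group that contain a given n-gram. It is not the implementation of `calculate_gtc()`. The toy data and the helper names `doc_ngrams()` and `coverage_sketch()` are hypothetical, and the tokenisation (lower-casing and splitting on whitespace) is an assumption made for illustration only.

```r
# Hypothetical toy data: two groups of short documents.
docs <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  text  = c("cheap flights to rome", "cheap hotels in rome", "rome flights",
            "train tickets", "cheap train tickets"),
  stringsAsFactors = FALSE
)

# Unique n-grams of one document (assumed tokenisation: lower-case, split on whitespace).
doc_ngrams <- function(text, n = 1) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # unique() so that a term is counted at most once per document
  unique(vapply(starts,
                function(i) paste(words[i:(i + n - 1)], collapse = " "),
                character(1)))
}

# Percentage of documents in each group that contain each n-gram.
coverage_sketch <- function(df, n = 1) {
  by_group <- split(df$text, df$group)
  pieces <- Map(function(g, texts) {
    counts <- table(unlist(lapply(texts, doc_ngrams, n = n)))
    data.frame(group = g,
               term = names(counts),
               pct_docs = as.numeric(counts) / length(texts) * 100,
               stringsAsFactors = FALSE)
  }, names(by_group), by_group)
  out <- do.call(rbind, unname(pieces))
  rownames(out) <- NULL
  out[order(out$group, -out$pct_docs), ]
}

uni <- coverage_sketch(docs, n = 1)  # words
bi  <- coverage_sketch(docs, n = 2)  # bigrams
```

In this toy data the word "rome" appears in every document of group "a", while no bigram in that group appears in more than one of its three documents. This is the trade-off noted above: bigrams are more specific, but their coverage is lower.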