Quickly Clean Text Data — clean

An opinionated function that offers helpful defaults for common text-cleaning needs. The function removes most URLs by default. Parallel processing is enabled by default (in_parallel = TRUE), which speeds the function up by roughly 3x on average.

Usage

clean_text(
  df,
  text_var = message,
  tolower = TRUE,
  remove_hashtags = TRUE,
  remove_mentions = TRUE,
  remove_all_non_ascii = TRUE,
  remove_punctuation = TRUE,
  remove_digits = TRUE,
  in_parallel = TRUE
)

Arguments

df: A tibble or data frame containing the text variable to be cleaned
text_var: The text variable (column) holding the message for each observation that the user wishes to clean
tolower: Should all text be converted to lower case?
remove_hashtags: Should hashtags be removed?
remove_mentions: Should any user/profile mentions be removed?
remove_all_non_ascii: Should non-ASCII characters be removed? This includes some accented characters (but not Latin accented letters), foreign characters, emojis etc.
remove_punctuation: Should punctuation be removed?
remove_digits: Should digits be removed?
in_parallel: Should the function run in parallel? (TRUE is usually faster)
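
As a rough illustration of what the logical arguments above toggle, here is a minimal base R sketch. The helper `sketch_clean()` and its regular expressions are assumptions made for illustration only; they are not the package's implementation, and the real patterns and order of steps may differ.

```r
# Hypothetical helper for illustration; NOT the implementation of clean_text().
sketch_clean <- function(x,
                         tolower = TRUE,
                         remove_hashtags = TRUE,
                         remove_mentions = TRUE,
                         remove_punctuation = TRUE,
                         remove_digits = TRUE) {
  # Strip URLs first so later steps do not leave fragments behind
  x <- gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
  if (remove_hashtags) x <- gsub("#\\w+", "", x, perl = TRUE)
  if (remove_mentions) x <- gsub("@\\w+", "", x, perl = TRUE)
  if (remove_digits) x <- gsub("[0-9]+", "", x)
  if (remove_punctuation) x <- gsub("[[:punct:]]+", "", x)
  # base:: prefix avoids any confusion with the logical argument of the same name
  if (tolower) x <- base::tolower(x)
  # Collapse repeated whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

sketch_clean("Check https://example.com NOW! #sale @shop 50% off")
# "check now off"
```

Setting an argument to FALSE skips the corresponding step, so `sketch_clean("Hello World!", tolower = FALSE, remove_punctuation = FALSE)` leaves the input unchanged.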

Value

The data frame provided, with a cleaned text variable.

Details

The function will remove rows from the data frame if those rows result in NA values once the cleaning steps have been applied.
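
The row-dropping behaviour can be pictured with a small base R sketch. The toy data frame, the stand-in cleaning pattern, and the choice to treat empty strings as NA are all assumptions for illustration; the function's actual rules may differ.

```r
# Toy data; the column name `message` mirrors the default text_var
df <- data.frame(
  id = 1:3,
  message = c("Hello world", "#tag @user", "12345"),
  stringsAsFactors = FALSE
)

# Stand-in cleaning step (assumption): strip hashtags, mentions and digits
cleaned <- trimws(gsub("[#@]\\w+|[0-9]+", "", df$message))

# Assumption: text that is empty after cleaning is treated as missing
cleaned[cleaned == ""] <- NA_character_
df$message <- cleaned

# Drop rows whose cleaned text is NA
df_out <- df[!is.na(df$message), , drop = FALSE]
nrow(df_out)
# 1
```

Because rows can disappear, it may be worth comparing `nrow()` before and after cleaning when row counts matter downstream.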

If you are using the function on a server or inside an application, especially a deployed one, you will most likely want to set `in_parallel` to FALSE. The function tries to remove emojis, non-ASCII characters, symbols etc. without removing Latin accented letters.
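
One way to remove emojis and other non-Latin characters while keeping Latin accented letters is a Unicode range filter. The sketch below is an assumption about how such a step could look, not the regular expression the package uses; `strip_non_latin()` is a hypothetical helper.

```r
# Hypothetical helper for illustration; not part of the package.
strip_non_latin <- function(x) {
  # Keep whitespace, printable ASCII (U+0020 to U+007E), and the Latin-1
  # Supplement plus Latin Extended-A blocks (U+00A0 to U+017F), which hold
  # letters such as e-acute and i-diaeresis. Everything else is dropped.
  # Note that U+00A0 to U+017F also contains a few symbols, which are kept.
  x <- gsub("[^\\s\\x{20}-\\x{7E}\\x{A0}-\\x{17F}]", "", x, perl = TRUE)
  # Collapse the gaps left behind by removed characters
  trimws(gsub("\\s+", " ", x))
}

# Unicode escapes keep this example independent of the file encoding:
# "cafe" with e-acute, a grinning-face emoji, "naive" with i-diaeresis,
# and two CJK characters
x <- "caf\u00e9 \U0001F600 na\u00efve \u65e5\u672c"
strip_non_latin(x)
```

Here the emoji and the CJK characters are removed, while the accented letters in the two Latin words survive.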

The remove_emojis argument was replaced with remove_all_non_ascii to better reflect what the original emoji removal RegEx was doing.

Examples

if (interactive()) {
  # Performs all cleaning steps in parallel
  cleaned_data <- clean_text(
    df = ParseR::sprinklr_export,
    text_var = Message,
    in_parallel = TRUE
  )

  # Performs all cleaning steps but keeps capital letters and punctuation
  cleaned_data <- clean_text(
    df = ParseR::sprinklr_export,
    text_var = Message,
    tolower = FALSE,
    remove_punctuation = FALSE,
    in_parallel = TRUE
  )
}