Working with right-to-left languages like Arabic in R can be a bit of a headache, especially when they are mixed with left-to-right languages (like English). Since my research involves a great deal of text analysis of Arabic news articles, I find myself with a lot of headaches. Most text analysis methods require some kind of normalization before diving into the actual analyses. Normalization includes things like removing punctuation, converting words to lowercase, stripping numbers out, and so on. This is essential for any kind of frequency-based analysis so that words such as don’t, Don’t, and dont are not counted as distinct words. After all, when dealing with human-generated text, typos and differences in presentation are bound to occur. Often, normalizing also includes stemming, so that words such as think, thinking, and thinks are all reduced to “think,” as they all represent (basically) the same concept.
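
To make the idea concrete, here is a minimal sketch of that kind of normalization for English, using only base R. The function name normalize_english is my own placeholder for illustration, not something from tm or any other package, and it skips stemming entirely.

```r
# Hypothetical helper for illustration only: a bare-bones English normalizer.
normalize_english <- function(x) {
  x <- tolower(x)                    # Convert to lowercase
  x <- gsub("[[:punct:]]", "", x)    # Remove punctuation
  x <- gsub("[[:digit:]]", "", x)    # Strip numbers
  x <- gsub("[[:space:]]+", " ", x)  # Collapse runs of whitespace
  trimws(x)                          # Remove leading and trailing whitespace
}

words <- c("don't", "Don't", "dont")
unique(normalize_english(words))  # All three collapse to a single form
```

After normalization the three spellings are counted as one word, which is the whole point for frequency-based analysis.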

In English, this is fairly easy and comes pre-packaged in many natural language processing packages (including the tm package for R). However, the state of Arabic natural language processing is not nearly as advanced as its English counterpart and, further, is much less accessible to the lay practitioner. This is almost certainly because of three factors: 1) the dominance of English in general, 2) the over-representation of English speakers in the software development and natural language processing fields, and 3) the fact that these procedures are not as cut and dried in Arabic as in other languages, especially with respect to stemming. Some examples of the problems that normalizing Arabic text presents are the attachment of waw directly to the following word and the prefixing of a word with faa, baa, or kaf, or some combination of these. It’s difficult to just remove leading waws, as there are words that begin with waw that would become garbled. Thus, to be conservative, it’s only safe to remove waws that precede the definite article.
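
A small sketch of the waw problem, assuming two example words I picked for illustration: وزير ("minister"), where the waw is part of the word itself, and والبيت ("and the house"), where the waw is a prefix in front of the article. The function names are placeholders, and the strings are written with Unicode escapes so the snippet does not depend on file encoding.

```r
# Example words chosen for illustration (not from any corpus)
wazir   <- "\u0648\u0632\u064A\u0631"              # وزير: waw belongs to the word
walbayt <- "\u0648\u0627\u0644\u0628\u064A\u062A"  # والبيت: waw + alef lam + word

# Naive approach: strip any leading waw
strip_waw_naive <- function(x) {
  gsub("^\\x{0648}", "", x, perl = TRUE)
}

# Conservative approach: strip a leading waw only when alef lam follows it
strip_waw_conservative <- function(x) {
  gsub("^\\x{0648}(?=\\x{0627}\\x{0644})", "", x, perl = TRUE)
}

strip_waw_naive(wazir)           # Garbles the word by dropping its first letter
strip_waw_conservative(wazir)    # Leaves the word untouched
strip_waw_conservative(walbayt)  # Drops only the prefixed waw
```

The conservative version will still miss a prefixed waw on a word with no article, which is the trade-off for not mangling words like وزير.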

All of this would be easier with a dictionary file that would stem words and only remove leading letters that didn’t add to the meaning. Such dictionaries exist, but they are not easily accessible without knowledge of another programming language. Further, stemming Arabic produces its own problems, as any Arabic student can tell you; it’s not uncommon for words with the same root to be antonyms.

Searching for help on doing this in R doesn’t turn up much, so I kludged together (and wow are they kludge-y) some basic regular expressions and folded them into a function. They’re not perfect (they still leave a lot of duplicated words that are slight variations on one another), and I welcome additions in the comments. Note that I have not removed common prepositions and connectors, as that is usually the next step in text normalization: such words carry no meaning in a bag-of-words approach to text analysis (more on that in a later post). Please excuse my transliteration of individual letters. And, as with all regular expressions, there are probably 10,000 other ways to do this. Note that the function requires the tm package.
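
For anyone unfamiliar with the notation, here is a rough sketch of the two regex ingredients these patterns lean on: Unicode code point escapes such as \x{0627} (alef) and lookaheads such as (?=\p{L}), which check that a letter follows without consuming it. The example word and the helper name strip_article are my own choices for illustration.

```r
# Look up the code point of a letter so it can be written as \x{....}
alef <- "\u0627"
code_point <- sprintf("%04X", utf8ToInt(alef))
code_point  # Hex code point of alef, usable as \x{0627} in a pattern

# Hypothetical helper: drop a leading alef lam only if a letter follows it
strip_article <- function(x) {
  gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE)
}

al_kitab <- "\u0627\u0644\u0643\u062A\u0627\u0628"  # الكتاب "the book"
strip_article(al_kitab)      # Article removed, rest of the word kept
strip_article("\u0627\u0644")  # Bare alef lam has no following letter, so it stays
```

Both features need perl = TRUE in gsub, since the lookahead syntax and \p{L} come from the PCRE engine.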

library(tm) # Provides stripWhitespace()

normalize_arabic <- function(x) {
	text_temp <- x
	# Replace punctuation with spaces
	text_temp <- gsub("\\p{P}", " ", text_temp, perl = TRUE)
	# Remove non-letter, non-space characters, collapse extra spaces, remove leading space
	text_temp <- gsub('^ ', '', stripWhitespace(gsub('[^\\p{L}\\p{Zs}]', '', text_temp, perl = TRUE)))
	# Normalize alef variants (hamza, madda, wasla) to bare alef
	text_temp <- stripWhitespace(gsub('\\x{0623}|\\x{0622}|\\x{0625}|\\x{0671}|\\x{0672}|\\x{0673}', 'ا', text_temp, perl = TRUE))
	# Remove leading alef lam with optional leading waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0627}\\x{0644}(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove leading alef lam at start of string
	text_temp <- gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', text_temp, perl = TRUE)
	# Remove leading double lam with optional leading waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0644}{2,}(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove leading kaf alef lam with optional waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0643}\\x{0627}\\x{0644}(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove leading baa alef lam with optional waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0628}\\x{0627}\\x{0644}(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove leading faa alef lam with optional waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0641}\\x{0627}\\x{0644}(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove leading double alef with optional lam with optional leading waw
	text_temp <- gsub('\\p{Zs}\\x{0648}*\\x{0627}{2,}\\x{0644}*(?=\\p{L})', ' ', text_temp, perl = TRUE)
	# Remove trailing haa
	text_temp <- gsub('(?<=\\p{L})\\x{0647}(?=\\p{Zs})', ' ', text_temp, perl = TRUE)
	# Normalize trailing alef maksura to yeh
	text_temp <- gsub('(?<=\\p{L})\\x{0649}(?=\\p{Zs})', 'ي', text_temp, perl = TRUE)
	# Remove trailing yeh yeh noon
	text_temp <- gsub('(?<=\\p{L})\\x{064A}{2,}\\x{0646}(?=\\p{Zs})', '', text_temp, perl = TRUE)
	# Remove trailing yeh waw noon
	text_temp <- gsub('(?<=\\p{L})\\x{064A}\\x{0648}\\x{0646}(?=\\p{Zs})', '', text_temp, perl = TRUE)
	# Remove trailing haa or haa alef (second code point is alef, matching the stated intent)
	text_temp <- gsub('(?<=\\p{L})\\x{0647}\\x{0627}*(?=\\p{Zs})', '', text_temp, perl = TRUE)
	# Remove trailing haa meem and haa meem alef
	text_temp <- gsub('(?<=\\p{L})\\x{0647}\\x{0645}\\x{0627}*(?=\\p{Zs})', '', text_temp, perl = TRUE)
	# Remove single letters such as waw and those produced by above normalization
	text_temp <- gsub('(?<=\\p{Zs})\\p{L}(?=\\p{Zs})', '', text_temp, perl = TRUE)
	# Remove added, leading, trailing whitespace
	text_temp <- stripWhitespace(gsub('(\\p{Zs}$)|(^\\p{Zs})', '', text_temp, perl = TRUE))