Prompted by a tweet yesterday from Ella Wind, an editor at the great Arab commentary site Jadaliyya, I undertook the task of writing a very quick and dirty converter that takes Arabic or Persian text and converts it to the International Journal of Middle East Studies (IJMES) transliteration system (details here [PDF]). I’ve posted the actual converter here. It’s in very initial stages and I will discuss some of the difficulties of making it more robust below.

It’s nice that the IJMES has an agreed upon transliteration system; it makes academic work much more legible and minimizes quarrels about translation (hypothetically). For example, حسني مبارك (Hosni Mubarak) is transliterated as ḥusnī mubārak.

Transliterating, however, is a big pain. The transliterated characters are not in the ASCII character set [A-Za-z0-9] that is mostly used by English and other Western languages, and many of its characters are largely drawn from Unicode (e.g. ḥ). That means a lot of copy-pasta of individual Unicode characters from the character viewers in your OS or some text file that stores them.

When Ella posted the tweet, I thought that programming this would be a piece of cake. How hard would it be to write a character mapping and throw up a PHP interface? Well, it’s not that simple. There are a few problems with this.

1. Most Arabic writing does not include short vowels.

Arabic is a very precise language (I focus the rest of this article on Arabic because I don’t know much about Persian). There are no silent letters and vowels denote verb form and casing. But in most modern Arabic writing, short vowels are not written in because readers are expected to know them. For example, compare the opening of al-Faatiha in the Qu’ran with vowels:

بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ

to without them:

بسم الله الرحمن الرحيم

In the Qu’ran, all vowels are usually written. But this doesn’t occur in most modern books, signs, and especially newspaper and social media text.

So what does this mean for transliteration? Well, it means that you can’t transliterate words precisely unless the machine knows which word you’re going for. The average Arabic reader will know that بسم should be “bismi” and not “bsm.”

I can suggest two solutions to this problem: either use a robust dictionary that can map words without vowels to their voweled equivalent, or have some kind of rule set that determines which vowels must be inserted into the word. The former seems eminently more plausible than the latter, but even so, given the rules of Arabic grammar, it would be necessary to do some kind of part-of-speech tagging to determine the case endings of words (if you really want to know more about this twisted system, other people have explained this much better than I can). Luckily, most of the time we don’t really care about case endings.

In any case, short vowels are probably the biggest impediment to a fully automated system. The good news is that short vowels are ASCII characters (a, i, u) and can be inserted by the reader.

2. It is not simple to determine whether certain letters (و and ي) should be long vowels or consonants.

The letters و (wāw) and ي (yā’) play double duty in Arabic. Sometimes they are long vowels and sometimes they are consonants. For instance, in حسني (ḥusnī), yā’ is a long vowel. But in سوريا (Syria, or Sūriyā), it is a consonant. There is probably some logic behind when one of these letters is a long vowel and when it is a consonant. But the point is that the logic isn’t immediately obvious.

3. Handling dipthongs, doubled letters, and reoccurring constructions.

Here, I am thinking of the definite article ال (al-), dipthongs like وَ (au), and the shaddah ّ  which doubles letters. This means there probably has to be a look-ahead function to make sure that these are accounted for. Not the hardest thing to code in, but something to look out for nonetheless.

Those are the only things I can think of right now, although I imagine there are more lurking in the shadows that may jump out once one starts working on this. I may continue development on this, at least in an attempt to solve issues 2 and 3. Solving issue 1 is a task that will probably take some more thoughtful consideration.