I’ve jumped in on the development of the rewrite of TABARI, the automated coding system used to generate GDELT, and the Levant and KEDS projects before it. The new project, PETRARCH, is being spearheaded by the project leader Phil Schrodt and the development led by Friend of Bad Hessian John Beieler. PETRARCH is, hopefully, going to be more modular, written in Python, and have the ability to work in parallel. Oh, and it’s open-source.
One thing that I’ve been working on is the ability to extract features from newswire text that is not related to coding for event type. Right now, I’m working on numerical detection — extracting relevant numbers from the text and, hopefully, tagging it with the type of number that it is. For instance:
One Palestinian was killed on Sunday in the latest Israeli military operation in the Hamas-run Gaza Strip, medics said.
or, more relevant to my research and the current question at hand:
Hundreds of Palestinians in the Gaza Strip protested the upcoming visit of US President George W. Bush on Tuesday while demanding international pressure on Israel to end a months-old siege.
The question is, do any guidelines exist for converting words like “hundreds” (or “dozens”, “scores”, “several”) into numerical values? I’m not sure how similar coding projects in social movements have handled this. John has suggested the ranges used in the Atrocities Event Data (e.g. “several” = 5-24, “tens” = 50-99). What other strategies are there?