I’ve jumped in on the development of the rewrite of TABARI, the automated coding system used to generate GDELT, and the Levant and KEDS projects before it. The new project, PETRARCH, is being spearheaded by project leader Phil Schrodt, with development led by Friend of Bad Hessian John Beieler. PETRARCH is, hopefully, going to be more modular, written in Python, and able to work in parallel. Oh, and it’s open-source.

One thing that I’ve been working on is the ability to extract features from newswire text that are not related to coding for event type. Right now, I’m working on numerical detection: extracting relevant numbers from the text and, hopefully, tagging each one with the type of number that it is. For instance:

One Palestinian was killed on Sunday in the latest Israeli military operation in the Hamas-run Gaza Strip, medics said.

or, more relevant to my research and the current question at hand:

Hundreds of Palestinians in the Gaza Strip protested the upcoming visit of US President George W. Bush on Tuesday while demanding international pressure on Israel to end a months-old siege.

The question is, do any guidelines exist for converting words like “hundreds” (or “dozens”, “scores”, “several”) into numerical values? I’m not sure how similar coding projects in social movements have handled this. John has suggested the ranges used in the Atrocities Event Data (e.g. “several” = 5-24, “tens” = 50-99). What other strategies are there?
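
To make the range idea concrete, here is a minimal sketch of a lookup table. The only ranges in it are the two Atrocities Event Data examples quoted above; the names `VAGUE_RANGES` and `vague_to_range` are hypothetical and are not part of PETRARCH. Terms without a range given here are deliberately left unmapped rather than guessed.

```python
import re

# The two ranges quoted above from the Atrocities Event Data.
# Other vague terms ("hundreds", "dozens", "scores") are left out on
# purpose, because no range for them is given here.
VAGUE_RANGES = {
    "several": (5, 24),
    "tens": (50, 99),
}


def vague_to_range(sentence):
    """Return (term, low, high) for each mapped vague quantifier found."""
    found = []
    # Lowercase the sentence and look at one alphabetic token at a time.
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in VAGUE_RANGES:
            low, high = VAGUE_RANGES[token]
            found.append((token, low, high))
    return found


print(vague_to_range("Several protesters were injured in Gaza."))
```

Because this matches single tokens, a phrase like “tens of thousands” would be wrongly tagged as “tens”; a real version would probably need to match multiword phrases before single words.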

  • Jay Ulfelder

    I feel your pain. Maybe think about categories based on a logarithmic scale and assign keywords to bins based on those? So, for example, “several” and “a few” go into 1s, “dozens” and “scores” into 10s, “hundreds” and “several hundred” into 100s, “thousands” into 1,000s, “tens of thousands” into 10,000s.

    • That’s definitely a possible solution. And probably one that’d work reasonably well, too, if the goal is to construct variables based on the type of number. “Injured” numbers tend to be approximate, while deaths and monetary sums are generally more specific.
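
A rough sketch of that binning, assuming the keyword-to-bin assignments from Jay’s comment. The names `MAGNITUDE_BINS` and `magnitude_bins` are hypothetical, and the values are orders of magnitude (1s, 10s, 100s, and so on), not point estimates.

```python
import re

# Keyword-to-magnitude bins taken from the comment above.
MAGNITUDE_BINS = {
    "tens of thousands": 10_000,
    "several hundred": 100,
    "thousands": 1_000,
    "hundreds": 100,
    "dozens": 10,
    "scores": 10,
    "several": 1,
    "a few": 1,
}

# Try longer phrases first so "several hundred" wins over "several".
_PATTERN = re.compile(
    r"\b("
    + "|".join(
        re.escape(phrase)
        for phrase in sorted(MAGNITUDE_BINS, key=len, reverse=True)
    )
    + r")\b",
    re.IGNORECASE,
)


def magnitude_bins(sentence):
    """Return (phrase, magnitude) pairs for each quantifier phrase found."""
    return [
        (match.group(1).lower(), MAGNITUDE_BINS[match.group(1).lower()])
        for match in _PATTERN.finditer(sentence)
    ]


print(magnitude_bins("Hundreds of Palestinians in the Gaza Strip protested."))
```

Plain keyword matching has obvious limits: it cannot tell “scores” the quantifier from “scores” the verb, so it would probably need part-of-speech information to be usable on real newswire text.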

  • Jonathan

    You might do a survey of articles where the descriptive term, “scads” or whatever, is used next to an exact value, or where the exact value is available from another source. This is a lot of work, but it might be a publishable result in itself.

    • Could you give me an example of what you’re thinking?

  • Neal Caren

    Crafty how you hid the link to the PETRARCH github page in the “open source” link. Can I ask if you all tried and rejected Link Grammar or just went with NLTK? I’ve never used Link Grammar, but it looks pretty fast and powerful. I’ve found NLTK to be great for teaching and toy models, but quite slow and clunky for other purposes. That said, I’m excited that people are putting NLTK to use, so I can see how to use it more effectively.

    • Bad Hessian is all about crafty links.

      I don’t know if Link Grammar was considered. I’ll check it out and pass it on to John, though. Thanks for the recommendation, Neal!

    • John

      We had not considered Link Grammar, and I will definitely look into it. We’re trying to stay with Python if possible, and NLTK has actually proven to be one of the faster options. StanfordNLP takes around 4 minutes to parse 400 sentences, while the implementation of PETRARCH that uses the NLTK functionality takes a little over a minute for the same set. Granted, StanfordNLP does everything (NER, coreferencing, etc.) and is more accurate, but 100 sentences per minute is too slow when dealing with projects like GDELT.

      It’s a careful balance that we’re trying to strike between speed, accuracy, and usability/ease-of-contribution for others. Given that, I’m more than open to any suggestions about a better way to do things. I also want to try to avoid reinventing the wheel by writing our own functionality if there are other possibilities.

      • Neal Caren

        Link Grammar is written in C with Python bindings available, so it triples the hassle of a pure Python solution. Might not be a good solution.

        I suspect that you’ll have to train your own named entity recognizer for NLTK; the default one isn’t the best in my experience. I haven’t built one myself, but it seems very doable if you have a good training set. Of course, that is true in theory of many problems.