Armin Schmidt

March 13, 2007

know your unix tool box

Filed under: nlp — Armin @ 3:05 pm

Occasionally, you hear these horrific stories about CL students or even NLP professionals who edit text files with MS Word and calculate the number of lines by multiplying the number of pages by the estimated number of lines per page. While these are sad and (hopefully) rare extremes, I recently came across quite a few such error-prone inefficiencies, which not only lead to inaccurate results but, more importantly, make work troublesome and less fun. (Just to make this clear: I do not claim to be free of them, either …) Especially when handling large corpora, the first impulse of many people I know is to write a Perl script for everything they want to do with simple text files.

Let’s look at the corresponding tasks a little closer: text files constituting corpora are often line- and/or column-based. Typical tasks include counting lines or words, line-by-line comparisons, extracting a certain number of lines from the top or the bottom of a file, finding lines with particular content and extracting that content, comparing the first or second column of two files, and so on and so forth. It happens regularly that I get blank stares followed by screaming delight when I tell a colleague “You know, there is a Unix command that does exactly what you wrote your script for, only much faster and with more options.”

For those among you not quite aware of what this is all about (and with apologies to the others …), let me list a few simple real-world examples of commands I often use and find most helpful when processing large (and small) text files:

  • extract 100 lines from the top of a file: 'head -n 100'
  • extract 100 lines from the bottom of a file: 'tail -n 100'
  • compare the first 100 lines of two files: 'paste file1 file2 | head -n 100'
  • compare the second column of two files and look at the last 10 lines: 'paste <(cut -f2 file1) <(cut -f2 file2) | tail' (the <(…) syntax needs bash)
  • count the number of unique lines: 'sort -u file | wc -l' (uniq on its own only merges adjacent duplicates, so the input has to be sorted)
  • extract everything between ‘SENT’ and ‘SCORE’ on some lines and sort the result: 'grep -P -o "(?<=SENT).+?(?=SCORE)" file | sort'
  • delete the first word on each of the last 200 lines in a file: 'tail -n 200 file | cut -f1 --complement -d " "'
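
To see how the one-liners above behave, here is a small sketch that can be pasted into a shell. The file names, the sample content and the tab-separated layout are invented for illustration; 'grep -P' and 'cut --complement' assume the GNU versions of these tools.

```shell
# Scratch directory with invented sample files (tab-separated columns).
tmp=$(mktemp -d)
printf 'a\tone\nb\ttwo\nb\ttwo\nc\tthree\na\tone\n' > "$tmp/file1"
printf 'a\tuno\nb\tdos\nb\tdos\nc\ttres\na\tuno\n' > "$tmp/file2"
printf 'SENT foo SCORE 1\nSENT bar SCORE 2\n' > "$tmp/scored"
printf 'the quick fox\na lazy dog\n' > "$tmp/words"

# Top and bottom of a file.
first_two=$(head -n 2 "$tmp/file1")
last_two=$(tail -n 2 "$tmp/file1")

# Second columns of both files side by side; cut splits on tabs by default.
# (In bash, paste <(cut -f2 file1) <(cut -f2 file2) avoids the extra file.)
cut -f2 "$tmp/file1" > "$tmp/col1"
columns=$(cut -f2 "$tmp/file2" | paste "$tmp/col1" - | tail -n 2)

# Number of distinct lines: sort first, because uniq only merges
# duplicates that are adjacent.
unique_count=$(sort -u "$tmp/file1" | wc -l)

# Everything between SENT and SCORE, sorted (needs GNU grep for -P).
between=$(grep -P -o '(?<=SENT).+?(?=SCORE)' "$tmp/scored" | sort)

# Drop the first space-separated word of the last line (GNU cut).
no_first=$(tail -n 1 "$tmp/words" | cut -d ' ' -f1 --complement)

printf '%s\n' "$first_two" "$last_two" "$columns" "$unique_count" "$between" "$no_first"
rm -r "$tmp"
```

On this sample the count of distinct lines is 3, whereas 'uniq file1 | wc -l' would report 4, because the two identical 'a one' lines are not adjacent.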

Other very useful tools include join, recode (but be warned about its peculiarities), and, of course, bash itself. It almost hurts when I see someone writing a Perl script only in order to call a program for all files in a directory (‘$> for i in dir/*; do program "$i"; done’). If you know Python (which you should!), almost everything else can be done inside the interactive Python shell with only a couple of lines. Given that a majority of students and researchers in CL work on Unix systems (at least in Europe, I’m not sure about the States or Asia), perhaps these are things that should be taught at university as well?
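
The loop over all files in a directory mentioned above can be written a little more defensively. This is only a sketch: 'wc -l' stands in for the hypothetical 'program', and the scratch directory and file names are made up. Quoting the loop variable keeps file names with spaces in one piece.

```shell
# Invented sample directory; one file name contains a space on purpose.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/first file.txt"
printf 'three\n' > "$dir/second.txt"

report=$(
    for i in "$dir"/*; do
        [ -f "$i" ] || continue      # skip directories and unmatched globs
        lines=$(wc -l < "$i")        # stand-in for: program "$i"
        echo "$(basename "$i"): $((lines))"
    done
)
printf '%s\n' "$report"
rm -r "$dir"
```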

Of course, there is one twist: occasionally, you don’t want to be efficient but simply relax while doing something like DownArrow+Del+Del+Del a hundred times in a row. But it’s much nicer to choose for yourself when to be inefficient.

March 3, 2007

Summary: Sentence Splitters

Filed under: nlp — Armin @ 1:51 am

A couple of weeks ago I was searching the vastness of the web for tools for sentence boundary detection and also asked for hints on the Corpora list. I received some very helpful responses, which I would like to summarize and share. The initial point was that I wanted to sentence-align large parallel corpora in Russian, English, German, and Spanish. None of the rule-based tools I found covered all of these languages, which I thought was rather unfortunate because each tool is likely to make its own systematic mistakes. For the alignment task, this would be much less of a problem if the same systematic mistakes were made for all languages equally, which should be the case if the same algorithm is applied. Also, I had no lists of abbreviations as required for most rule-based sentence splitters, and the ones that I found often proved incomplete, especially since the domain of the corpora brought along its own terminology and abbreviations. For a similar task I had once implemented an algorithm myself which tried to extract abbreviations from larger texts based on the assumption that abbreviations:

  • occur with a dot
  • don’t normally occur without a dot
  • contain no vowels (not always the case)
  • are often preceded/followed by a token of a particular type, e.g. ‘ca.’ is normally followed by a numeral or number, ‘Ms.’ is followed by a name in upper case
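
The first three assumptions can be approximated with standard tools. The following is only a rough sketch and not the implementation mentioned above: the sample text is invented, tokens are split on whitespace, only ASCII vowels are checked, and the fourth assumption (typical neighbours) is left out.

```shell
tmp=$(mktemp -d)
cat > "$tmp/corpus.txt" <<'EOF'
Ms. Smith paid ca. 20 euros. Mr. Jones paid nothing.
It cost ca. 5 euros and Ms. Smith left. The end.
Nothing was left in the end.
EOF

# One token per line.
tr -s '[:space:]' '\n' < "$tmp/corpus.txt" > "$tmp/tokens"

# Word types seen with a trailing dot (dot stripped) and without one.
grep -E '^[[:alpha:]]+\.$' "$tmp/tokens" | sed 's/\.$//' | sort -u > "$tmp/with_dot"
grep -E '^[[:alpha:]]+$' "$tmp/tokens" | sort -u > "$tmp/without_dot"

# Keep types that never occur without a dot and that contain no vowel.
candidates=$(comm -23 "$tmp/with_dot" "$tmp/without_dot" | grep -v -i '[aeiou]')

printf '%s\n' "$candidates"
rm -r "$tmp"
```

On this toy text the sketch keeps 'Mr' and 'Ms' but drops 'ca' because of its vowel, which illustrates why the third assumption does not always hold.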

Another interesting and successful approach to the unsupervised extraction of abbreviations from larger corpora was taken by Jan Strunk and Tibor Kiss in their paper Unsupervised Multilingual Sentence Boundary Detection, where an abbreviation and its dot are regarded as being in a collocation relation that can be statistically learned. Jan was kind enough to provide me with a provisional implementation of the algorithm in Perl and also adapted it for use with Unicode. I provided a small change so that the program can also be used with ISO-8859-5 encoded Russian data.
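
To give a feel for what it means to treat a word and its dot as a collocation, here is a crude sketch that merely measures how often each word type is directly followed by a dot. It is not the statistic used in the paper, which is more elaborate; the sample text, the ASCII-only token pattern and the minimum frequency of 2 are assumptions made for this illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/corpus.txt" <<'EOF'
Ms. Smith paid ca. 20 euros. Mr. Jones paid nothing.
It cost ca. 5 euros and Ms. Smith left. The end.
Nothing was left in the end.
EOF

# Output columns: word, count with dot, total count, share with dot.
scores=$(
    tr -s '[:space:]' '\n' < "$tmp/corpus.txt" |
    awk '
        /^[A-Za-z]+\.?$/ {
            word = $0
            dotted = sub(/\.$/, "", word)   # 1 if a final dot was removed
            total[word]++
            with_dot[word] += dotted
        }
        END {
            for (w in total)
                if (total[w] >= 2)
                    printf "%s\t%d\t%d\t%.2f\n", w, with_dot[w], total[w], with_dot[w] / total[w]
        }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k4,4nr -k1,1
)

printf '%s\n' "$scores"
rm -r "$tmp"
```

In this tiny sample 'end' gets the same share as 'Ms' and 'ca' simply because it happens to close two sentences, which shows why a raw ratio is not enough and why a proper significance measure over a large corpus is needed.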

Here is a list of other tools for sentence boundary detection that I have come across (thanks for all the hints):

  1. AOT provides a sentence splitter together with morphological, syntactic, and semantic analyzers for Russian. The splitter can be downloaded here (source in C++, dll is included in http://aot.ru/download/shortrml.zip). Unfortunately, both the site and the documentation are almost entirely in Russian.
  2. Sebastian Nagel has a fast rule-based sbd for German, Russian, and English. Download here.
  3. Scott Piao’s sentence splitter in Java for English: http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
  4. The SRI LM toolkit provides some tools for preprocessing tasks. You’d probably have to download and install the whole software and refer to the manpages.
  5. If you have lists of abbreviations ready, you can use them with Mickel Grönroos’ Python script.
  6. If you have sentence-annotated training data, you can use the SATZ system by Palmer and Hearst.
  7. Patrick Tschorn’s SbEditor/Niffler uses inductive logic programming for sbd in German text.

I’d be happy to extend this list, so, please, feel free to suggest more.

February 12, 2007

Packaging major NLP tools?

Filed under: nlp — Armin @ 10:59 pm

A colleague from university once claimed that good scientists often tend to be not so good programmers. There are a couple of things one could say in response: that perhaps they don’t need to be, that there are in fact many people who are both, and that this tendency is just as true the other way round. After all, each of the two fields is broad and its mastery requires long and continuous occupation, so why not stick to what you’re good at? One thing, though, puzzles me from time to time: there are a number of tools and packages that are used constantly for certain tasks in NLP, yet whose installation and sometimes usage is so inconvenient that I truly wonder why no one has ever made an effort to improve them.

Take, for instance, the SRI Language Modeling Toolkit – it is still under development but nevertheless widely used for building statistical language models and is required by several other projects, e.g. Moses. But try having it installed by someone without much experience in compiling C/C++ code, perhaps on a machine without administrative rights – [I'm cutting this rantish story for the sake of fairness]. A similar story could be told about many other tools that are widely used.

So I think what is needed is an NLP repository that has ready-to-install packages in .deb (and perhaps .rpm) format for most of the major architectures. This could be integrated into something like debian-science, but not necessarily so. There have been similar individual attempts within particular projects, e.g. with tools for machine translation, but they certainly lack the bigger framework and are restricted to only a few architectures.

Is there any project like this? And if not, how, if at all, should one be started?

February 11, 2007

Everybody likes screenshots!

Filed under: nlp, technology — Armin @ 1:59 pm

Nothing grand, I suppose, but I embellished the project page of our little kwic-finder The Phrasehunter and added a couple of screenshots. I hope I’ll find some time in the near future to actually do some code work on it, too.

February 10, 2007

Machine Translation Marathon in Edinburgh

Filed under: nlp — Armin @ 5:57 pm

I hope I’m not spilling other people’s beans, but it looks like there is going to be a kind of spring school for students on (statistical) machine translation, organized by Philipp Koehn at the University of Edinburgh in April within the framework of the EuroMatrix project. It will probably include several lectures as well as hands-on lab sessions and open source development workshops. Andreas Eisele from DFKI asked me and some other people if we would be interested and, well, how I would! It looks like the perfect round-off for the work on my BA thesis, and perhaps there might even be something in it for my internship at XRCE. There are three problems, though: the date of the event has not been set yet, but it might fall into the second part of April, which is when K. and I will already be watching the Alps from the French side. Another thing is that I simply don’t have any money to afford such a trip at the moment. And last, but perhaps most important, is the fact that, by then at the latest, I expect a short break will do me good.
