… when you mess with the environment, man! It’s the beginning of March and we had a mosquito in our bedroom! It bit me right in my behind, probably just to make the point. Of all the catastrophic threats that await us – floods, hurricanes, and whatnot – these little beasts I fear most. Let’s just hope that whatever place we may live in, it will be a desert rather than a swamp.
March 7, 2007
March 5, 2007
TODO
1. Check my own scripts and their output more systematically, even, and especially, if they are for one-time use only.
2. Test hypotheses before attempting to work with them. Think things all the way through, not just halfway.
ps: Dear reader, please, don’t ask …
March 3, 2007
Summary: Sentence Splitters
A couple of weeks ago I was searching the vastness of the web for sentence boundary detection tools and also asked for hints on the corpora list. I received some very helpful responses, which I would like to summarize and share. The initial point was that I wanted to sentence-align large parallel corpora in Russian, English, German, and Spanish. None of the rule-based tools I found covered all of these languages, which I thought was rather unfortunate because each tool is likely to make its own systematic mistakes. For the alignment task, this would be much less of a problem if the same systematic mistakes were made for all languages equally, which should be the case if the same algorithm is applied to all of them. Also, I had no lists of abbreviations, as required by most rule-based sentence splitters, and the ones I found often proved incomplete, especially since the domain of the corpora brought along its own terminology and abbreviations. For a similar task I had once implemented an algorithm myself which tried to extract abbreviations from larger texts based on the assumption that abbreviations:
- occur with a dot
- don’t normally occur without a dot
- contain no vowels (not always the case)
- are often preceded/followed by a token of a particular type, e.g. ‘ca.’ is normally followed by a numeral or number, ‘Ms.’ is followed by a name in upper case
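The first three assumptions above can be sketched roughly as follows. This is only a minimal illustration in Python, not the original implementation: the function name `abbreviation_candidates`, the `min_count` threshold, the doubling bonus for vowel-less tokens, and the Latin-only vowel set are all assumptions made for the example. The fourth assumption (typical neighbouring tokens) is left out.

```python
from collections import Counter

# Assumption: Latin vowels only; other scripts would need their own set.
VOWELS = set("aeiouAEIOU")


def abbreviation_candidates(text, min_count=2):
    """Score tokens that look like abbreviations (hypothetical heuristic)."""
    with_dot = Counter()
    without_dot = Counter()
    for tok in text.split():
        # Strip surrounding punctuation, but keep a trailing dot.
        core = tok.strip("()[]\"',;:!?")
        if not core:
            continue
        if core.endswith("."):
            word = core[:-1]
            if word.isalpha():
                with_dot[word] += 1
        elif core.isalpha():
            without_dot[core] += 1

    candidates = {}
    for word, n_dot in with_dot.items():
        if n_dot < min_count:
            continue  # too rare to say anything
        if without_dot[word] > 0:
            continue  # also seen without a dot: probably an ordinary word
        score = n_dot
        if not set(word) & VOWELS:
            score *= 2  # soft bonus only, since "no vowels" often fails
        candidates[word] = score
    return candidates
```

Ordinary words at the end of a sentence also carry a dot, which is why the check against dot-less occurrences matters; on small texts the result will still be noisy.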
Another interesting and successful approach to the unsupervised extraction of abbreviations from larger corpora was taken by Jan Strunk and Tibor Kiss in their paper Multilingual Unsupervised Sentence Boundary Detection, where an abbreviation and its dot are regarded as being in a collocation relation that can be learned statistically. Jan was kind enough to provide me with a provisional implementation of the algorithm in Perl and also adapted it for use with Unicode. I contributed a small change so that the program can also be used with ISO-8859-5 encoded Russian data.
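To give an idea of what "collocation with the dot" can mean in practice, here is a small sketch in Python that scores each word by the plain log-likelihood ratio (G²) of the 2x2 table "this word / other words" versus "followed by a dot / not followed by a dot". It is an assumption-laden illustration, not Jan's Perl implementation; the scoring in the paper may well differ from this textbook statistic, and the function names are made up for the example.

```python
import math
from collections import Counter


def _term(k, row, col, n):
    # One cell of the G^2 sum; empty cells contribute nothing.
    return k * math.log(k * n / (row * col)) if k > 0 else 0.0


def dot_collocation_scores(tokens):
    """Return {word: G^2} for words positively associated with a final dot."""
    total = Counter()   # occurrences of the word, with or without a dot
    dotted = Counter()  # occurrences of the word directly followed by a dot
    for tok in tokens:
        has_dot = tok.endswith(".") and len(tok) > 1
        word = tok[:-1] if has_dot else tok
        total[word] += 1
        if has_dot:
            dotted[word] += 1

    n = sum(total.values())
    n_dot = sum(dotted.values())
    scores = {}
    for word, k11 in dotted.items():
        k12 = total[word] - k11      # this word, no dot
        k21 = n_dot - k11            # other words, with dot
        k22 = n - total[word] - k21  # other words, no dot
        r1, r2 = k11 + k12, k21 + k22
        c1, c2 = k11 + k21, k12 + k22
        if k11 * n <= r1 * c1:
            continue  # dot occurs no more often than chance predicts
        scores[word] = 2 * (
            _term(k11, r1, c1, n) + _term(k12, r1, c2, n)
            + _term(k21, r2, c1, n) + _term(k22, r2, c2, n)
        )
    return scores
```

A word that nearly always appears with its dot gets a high score, while a word that merely ends a sentence now and then scores close to zero.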
Here is a list of other tools for sentence boundary detection that I have come across (thanks for all the hints):
- AOT provides a sentence splitter together with morphological, syntactic, and semantic analyzers for Russian. The splitter can be downloaded here (source in C++, dll is included in http://aot.ru/download/shortrml.zip). Unfortunately, the site as well as the documentation is almost entirely in Russian.
- Sebastian Nagel has a fast rule-based sbd for German, Russian, and English. Download here.
- Scott Piao’s sentence splitter in Java for English: http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool
- The SRI LM toolkit provides some tools for preprocessing tasks. You’d probably have to download and install the whole software and refer to the manpages.
- If you have lists of abbreviations ready, you can use them with Mickel Grönroos’ Python script.
- If you have sentence-annotated training data, you can use the SATZ system by Palmer and Hearst.
- Patrick Tschorn’s SbEditor/Niffler uses inductive logic programming for sbd in German text.
I’d be happy to extend this list, so, please, feel free to suggest more.
March 1, 2007
I’m calling to you, China!
For a week or so now I have been busy preparing the UN parallel corpora which I want to use for my BA thesis. For each language (English, Spanish, Russian) I had one very large file in which documents were separated by lines of a particular form. The first thing I did was convert everything from UTF-8 to the proper ISO-8859 encodings in order to reduce file size and also because the sentence aligner I’m using doesn’t handle Unicode correctly. I split up the corpus into several smaller blocks and ran them through the sentence boundary detector, tokenizer-normalizer, etc. The first run of the sentence aligner returned very bad results, so I checked the data and found that my parallel corpora didn’t seem very parallel after all – in fact, the number of documents differed, as did the documents themselves. Sometimes there were whole paragraphs in one document that couldn’t be aligned with any sentence in the other language (I was actually very surprised to see how well hunalign did at not producing random alignments). I then quite cumbersomely matched all documents, trying to sort out what didn’t seem to fit. The results were better but still bad – after filtering out improbable alignments I was left with about 5% of the original data mass. I checked back with my supervisor and suddenly, and rather accidentally, we found that it’s all someone else’s fault (which didn’t make it much better, though …): I had used the Unix command recode for the conversion from UTF-8 to ISO-8859-5 for Russian, which apparently discarded whatever it had in its cache every time it wasn’t able to convert something correctly. As a consolation for the wasted time (such things are especially wonderful with a quickly approaching deadline) my supervisor pointed me to these words of ease (really funny – other systems certainly aren’t any better, but …).
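In hindsight, a conversion step that reports what it cannot convert, instead of silently dropping text, would have exposed the problem right away. Here is a minimal sketch of that idea in Python rather than recode; the function names are made up for illustration, and replacing unconvertible characters with `?` is just one possible policy.

```python
def can_encode(ch, encoding):
    """True if the single character ch exists in the target encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_reporting(lines, encoding="iso-8859-5"):
    """Encode lines one by one and report every unconvertible character.

    Returns (encoded_lines, problems), where problems is a list of
    (line_number, sorted_list_of_offending_characters).
    """
    encoded, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            encoded.append(line.encode(encoding))
        except UnicodeEncodeError:
            bad = sorted({ch for ch in line if not can_encode(ch, encoding)})
            problems.append((line_no, bad))
            # Keep the rest of the line; unconvertible characters become "?".
            encoded.append(line.encode(encoding, errors="replace"))
    return encoded, problems
```

Comparing the number of lines (or documents) before and after conversion is a cheap additional check, since the output here always has exactly as many lines as the input.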
So I’d like to issue a call: China! I may not think well of you and your obstinate and voracious economic ambitions, your limited ethical understanding and restrictive administration. But in this very case, I believe in your influential technological striving and your stubbornness. Rid us of the misery of single-byte encodings as the de facto standard! No one else has both the power and the need to end, once and for all, the constant struggle of programmers in an international environment who waste their time fiddling with encodings when they could instead be writing software that saves lives. End the domination of Anglo-centric character sets and the counterproductive coexistence of incompatible single-byte encodings! Unicode always! Use your growing power for something that does the world good, just this once! In return, I will buy you a good German beer – I know you’d like that.
February 15, 2007
Teething Troubles x 2
After the last few weeks brought several verrry good pieces of news at once, a cool sobriety has meanwhile set in again. For one thing, there is the bachelor’s thesis, whose topic will be “Statistical Machine Translation between New Language Pairs via Multiple Intermediaries”. Roughly speaking, the idea is to implement an SMT system that translates from Russian into German and, because Russian-German parallel corpora are scarce, does so via two (later three) intermediate languages. In my case these are English and Spanish. The examination regulations allow exactly six weeks for the complete thesis, and although I have only just started, I am already well behind schedule. The schedule itself I drew up in no time:
- Install the necessary software. Read the Moses documentation: ~4 days
- Prepare the corpora & train phrase tables: 6-8 days
- Adapt the Moses decoder to the new functionality: ~10 days
- Evaluation: ~1 day
- Writing the thesis: 10 days
At the moment, as my last entry may already have suggested, I am somehow still stuck at item 1, more precisely at the first part of item 1. While waiting for replies to my support requests, I even made a brief attempt to read my way into the Moses source code, but quickly had to admit that I don’t really know where to start. I just need to change one particular spot in the decoder, but first I have to find that spot! Sure, the documentation could be more detailed in many places, but with a project this large the general question for me is: how do you actually approach something like this? (Please don’t be shy with good advice!)
The next surprise was that work on item 2 of my plan is delayed as well, because I no longer seem to have access to my account at DFKI. That is where the UN parallel corpora I want to use are stored. Too bad!
To make up for it, I got to visit the German department in Heidelberg’s beautiful old town today, because next semester I will be studying German philology. Or will I? Well, here’s the thing: to be allowed to do the internship at XRCE, you have to be a student, i.e. enrolled. Since I will receive my bachelor’s degree this semester and won’t start the master’s until October, I am supposedly not a student during the six months in between. For the sake of international bureaucracy I am now running from here to there to get forms signed that will never reach the hands of anyone who actually knows what they are for and whether they are needed at all. Only now do I understand why bureaucracy is so sluggish (and deserves its bad reputation): because, out of ignorance and misplaced respect, nobody dares to roll up their sleeves and do what, regulation or no regulation, simply makes sense.
February 12, 2007
Packaging major NLP tools?
A colleague from university once claimed that good scientists often tend to be not-so-good programmers. There are a couple of things one could say in response: that perhaps they don’t need to be, that there are in fact many people who are both, and that the tendency is just as true the other way round. After all, each of the two fields is broad and mastering it requires long and continuous practice, so why not stick to what you’re good at? One thing, though, puzzles me from time to time: there are a number of tools and packages that are used constantly for certain tasks in NLP, yet whose installation and sometimes usage is so inconvenient that I truly wonder why no one has ever made any effort to improve them.
Take, for instance, the SRI Language Modeling Toolkit – it is still under development but nevertheless widely used for building statistical language models and is required by several other projects, e.g. Moses. But try having it installed by someone without much experience in compiling C/C++ code, perhaps on a machine without administrative rights – [I'm cutting this rantish story for the sake of fairness]. A similar story could be told about many other tools that are widely used.
So I think what is needed is an NLP repository that has ready-to-install packages in .deb (and perhaps .rpm) format for most of the major architectures. This could be integrated into something like debian-science, but not necessarily. There have been similar individual attempts within particular projects, e.g. with tools for machine translation, but they lack the bigger framework and are restricted to only a few architectures.
Is there any project like this? And if not, how, if at all, should one be started?
February 11, 2007
Everybody likes screenshots!
Nothing grand, I suppose, but I embellished the project page of our little kwic-finder The Phrasehunter and added a couple of screenshots. I hope I’ll find some time in the near future to actually do some code work on it, too.
February 10, 2007
Machine Translation Marathon in Edinburgh
I hope I’m not spilling other people’s beans, but it looks like there is going to be a kind of spring school for students on (statistical) machine translation, organized by Philipp Koehn at the University of Edinburgh in April within the framework of the EuroMatrix project. It will probably include several lectures as well as hands-on lab sessions and open source development workshops. Andreas Eisele from DFKI asked me and some other people if we would be interested and, well – how I would! It looks like the perfect way to round off the work on my BA thesis, and perhaps there might even be something in it for my internship at XRCE. There are three problems, though: the date of the event has not been set yet, but it might fall into the second half of April, which is when K. and I will already be watching the Alps from the French side. Another thing is that I simply don’t have the money for such a trip at the moment. And last but perhaps most important, by then at the latest I expect a short break will do me good.
February 6, 2007
Silence
Silence. For two weeks I clicked my way through comprehensible and incomprehensible PDF files and filled some sixty pages with questions and answers, keywords, formulas, diagrams and matrices, structure trees and architecture schemas. I racked my brain over unclear derivations and had headaches and stomach trouble for days on end. All for half an hour of relaxed conversation with two friendly ladies.
This morning I slept in and had a late and ample breakfast. I showered at leisure and shaved thoroughly. Then I brewed an infusion of one of the most intense green teas available in this country, a gift from K., sat down on the sofa and ate a piece of high-percentage chocolate. The pointed notes of Angel Song created an atmosphere of relaxed concentration. I skimmed my notes one last time and drank up.
The exam went without incident; the result was very good.
What did this half hour mean? The summing-up of three and a half years of study. Not a quiz on subject knowledge, but an examination of personal abilities. After it, a plateau whose existence one was always sure of, but which was always too far away to truly grasp. I see this number in front of me, my grade, and know it could hardly have been better. It tells me nothing.
January 29, 2007
The one word that saved the day …
… or, I should rather say, the year, was “ACCEPTED”: for an internship at the Xerox Research Center Europe in Grenoble, that is. That’s right, Armin is going to France! What this means is, first of all, what my prospective supervisors paraphrased as “a total experience in the Machine Translation Team”. The internship will be about cross-linguistic information retrieval and split into two parts – one more in the realm of development and one more on the theoretical side. I needn’t say that I’m very excited and also a bit nervous. But learning is, as far as I’m concerned, always about asking just a little too much of oneself.
The best news is, though, that my ladyfriend is coming with me! We both applied (almost) independently for two different internships at XRCE, both of which happened to fit our profiles and interests extremely well and, luckily, both got accepted. So, apart from the technological and scientific aspect, this trip will just as well be a total experience of the Alps, wine, cheese, and bavarder français. It’s all during summer time, Lyon isn’t far. The Mediterranean isn’t, either. Realistically speaking, I hope we won’t be programming only …