For a week or so now I have been busy preparing the UN parallel corpora which I want to use for my BA thesis. For each language (English, Spanish, Russian) I had one very large file in which documents were separated by lines of a particular form. The first thing I did was convert everything from UTF-8 to the appropriate ISO-8859 encodings, both to reduce file size and because the sentence aligner I’m using doesn’t handle Unicode correctly. I split the corpus into several smaller blocks and ran them through the sentence boundary detector, the tokenizer-normalizer, etc. The first run of the sentence aligner returned very bad results, so I checked the data and found that my parallel corpora didn’t seem very parallel after all – in fact, the number of documents differed, as did the documents themselves. Sometimes there were whole paragraphs in one document that couldn’t be aligned with any sentence in the other language (I was actually very surprised to see how well hunalign did at not producing more or less random alignments). I then quite laboriously matched up all the documents, trying to sort out what didn’t seem to fit. The results were better but still bad – after filtering out improbable alignments I was left with about 5% of the original data. I checked back with my supervisor, and suddenly and rather accidentally we found that it was all someone else’s fault (which didn’t make it much better, though …): I had used the Unix command recode to convert the Russian text from UTF-8 to ISO-8859-5, and it apparently discarded whatever it had in its cache every time it wasn’t able to convert something correctly. As a consolation for the wasted time (such things are especially wonderful with a quickly approaching deadline) my supervisor pointed me to these words of ease (really funny – other systems certainly aren’t any better, but …).
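In hindsight, a conversion step that reports what it cannot convert, instead of silently dropping text, would have saved me a lot of time. The following is only a minimal sketch of that idea in Python, not what recode does internally; the function name `convert_strict` and the choice to substitute `?` for unconvertible characters are my own assumptions for illustration.

```python
def convert_strict(text, target="iso8859_5"):
    """Encode text line by line and report every line that does not fit.

    Returns the encoded bytes plus a list of (line number, offending
    characters) pairs. The number of lines is always preserved, so the
    document separators stay in sync across languages.
    """
    converted = []
    failures = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        try:
            converted.append(line.encode(target))
        except UnicodeEncodeError as err:
            # Record the first unconvertible run on this line ...
            failures.append((lineno, line[err.start:err.end]))
            # ... and keep the rest of the line instead of discarding it.
            converted.append(line.encode(target, errors="replace"))
    return b"\n".join(converted), failures


if __name__ == "__main__":
    # The en dash in the second line has no code point in ISO-8859-5.
    sample = "Привет мир\nООН – 2008\nконец"
    data, failures = convert_strict(sample)
    print(failures)
```

Checking the `failures` list (or simply comparing line counts before and after conversion) would have shown right away that the Russian file had lost text.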
So I’d like to make an appeal: China! I may not think well of you and your obstinate and voracious economic ambitions, your limited ethical understanding and your restrictive administration. But in this very case, I believe in your influential technological striving and your stubbornness. Rid us of the misery of single-byte encodings as the de facto standard! No one else has both the power and the need to end, once and for all, the constant struggle of programmers in an international environment who waste their time fiddling with encodings when they could instead be writing software that saves lives. End the domination of Anglo-centric character systems and the counterproductive coexistence of incompatible single-byte encodings! Unicode always! Use your growing power for something that does the world good, just this once! In return, I will buy you a good German beer – I know you’d like that.