Armin Schmidt

March 1, 2007

I’m calling to you, China!

Filed under: technology — Armin @ 11:50 pm

For a week or so now I have been busy preparing the UN parallel corpora which I want to use for my BA thesis. For each language (English, Spanish, Russian) I had one very large file in which documents were separated by lines of a particular form. The first thing I did was convert everything from UTF-8 to the proper ISO-8859 encodings, in order to reduce file size and also because the sentence aligner I’m using doesn’t handle Unicode correctly. I split up the corpus into several smaller blocks and ran them through the sentence boundary detector, tokenizer-normalizer, etc. The first run of the sentence aligner returned very bad results, so I checked the data and found that my parallel corpora didn’t seem very parallel after all – in fact, the number of documents differed, as did the documents themselves. Sometimes there were whole paragraphs in one document that couldn’t be aligned with any sentence in the other language (I was actually very surprised to see how well hunalign managed to avoid producing random alignments). I then quite laboriously matched all the documents, trying to sort out what didn’t seem to fit. The results were better but still bad – after filtering out improbable alignments I was left with about 5% of the original data. I checked back with my supervisor and suddenly, and rather accidentally, we found that it was all someone else’s fault (which didn’t make it much better, though …): I had used the Unix command recode for the conversion from UTF-8 to ISO-8859-5 for Russian, and it apparently discarded whatever it had in its cache every time it wasn’t able to convert something correctly. As a consolation for the wasted time (such things are especially wonderful with a quickly approaching deadline) my supervisor pointed me to these words of ease (really funny – other systems certainly aren’t any better, but …).
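For anyone facing the same conversion, here is a minimal sketch of how the silent data loss could have been caught, assuming Python 3 is at hand. It is not what recode does internally, and the function names `unencodable_chars` and `convert_lines` are made up for illustration. The idea is simply to convert line by line, keep every line (substituting `?` for characters that ISO-8859-5 cannot represent), and report which lines were affected so they can be inspected before alignment.

```python
def unencodable_chars(text, target="iso8859_5"):
    """Return the distinct characters of `text` that `target` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


def convert_lines(lines, target="iso8859_5"):
    """Encode each line to `target` without ever dropping a line.

    Returns (converted, problems): `converted` holds one bytes object per
    input line, and `problems` lists (line_number, offending_characters)
    for every line that needed a '?' substitution.
    """
    converted = []
    problems = []
    for number, line in enumerate(lines, start=1):
        bad = unencodable_chars(line, target)
        if bad:
            problems.append((number, bad))
        # errors="replace" substitutes '?' instead of raising or discarding.
        converted.append(line.encode(target, errors="replace"))
    return converted, problems


# Hypothetical sample data: the euro sign does not exist in ISO-8859-5.
sample = ["Привет, мир", "Цена: 5 €", "plain ascii"]
converted, problems = convert_lines(sample)
print(len(converted), "lines converted;", len(problems), "with problems")
```

Because the number of output lines always equals the number of input lines, the document boundaries in the three language files stay in step, which is exactly what went wrong in my case.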

So I’d like to make an appeal: China! I may not think well of you and your obstinate and voracious economic ambitions, your limited ethical understanding and restrictive administration. But in this very case, I believe in your influential technological striving and your stubbornness. Rid us of the misery of single-byte encodings as the de facto standard! No one else has both the power and the need to end, once and for all, the constant struggle of programmers in international environments, who waste their time fiddling with encodings when they could instead be writing software that saves lives. End the domination of Anglo-centric character systems and the counterproductive coexistence of incompatible single-byte encodings! Unicode always! Use your growing power for something that does the world good, just this once! In return, I will buy you a good German beer – I know you’d like that.

:)

February 11, 2007

Everybody likes screenshots!

Filed under: nlp, technology — Armin @ 1:59 pm

Nothing grand, I suppose, but I embellished the project page of our little KWIC finder, The Phrasehunter, and added a couple of screenshots. I hope I’ll find some time in the near future to actually do some code work on it, too.

January 13, 2007

Linux and publicity

Filed under: technology — Armin @ 9:08 pm

Germany’s Federal Agency for Civic Education (BPB) has just published a dossier on Open Source, containing several informative articles on the background, history, and idea of Open Source and its applicability to other areas of public life such as culture, media, and education. It also features an interview with Larry Sanger about Wikipedia and his new project Citizendium. This is but one step towards raising more interest in the topic among people, mainly students, who are not IT professionals.

But what’s it worth? Even now, as distributions like Ubuntu provide desktop systems which compare well with Windows in usability and support, it is still common for people who are not IT professionals (or geeks) never to have heard of Linux or any of its distros. No surprise here. When Windows Vista was released this month, every major newspaper published loads of articles, only a few of which actually contained any substantial information. I cannot remember any particularly interesting reactions in newspapers not specialized in technology when a Linux distro or a related major technology released a new version. Searching the well-known and important newspaper Süddeutsche for the term “Windows” I get 780 results at the time of writing, whereas a search for “Linux” returns only 188 hits. A query for “Ubuntu” returns a mere 8 hits. Do these results seriously reflect the impact Open Source software has today? I would think not. Regarding servers, Linux is way ahead of anything else. Also, I would bet that, in science and at universities, most people use Unix and its derivatives for at least some of their work (I’d be grateful for any hints regarding specific studies and statistics).

What it does reflect, though, is how much money Microsoft is able to put into advertising its software, and the power it gains by means of such shady marketing methods as selling cheap licences to original equipment manufacturers (OEMs). This is, of course, money that none of the Linux distributions, being non-commercial organizations, can possibly afford. Still, I think that individual Linux distros should put much more effort into spreading the word among common home users as a target group, who usually use their computers for nothing but reading and writing email, surfing the web, or composing documents with office programs. Such efforts should make Free Software much more visible. They should also provide names people can remember and identify with (compare the expressiveness of the term “windows” to that of “kde”, “x.org” or, after all, “Linux”, and you see what I mean …). The technologically interested gather information from sources like Heise Online (German). But what still surprises me is that the common press seems to be mostly blind to what’s going on in this field. Shouldn’t journalists have a certain commitment to providing balanced, unbiased information? Or is this view too naive? I am sure that much of the prejudice against Free Software is still due to simple unawareness and to the impression of many that only nerds do not use Windows, for whatever reasons these freaks might have.
