Armin Schmidt

March 13, 2007

know your unix tool box

Filed under: nlp — Armin @ 3:05 pm

Occasionally, you hear horror stories about CL students or even NLP professionals who edit text files in MS Word and calculate the number of lines by multiplying the number of pages by the estimated number of lines per page. While these are sad and (hopefully) rare extremes, I recently came across quite a few such error-prone inefficiencies, which not only lead to inaccurate results but, more importantly, make work troublesome and less fun. (Just to make this clear: I do not claim to be free of them, either …) Especially when handling large corpora, the first impulse of many people I know is to write a Perl script for everything they want to do with simple text files.

Let’s look at the corresponding tasks a little closer: text files constituting corpora are often line- and/or column-based. Typical tasks include counting lines or words, line-by-line comparisons, extracting a certain number of lines from the top or the bottom of a file, finding lines with a particular content and extracting that content, comparing the first or second column of two files, and so on and so forth. It happens regularly that I get blank stares followed by screaming delight when I tell a colleague “You know, there is a Unix command that does exactly what you wrote your script for, only much faster and with more options.”

For those among you not quite aware of what this is all about (and with apologies to the others …), let me list a few simple real-world examples of commands I often use and find most helpful when processing large (and small) text files:

  • extract 100 lines from the top of a file: 'head -n 100'
  • extract 100 lines from the bottom of a file: 'tail -n 100'
  • compare the first 100 lines of two files: 'paste file1 file2 | head -n 100'
  • compare the second column of two files and look at the last 10 lines: 'paste <(cut -f2 file1) <(cut -f2 file2) | tail'
  • count the number of unique lines: 'sort file | uniq | wc -l' (uniq only collapses adjacent duplicates, so sort first)
  • extract everything between ‘SENT’ and ‘SCORE’ on some lines and sort the result: 'grep -P -o "(?<=SENT).+?(?=SCORE)" | sort'
  • delete the first word on each of the last 200 lines in a file: 'tail -n 200 file | cut -f1 --complement -d " "'
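
As a minimal sketch of the counting and column-comparison items from the list, the following script runs them on two tiny tab-separated files. The file contents and variable names are made up for illustration, and temporary files stand in for bash process substitution so that it also runs under a plain POSIX sh. It is meant to show why the input should be sorted before uniq, and that pasting the two second columns side by side makes differences easy to spot.

```shell
#!/bin/sh
# Hypothetical sample data: two small tab-separated files in a scratch directory.
tmpdir=$(mktemp -d)
printf 'a\t1\nb\t2\na\t1\nc\t3\n' > "$tmpdir/file1"
printf 'a\t1\nb\t9\na\t1\nc\t3\n' > "$tmpdir/file2"

# uniq only collapses ADJACENT duplicate lines, so on unsorted input
# the repeated line "a<TAB>1" is not detected as a duplicate.
unsorted_count=$(uniq "$tmpdir/file1" | wc -l | tr -d ' ')

# Sorting first brings identical lines together before uniq sees them.
unique_count=$(sort "$tmpdir/file1" | uniq | wc -l | tr -d ' ')

# Extract the second column of each file, then put the columns side by side.
cut -f2 "$tmpdir/file1" > "$tmpdir/col1"
cut -f2 "$tmpdir/file2" > "$tmpdir/col2"
paste "$tmpdir/col1" "$tmpdir/col2" | tail

# Count the lines on which the two second columns differ.
differing=$(paste "$tmpdir/col1" "$tmpdir/col2" | awk -F '\t' '$1 != $2' | wc -l | tr -d ' ')

echo "uniq alone:      $unsorted_count lines"
echo "sort | uniq:     $unique_count lines"
echo "differing lines: $differing"

rm -rf "$tmpdir"
```

With this sample data, uniq alone reports 4 lines while sort | uniq reports 3, and exactly one line differs in the second column.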

Other very useful tools include join, recode (but be warned about its peculiarities), and, of course, bash itself – it almost hurts when I see someone writing a Perl script only in order to call a program for all files in a directory (‘for i in dir/*; do program "$i"; done’). If you know Python (which you should!), almost everything else can be done inside the interactive Python shell with only a couple of lines. Given that a majority of students and researchers in CL work on Unix systems (at least in Europe; I’m not sure about the States or Asia), perhaps these are things that should be taught at university as well?
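
The for loop mentioned above could look like this when written out as a small script. This is only a sketch: wc -l stands in for the hypothetical "program", and the scratch directory and file names are invented for the example. The one detail worth copying is the quoting of "$f", which keeps file names containing spaces in one piece.

```shell
#!/bin/sh
# Hypothetical setup: a scratch directory with three small files,
# one of which has a space in its name.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n' > "$dir/b c.txt"
: > "$dir/empty.txt"

# Run a command on every file in the directory.
# Here "wc -l" plays the role of the program to be called.
total=0
for f in "$dir"/*; do
    n=$(wc -l < "$f" | tr -d ' ')
    echo "$n $(basename "$f")"
    total=$((total + n))
done
echo "total lines: $total"

rm -rf "$dir"
```

Without the quotes, the shell would split "b c.txt" into two words and the command would be called with two file names that do not exist.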

Of course, there is one twist: occasionally, you don’t want to be efficient but simply relax while doing something like DownArrow+Del+Del+Del a hundred times in a row. But it’s much nicer to choose when to be inefficient yourself.

7 Comments »

  1. There is also a nice tool called “shuffle” (not included in standard distributions). It is quite the opposite of “sort” and makes it easy to draw random samples from vertical files, e.g.
    cat file.vert | shuffle | head -n 250
    would create a random sample of 250 lines.

    Comment by DrNI — March 13, 2007 @ 5:57 pm

  2. Nice assumption that all Computational Linguists use Unix. It all depends on which direction you come from. I come from a mostly humanities background, so Unix didn’t feature highly.

    Nowadays, I find myself doing these simple tasks using C# (you don’t even need to go fully over to the dark side, because of SharpDevelop and Mono). It doesn’t take long to build a little Windows app with loads of buttons to do what you want – rarely needing to type commands, just copy, paste, click, look at results.

    Knowing quite a few people from theoretical linguistics going into field work, none of whom come from a CS background, this is the best approach for them. I tell you they would look at the snippets in your blog and say “che?”

    Comment by Mark — March 13, 2007 @ 9:40 pm

  3. Well, who would do such a ghastly thing as using Word? I only assume you are joking to support your point. Apart from that, you are so right I want to start crying. Another idiom I use quite frequently is
    “find -name -exec ”
    to execute a given command for a set of files. The only time I use Perl is for “perl -wne” one-liners – apply a regular expression to each line in a file and print it out afterwards. Saves you the trouble of learning awk (which I, much to my shame, never did).

    And for the thing with DownArrow…, you know that there are rectangular selections in Emacs? Or keyboard macros? If I want to be bored, I attend the meetings at work:-)

    Comment by Torsten — March 13, 2007 @ 10:55 pm

  4. DrNI:
    Always learning something new – thanks for the input.

    Mark:
    Correction: I didn’t say ‘all’, I said ‘the majority’, and I’m pretty certain that this is correct for – let’s make the CL-vs-NLP distinction this once – natural language processing. I must admit that I don’t really know what kind of things theoretical linguists do when they do field work. But I’m afraid we’re not actually talking about the same thing, which is processing large corpora, not only looking at them but actually traversing them – you are not trying to tell me that you copy and paste several tens of megabytes of text into a small window and ‘look at the results’? I believe you when you say that those Unix commands look rather cryptic to people without a CS background (although they can be learned in a day). But I could imagine that many people would be surprised to find tedious tasks done ten times more efficiently and problems solved by a simple command in a second instead of in an hour by hand (which often means the same thing as copy and paste). Also, I don’t really see the advantage of building window apps with C# when even Perl is often superfluous. I think the bottom line is – you should know what your OS (whichever – even MS Windows provides a few similar commands) can do for you, because otherwise you will end up reimplementing everything over and over, waste time, and use your hands much more than your head.

    Torsten:
    > you know that there are rectangular selections in Emacs?
    Yes, I know, but I forgot how to do it – must have been something like: press CTRL+'}'+'%' and select the rectangle with the middle and the right mouse button at the same time? :D

    Comment by Armin — March 14, 2007 @ 1:05 am

  5. No, actually you need to call the GNU project and they’ll log in remotely into your computer and create the rectangular selection.

    C-x r k Cut (kill) a rectangular selection of the text.
    C-x r y Paste (yank) a cut rectangular selection of text.

    Comment by Torsten — March 14, 2007 @ 9:00 am

  6. Armin,
    obviously, a window isn’t satisfactory for large corpora, but it’s a simple extension to traverse a large collection of documents – which I do a heck of a lot of. Here’s the thing for me … having written a simple point-and-shoot app, I can pass the app off to someone who isn’t trained in programming, and who can then do their own analysis without needing to learn about strings, loops, sorting, regexes etc. It’s about giving simple accessible tools to people who don’t come from CS.

    What do field linguists do? Rely on simple apps like the ones I am talking about. What would be the point of learning Unix commands when most of them don’t even know what a command prompt looks like?

    I just wanted to stand up for the non-CS linguists – equally valid people, doing really useful stuff with corpora – unfortunately, many of them do use Word for text analysis, let’s try and break that cycle by giving them things that don’t scare them.

    Comment by Mark — March 14, 2007 @ 12:29 pm

  7. Mark:
    Mostly agreed. When I first wrote the post, the audience I had in mind was one with some background in CS. But I sometimes do think of the others as well :) , which is, for instance, why we are developing this little (as yet unfinished) graphical lexicography tool: http://diotavelli.net/phrasehunter

    Comment by Armin — March 14, 2007 @ 1:49 pm

