Occasionally, you hear horror stories about CL students or even NLP professionals who edit text files in MS Word and calculate the number of lines by multiplying the number of pages by the estimated number of lines per page. While these are sad and (hopefully) rare extremes, I have recently come across quite a few such error-prone inefficiencies, which not only lead to inaccurate results but, more importantly, make work troublesome and less fun. (Just to make this clear: I do not claim to be free of them, either …) Especially when handling large corpora, the first impulse of many people I know is to write a Perl script for everything they want to do with simple text files.
Let’s look at the corresponding tasks a little more closely: text files constituting corpora are often line- and/or column-based. Typical tasks include counting lines or words, line-by-line comparisons, extracting a certain number of lines from the top or the bottom of a file, finding lines with particular content and extracting that content, comparing the first or second column of two files, and so on. It happens regularly that I get blank stares followed by screaming delight when I tell a colleague, “You know, there is a Unix command that does exactly what you wrote your script for, only much faster and with more options.” For those among you not quite aware of what this is all about (and with apologies to the others …), let me list a few simple real-world examples of commands I often use and find most helpful when processing large (and small) text files:
- extract 100 lines from the top of a file:
    head -n 100 file
- extract 100 lines from the bottom of a file:
    tail -n 100 file
- compare the first 100 lines of two files:
    paste file1 file2 | head -n 100
- compare the second column of two files and look at the last 10 of them (using bash process substitution, so that the columns end up side by side):
    paste <(cut -f2 file1) <(cut -f2 file2) | tail
- count the number of unique lines (uniq only collapses adjacent duplicates, so sort first):
    sort file | uniq | wc -l
- extract everything between ‘SENT’ and ‘SCORE’ on some lines and sort the result:
    grep -P -o '(?<=SENT).+?(?=SCORE)' file | sort
- delete the first word on each of the last 200 lines in a file:
    tail -n 200 file | cut -f1 --complement -d ' '
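To see a few of these commands in action without a real corpus at hand, here is a minimal sketch on made-up data. The two tiny tab-separated files and all variable names are assumptions for illustration only. It also shows why counting unique lines needs a sort first: uniq on its own only merges duplicates that are directly adjacent.

```shell
#!/bin/sh
# Hypothetical sample data: two tiny tab-separated "corpora" (id, token).
tmpdir=$(mktemp -d)
printf 'a\tthe\nb\tcat\na\tthe\nc\tsat\n' > "$tmpdir/file1"
printf 'a\tthe\nb\tdog\na\tthe\nc\tsat\n' > "$tmpdir/file2"

# First two lines of a file.
first_two=$(head -n 2 "$tmpdir/file1")

# Number of distinct lines; the duplicate line is not adjacent, hence the sort.
# tr strips the padding some wc implementations add.
unique_count=$(sort "$tmpdir/file1" | uniq | wc -l | tr -d ' ')

# Second columns of both files, side by side (portable variant with temp files).
cut -f2 "$tmpdir/file1" > "$tmpdir/col1"
cut -f2 "$tmpdir/file2" > "$tmpdir/col2"
side_by_side=$(paste "$tmpdir/col1" "$tmpdir/col2")

echo "$first_two"
echo "$unique_count"
echo "$side_by_side"

rm -rf "$tmpdir"
```

The temporary column files are only there to keep the sketch runnable in a plain POSIX shell; in bash, process substitution does the same job in one line.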
Other very useful tools include join, recode (but be warned about its peculiarities), and, of course, bash itself – it almost hurts when I see someone writing a Perl script only in order to call a program for all files in a directory (‘$> for i in dir/*; do program "$i"; done’). If you know Python (which you should!), almost everything else can be done inside the interactive Python shell with only a couple of lines. Given that a majority of students and researchers in CL work on Unix systems (at least in Europe; I’m not sure about the States or Asia), perhaps these are things that should be taught at university as well?
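The loop over all files in a directory could look roughly like this when written out. This is only a sketch: the temporary directory, the file names, and the use of wc -l as a stand-in for "program" are assumptions. The one detail worth copying is the quoting of the loop variable, which keeps file names containing spaces in one piece.

```shell
#!/bin/sh
# Hypothetical setup: a directory with two small files, one with a space in its name.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'one\ntwo\nthree\n' > "$dir/b c.txt"

# Run a program on every file in the directory; wc -l stands in for "program".
# Quoting "$f" keeps file names containing spaces intact.
total=0
for f in "$dir"/*; do
    n=$(wc -l < "$f")
    total=$((total + n))
done

# Total number of lines across all files in the directory.
echo "$total"

rm -rf "$dir"
```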
Of course, there is one twist: occasionally, you don’t want to be efficient but simply relax while doing something like DownArrow+Del+Del+Del a hundred times in a row. But it’s much nicer to choose for yourself when to be inefficient.