regular expressions saved my life – again

Right, so, in my last entry I talked about how the wonders of regular expressions had saved my life, and therefore filled the world with utter joy.  A few days later I found myself faced with a similar problem and, you guessed it, regular expressions saved my life AGAIN.

The problem is essentially the same as before, only this time I had a medium-sized database dump as a CSV file, and once again I wanted to fill a Java array with the values from certain columns. For the record, as before, this was all for some JUnit tests I was running. A simplified example of what I was doing is shown below:

int count = 0;
while(itor.hasNext()) {
    Student student = itor.next();
    assertTrue("Student " + student.getId() + " != " + idResults_[count],
                    student.getId() == idResults_[count]);
    count++; // move on to the next expected id
}

Basically I have a list of Students that I have created, and I want to ensure their ids correspond to what I expect them to be. To test this I have a data set of around 250 students (I’m not really that interested in checking the ids, it’s more the category a student is in, but the ids example was easier to show).

In the code above idResults_ corresponds to an int array that I would like to generate from a column in the CSV file. So idResults_ looks something like:

int [] idResults_ = {87868,78757,89987,......};

So how did I generate this array? Well, I extended the 5 lines in my last post into a slightly larger Perl script that takes some options and spits out the array initialiser. The actual script can be found HERE. The usage for this script is:

Usage: extract.pl -f <input_file> -c <id> -[hnwisro]
        -h Show this screen
        -n Show the column names in the file
        -w Separate on whitespace (default is a comma)
        -i Don't ignore first line, i.e. it contains the names of the columns
        -s Treat the data as a string, i.e. data in generated array is in 
           double quotes, defaults to an int array
        -r Treat the data as characters, i.e. data in generated array is
           in quotes
        -f  <input_file> Input file
        -c  <id> column to include in array (can be either a number, 
            zero based, or column name)
        -o  <output_file> File array is output to (any other content will be
            over-written)
 
   Outputs a Java/C# array initaliser with values from column <id> from file
 <input_file> and send it out to <output_file>

As you can see, I have extended this somewhat from my previous post into a full-blown utility (useful probably only to me, but hey, who cares). Instead of creating an int array from the data, you can use the -s flag to create a string array (e.g. {"Harry","Sally","Billy"}) or the -r option to create a character array. Furthermore, you can specify the column to create the array from. This can be either a zero-based integer or the name of the column; the latter presumes that the first line in your file contains the names of the columns (à la a CSV file). Also, if the first line contains actual data and you do not want it to be treated as the column names, you can specify the -i option to choose NOT to ignore the values on this line.
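For anyone who would rather stay in Java, here is a rough sketch of the same idea. It is not the Perl script itself: the class name ColumnToArray, the Kind enum and the method names are made up for illustration, and it assumes a naive split on the separator (no quoted fields containing commas).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnToArray {
    // Hypothetical stand-ins for the default (int), -s (string) and -r (char) modes
    enum Kind { INT, STRING, CHAR }

    /** Resolves a column given as a zero-based number or as a header name; -1 if unknown. */
    static int resolveColumn(String id, String[] header) {
        if (id.matches("\\d+")) {
            return Integer.parseInt(id);
        }
        return Arrays.asList(header).indexOf(id);
    }

    /** Builds an initialiser such as {1,2,3} from one column of the given lines. */
    static String initialiser(List<String> lines, String columnId,
                              String separator, boolean firstLineIsData, Kind kind) {
        if (lines.isEmpty()) {
            return "{}";
        }
        String[] header = lines.get(0).trim().split(separator);
        int column = resolveColumn(columnId, header);
        if (column < 0) {
            throw new IllegalArgumentException("Unknown column: " + columnId);
        }
        String quote = kind == Kind.STRING ? "\"" : kind == Kind.CHAR ? "'" : "";
        List<String> values = new ArrayList<>();
        // Skip the first line unless the caller says it holds data rather than names
        for (String line : lines.subList(firstLineIsData ? 0 : 1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.trim().split(separator);
            values.add(quote + fields[column].trim() + quote);
        }
        return "{" + String.join(",", values) + "}";
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("id,name", "87868,Harry", "78757,Sally");
        System.out.println(initialiser(csv, "id", ",", false, Kind.INT));   // {87868,78757}
        System.out.println(initialiser(csv, "1", ",", false, Kind.STRING)); // {"Harry","Sally"}
    }
}
```

Passing "\\s+" as the separator would roughly correspond to the -w option. A real CSV parser would be the safer choice for anything with quoted fields.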

Well that’s it. Hope someone else finds the script useful. Over and out.

regular expressions saved my life

A little note to all those programmers who have not taken the time to learn how to use regular expressions: DO IT NOW. I think learning Perl and regular expressions while I was working at Cisco was almost the best thing I took from that job. You never realise how useful it is to know until the day you are presented with a rather large text file and you have to extract some of the data. This is what happened to me last week; and it ain’t the first time either.

The file I was looking at was around 250 lines long and consisted of three tab-separated numbers on each line, of which I was interested in only one at a time. The idea was to generate an array in Java initialised with these numbers, e.g.:

int [] = {2,3,4,5,7,8,8,2,3............}

For the record, this was just for some testing I was doing. Anyway, doing this without regular expressions would have been a nightmare, as not only did each file have 250 lines but there were 5 files. Instead, 5 lines of Perl:

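my $ar = "{";                                          # assumes $text holds the whole file
while($text =~ /^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/gm) {    # three numbers per line
    $ar .=  "$2,";}                                    # keep only the second one
chop $ar;                                              # drop the trailing comma
$ar .= "}";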

did it in a flash. Now what do those people who do not know how to use regular expressions do in this situation?
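For what it's worth, the same trick carries over to Java more or less unchanged. The sketch below uses java.util.regex with the same pattern; the class name SecondColumn and the extract method are just illustrative names, and it assumes the whole file has already been read into a single String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecondColumn {
    // Same pattern as the Perl version; MULTILINE makes ^ and $ match at each line
    private static final Pattern ROW =
            Pattern.compile("^\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*$", Pattern.MULTILINE);

    /** Collects capture group 1, 2 or 3 from every matching line into {a,b,c} form. */
    static String extract(String text, int group) {
        StringBuilder ar = new StringBuilder("{");
        Matcher m = ROW.matcher(text);
        while (m.find()) {
            if (ar.length() > 1) {
                ar.append(','); // separator only between values, so nothing to chop later
            }
            ar.append(m.group(group));
        }
        return ar.append('}').toString();
    }

    public static void main(String[] args) {
        String text = "1\t2\t3\n4\t5\t6\n7\t8\t9\n";
        System.out.println(extract(text, 2)); // {2,5,8}
    }
}
```

One small difference from the Perl: because the comma goes in before each value rather than after, an input with no matching lines gives {} instead of having its opening brace chopped off.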

They should teach more people this kind of programming at university. I’m all for learning the theory behind things, but there does have to be a better theory/practical split in my opinion. Maybe even a course called “Practical Computer Programming”, where they teach things like regular expressions, debugging, memory management (yes even for Java programmers), design patterns, useful data structures, etc. If anyone wants to pay me a tidy sum to teach it then your wish would be my command 🙂