Parsing a log file using Perl

Question

I have a log file where some of the entries look like this:

YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC

and I'm trying to get it into a CSV format:

Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC

I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.

So far I can read the file in and pick out only the lines which meet my criteria, but I can't seem to get the next stage done. I need to split up the input line, and I can't work out how to do this. I've looked at s/// and m// but they don't seem to give me what I want. If anyone can advise me how this can be done or give me pointers, I'd much appreciate it.

Important points:

The values in the second part of the line are always in the same order so mapping / re-organising is not necessarily a problem.
Some of the fields have free text which is not quoted :( but as the labels all start v<number>= I'm hoping parsing this should still be a possibility.

How does m// not give you what you need? This looks like a perfect case for regular expressions.
– Wooble, Commented May 12, 2011 at 13:32
@Wooble I tried m// and it gave me back 1 (i.e. true/false?). I guess I just couldn't figure out how to use it from the examples I found.
– Component 10, Commented May 12, 2011 at 14:57
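
On the 1 returned by m// in the comment above: the match operator returns a true/false value in scalar context and the list of captured groups in list context. A minimal sketch, assuming a shortened sample line in the question's format (the variable names are illustrative, not from the original post):

```perl
use strict;
use warnings;

my $line = 'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY';

# Scalar context: m// only reports whether the match succeeded.
my $ok = $line =~ m/v1=(\S+)/;

# List context: m// returns the captured groups instead.
my ($v1) = $line =~ m/v1=(\S+)/;
my ($date, $time) = $line =~ m/^(\S+) (\S+)/;

print "matched: $ok\n";
print "v1=$v1 date=$date time=$time\n";
```

The parentheses around `$v1` are what force list context; without them the assignment stores the success flag rather than the capture.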

JSBձոգչ · Accepted Answer · 2011-05-12 13:52:43Z

6

Since there is no single delimiter, you'll need to split the line in stages:

First, split on ' ', then take the first three values (the date, the time, and the first word of the constant text):

my @array = split / /, $line;
my ($date, $time, $constant) = splice @array, 0, 3;

Join the rest of the fields together again, and re-split on v\d+= to get the values:

my $rest = join ' ', @array;

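# $rest now holds any leftover constant text followed by "v1=XXX v2=YYY ..."
my @values = split /\s*v\d+=/, $rest;
shift @values; # discard whatever preceded "v1=" (empty, or leftover constant text)

chomp @values; # the last value still carries the line's trailing newline
print join(',', $date, $time, @values), "\n";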
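
For reference, the split / splice / re-split steps above can be put together into one runnable script. This is only a sketch: `line_to_csv` is a hypothetical helper name, the input is the sample line from the question, and it takes just the date and time from the front, leaving the `shift` to discard the constant text.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one log line into a CSV row.
sub line_to_csv {
    my ($line) = @_;
    chomp $line;

    my @array = split / /, $line;
    my ($date, $time) = splice @array, 0, 2;

    # What is left: the constant text followed by the vN=... pairs.
    my $rest   = join ' ', @array;
    my @values = split /\s*v\d+=/, $rest;
    shift @values;    # drop the text that preceded "v1="

    return join ',', $date, $time, @values;
}

print "Date,Time,v1,v2,v3,v4,v5\n";
print line_to_csv('YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC'), "\n";
```

Note that no CSV quoting is done here, so a value containing a comma would need extra handling.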

Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.

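# assume that CONSTANT is your constant text
# (wrap it in \Q...\E if it contains regex metacharacters)
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my @values = split /\s*v\d+=/, $valuelist;
shift @values; # the first element is empty because $valuelist starts with "v1="
chomp @values; # drop the trailing newline from the last value

# join only the fields; passing "\n" to join would leave a trailing comma
print join(',', $date, $time, @values), "\n";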
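
A self-contained version of this second approach might look like the following. It is a sketch under assumptions: the constant text is taken to be the literal `<Some constant text>` from the sample line, and `split_on_constant` is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed literal constant text, copied from the sample line.
my $constant = '<Some constant text>';

sub split_on_constant {
    my ($line) = @_;
    chomp $line;

    # \Q...\E quotes any regex metacharacters in the constant text;
    # the limit of 2 splits only at its first occurrence.
    my ($datetime, $valuelist) = split /\s*\Q$constant\E\s*/, $line, 2;
    return unless defined $valuelist;    # constant text not found

    my ($date, $time) = split ' ', $datetime;
    my @values = split /\s*v\d+=/, $valuelist;
    shift @values;                       # empty field before "v1="

    return ($date, $time, @values);
}

my @fields = split_on_constant("YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC\n");
print join(',', @fields), "\n";
```

Returning an empty list when the constant text is missing lets the caller skip non-matching lines with a simple `next unless @fields;`.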

edited May 12, 2011 at 13:52

answered May 12, 2011 at 13:33

JSBձոգչ


3 Comments

Dave Sherohman Over a year ago

The split on space will mess v4 up, because it contains embedded spaces.

JSBձոգչ Over a year ago

@Dave, that's why I added join ' ', @array before attempting to split on v\d+=.

Component 10 Over a year ago

Thanks very much. I gave your first approach a try and it worked a treat!

Dave Sherohman · Answer · 2011-05-12 14:06:03Z

4

What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:

#!/usr/bin/env perl

use strict;
use warnings;

print "Date,Time,v1,v2,v3,v4,v5\n";

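while (my $line = <DATA>) {
    # In list context, m{} returns the captured groups (an empty list on failure)
    my @matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)}
        or next;    # skip lines that don't fit the format
    print join(',', @matched), "\n";
}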

__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC

Two caveats:

1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.

2) This code assumes that there will always be v1 through v5. If there are fewer than five vN fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their vN tags).
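
If the second caveat matters, one possible way around it is to collect the tagged fields with a global match instead of a fixed pattern. This is a sketch only, not part of the answer's code: `parse_fields` is a hypothetical helper, and it still relies on the first caveat (a value must not contain text that looks like the next tag).

```perl
use strict;
use warnings;

# Hypothetical helper: collect every vN=value pair into a hash keyed by N.
sub parse_fields {
    my ($line) = @_;
    chomp $line;
    my %field;
    # Each value runs (non-greedily) up to the next " vN=" tag or end of line.
    while ($line =~ /\bv(\d+)=(.*?)(?=\s+v\d+=|$)/g) {
        $field{$1} = $2;
    }
    return %field;
}

# Sample line with v3 and v5 deliberately left out.
my %f = parse_fields('YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v4=AAA AND BBB');

# Missing fields are emitted as empty columns.
print join(',', map { $f{$_} // '' } 1 .. 5), "\n";
```

The `//` (defined-or) operator needs Perl 5.10 or later.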

edited May 12, 2011 at 14:06

answered May 12, 2011 at 13:40

Dave Sherohman


4 Comments

JSBձոգչ Over a year ago

This assumes that the value list always contains exactly 5 values. If that assumption holds, then this works fine, but the match will fail if we ever have a different list than that.

Dave Sherohman Over a year ago

@JSBangs: Good point. Edited to make that assumption explicit.

Component 10 Over a year ago

That's clever. Must admit I did not know about m{}. Thanks.

Dave Sherohman Over a year ago

@Robin: Keep in mind that m{} and s{}{} are exactly the same as m// and s/// aside from not needing to escape / characters. I just opted for the {} delimiters to avoid a lot of \/\/\/ when matching the date, then decided to match it without worrying about the slashes anyhow...
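
To illustrate the delimiter point in the comment above, here is a small sketch that matches a date both ways. The sample line uses made-up numeric values, since the question only shows placeholders.

```perl
use strict;
use warnings;

my $line = '11/05/12 13:32:07:123 rest of line';    # made-up sample values

# With / as the delimiter, each literal slash must be escaped.
my @with_slashes = $line =~ m/^(\d\d)\/(\d\d)\/(\d\d)/;

# With {} as the delimiter, the same pattern needs no escaping.
my @with_braces = $line =~ m{^(\d\d)/(\d\d)/(\d\d)};

print "@with_slashes\n";
print "@with_braces\n";
```

Both matches capture the same three groups; only the readability of the pattern differs.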

snoofkin · Answer · 2011-05-12 20:24:31Z

1

If the log is fixed-width, you're better off using unpack; the performance benefit shows once the log grows very large.
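
As a rough illustration of unpack, the sketch below assumes that only the leading timestamp is fixed-width (8 characters of date, a space, 12 characters of time), which is what the sample format in the question suggests. The values are made up, and the rest of the line would still need one of the split or regex approaches.

```perl
use strict;
use warnings;

# Made-up sample line; only the timestamp columns are assumed fixed-width.
my $line = '11/05/12 13:32:07:123 <Some constant text> v1=XXX v2=YYY';

# A8  = 8-character date, x = skip one byte (the space),
# A12 = 12-character time, A* = everything that remains.
my ($date, $time, $rest) = unpack 'A8 x A12 x A*', $line;

print "$date,$time\n";
print "$rest\n";
```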

answered May 12, 2011 at 20:24

snoofkin
