1

I have a log file where some of the entries look like this:

YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC

and I'm trying to get it into a CSV format:

Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC

I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.

So far I can read the file in and pick out only the lines that meet my criteria, but I can't get the next stage done. I need to split up the input line and I can't work out how to do it. I've looked at s/// and m//, but they don't seem to give me what I want. If anyone can advise me how this can be done or give me pointers, I'd much appreciate it.

Important points:

  • The values in the second part of the line are always in the same order, so mapping / re-organising is not necessarily a problem.
  • Some of the fields have free text which is not quoted :( but as the labels all start v<number>= I'm hoping parsing this should still be a possibility.
2
  • How does m// not give you what you need? This looks like a perfect case for regular expressions. Commented May 12, 2011 at 13:32
  • @Wooble I tried m// and it gave me back 1 (i.e. true / false? ) - I guess I just couldn't figure out how to use it from the examples I found. Commented May 12, 2011 at 14:57
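On the "it gave me back 1" point in the comment above: m// behaves differently depending on context. In scalar context it only reports whether the match succeeded, while in list context it returns the captured groups. A minimal sketch, assuming a sample line shaped like the one in the question (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $line = 'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY';

# Scalar context: the match only reports success (1) or failure.
my $ok = $line =~ m/v1=(\S+)/;

# List context (note the parentheses): the match returns the captures.
my ($v1) = $line =~ m/v1=(\S+)/;

print "scalar: $ok\n";
print "list:   $v1\n";
```

The only difference between the two assignments is the parentheses around the variable on the left-hand side.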

3 Answers

6

Since there is no single delimiter, you'll need to split the line in a few steps:

First, split on ' ', then take the first three values:

my @array = split / /, $line;
my ($date, $time, $constant) = splice @array, 0, 3;

Join the rest of the fields together again, and re-split on v\d+= to get the values:

my $rest = join ' ', @array;

# $rest now holds any leftover constant text followed by "v1=XXX v2=YYY ..."
my @values = split /\s*v\d+=/, $rest;
shift @values; # discard the first element: whatever preceded v1= (empty or leftover constant text)

print join ',', $date, $time, @values;

Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.

# assume that CONSTANT is your constant text
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my @values = split /\s*v\d+=/, $valuelist;
shift @values;

# join only the fields, then append the newline (assumes $line was chomp-ed)
print join(',', $date, $time, @values), "\n";
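For reference, the second approach can be wrapped into a small self-contained script. This is only a sketch under a few assumptions: the helper name log_line_to_csv is hypothetical, the constant text is passed in as a literal string, and values are assumed not to contain commas (no CSV quoting is done).

```perl
use strict;
use warnings;

# Hypothetical helper: turns one log line into one CSV line.
# $constant is the literal constant text (an assumption; substitute your own).
sub log_line_to_csv {
    my ($line, $constant) = @_;
    chomp $line;    # keep the trailing newline out of the last value

    # \Q...\E quotes regex metacharacters in the constant text.
    my ($datetime, $valuelist) = split /\s*\Q$constant\E\s*/, $line, 2;
    return unless defined $valuelist;    # constant text not found

    my ($date, $time) = split / /, $datetime;

    # A limit of -1 keeps trailing empty values such as a bare "v5=".
    my @values = split /\s*v\d+=/, $valuelist, -1;
    shift @values;    # the text before the first v<number>= is empty

    return join ',', $date, $time, @values;
}

print log_line_to_csv(
    'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC',
    '<Some constant text>',
), "\n";
```

Building the CSV line with join and appending the newline separately avoids a stray comma at the end of each row.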

3 Comments

The split on space will mess v4 up, because it contains embedded spaces.
@Dave, that's why I added join ' ', @array before attempting to split on the v\d+=.
Thanks very much. I gave your first approach a try and it worked a treat!
4

What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:

#!/usr/bin/env perl

use strict;
use warnings;

print "Date,Time,v1,v2,v3,v4,v5\n";

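while (my $line = <DATA>) {
    # In list context the match returns the captured groups.
    my @matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)}
        or next;    # skip lines that do not match instead of printing a blank row
    print join(',', @matched), "\n";
}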

__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC

Two caveats:

1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.

2) This code assumes that there will always be v1 through v5. If there are fewer than five v<number> fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their v<number>= tags).
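If the second caveat is a concern, one possible alternative is to collect the tag/value pairs with a global match instead of hard-coding v1 through v5. This is a sketch rather than a drop-in replacement: the helper name parse_log_line is hypothetical, it assumes the constant text never contains something that looks like a v<number>= tag, and it needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, time, { v1 => ..., v2 => ..., ... }).
sub parse_log_line {
    my ($line) = @_;
    chomp $line;

    my ($date, $time) = $line =~ m{^(\S+) (\S+)} or return;

    # Each value runs lazily up to the next " v<number>=" tag or the end of the line.
    my %value = $line =~ m{\b(v\d+)=(.*?)(?=\s+v\d+=|$)}g;

    return ($date, $time, \%value);
}

my ($date, $time, $value) = parse_log_line(
    'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC'
);

# The column list is fixed up front; missing fields become empty CSV cells.
print join(',', $date, $time, map { $value->{$_} // '' } qw(v1 v2 v3 v4 v5)), "\n";
```

Because the values end up in a hash keyed by tag, a line with a missing or extra field still produces a row with the expected columns.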

4 Comments

This assumes that the value list always contains exactly 5 values. If that assumption holds, then this works fine, but the match will fail if we ever have a different list than that.
@JSBangs: Good point. Edited to make that assumption explicit.
That's clever. Must admit I did not know about m{} Thanks.
@Robin: Keep in mind that m{} and s{}{} are exactly the same as m// and s/// aside from not needing to escape / characters. I just opted for the {} delimiters to avoid a lot of \/\/\/ when matching the date, then decided to match it without worrying about the slashes anyhow...
1

If the log is fixed-width, you would be better off using unpack. The performance benefit becomes noticeable once the log grows very large.
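As a rough illustration, and assuming the date and time columns really are 8 and 12 characters wide as the YY/MM/DD HH:MM:SS:MMM sample suggests, unpack can peel off the fixed-width prefix. The values after the constant text are free text, so they would still need one of the split or regex approaches:

```perl
use strict;
use warnings;

my $line = 'YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC';

# A8  = 8-character date, x = skip one separator byte,
# A12 = 12-character time, A* = everything that remains.
my ($date, $time, $rest) = unpack 'A8 x A12 x A*', $line;

print "$date|$time|$rest\n";
```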
