Parsing structured file in Ruby

Question

I want to parse a large log file (about 500 MB). If this isn't the right tool for the job, please let me know.

I have a log file with its contents structured like this. Each section can have extra key-value pairs:

requestID: saldksadk
time: 92389389
action: foobarr
----------------------
requestID: 2393029
time: 92389389
action: helloworld
source: email
----------------------
requestID: skjflkjasf3
time: 92389389
userAgent: mobile browser
----------------------
requestID: gdfgfdsdf
time: 92389389
action: randoms

I was wondering if there is an easy way to handle each section's data in the log. A section can span multiple lines, so I can't just split the string. For example, is there an easy way to do something like this:

for(section in log){
   // handle section contents
}

Don't downvote if you are not going to give a specific reason. Upvoting
– StackOverflower, Commented Jun 7, 2013 at 2:45
First thing, don't try to load 500MB into memory at once, which is what you'd have to do to split the file. It's just not a scalable solution.
– the Tin Man, Commented Jun 7, 2013 at 2:54
I never said I wanted to load it all into memory... That is why I posted here to look for advice.
– thunderousNinja, Commented Jun 7, 2013 at 2:57
Many of these solutions could be extended to stream data rather than collect it in memory all at once using yield, I believe.
– icktoofay, Commented Jun 7, 2013 at 3:37

ian · Accepted Answer · 2013-06-07 03:52:24Z

5

Using icktoofay's idea, and by using a custom record separator, I got this:

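require 'yaml'

File.open("path/to/file") do |f|
  # Read one record at a time, using the dashed line as the record separator.
  f.each_line("\n----------------------\n") do |record|
    # Collapse the trailing dashed line to YAML's "---" document marker.
    puts YAML.load(record.sub(/^-{3,}$/, "---")).inspect
  end
end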

The output:

{"requestID"=>"saldksadk", "time"=>92389389, "action"=>"foobarr"}
{"requestID"=>2393029, "time"=>92389389, "action"=>"helloworld", "source"=>"email"}
{"requestID"=>"skjflkjasf3", "time"=>92389389, "userAgent"=>"mobile browser"}
{"requestID"=>"gdfgfdsdf", "time"=>92389389, "action"=>"randoms"}

edited Jun 7, 2013 at 3:52

answered Jun 7, 2013 at 3:29


5 Comments

the Tin Man Over a year ago

Instead of f = File.new "path/to/file" f.each_line("----------------------") ... use File.foreach('path/to/file', "----------------------") .... Ruby will automatically close the file after the block exits.

icktoofay Over a year ago

I think the argument to each_line should have a newline at the beginning and end; otherwise, it will split on dashes in the middle of the line.

ian Over a year ago

@icktoofay It won't if the hyphens are always the same number. If they were variable lengths though, you'd need to do a bit more work on the inside to clear up records of only hyphens, but that's not so bad, perhaps next if line.start_with? "-" or something like that.

icktoofay Over a year ago

@iain: I was referring to dashes inside of values, like userAgent: Mozilla----------------------Firefox

ian Over a year ago

@icktoofay Ah, I see. I tested it and it still works, so I've updated the answer with that.
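
The difference between the two separators discussed in these comments can be checked with a small sketch. It uses a `StringIO` in place of the real file, and `raw_records` and `RULE` are hypothetical names introduced only for this illustration.

```ruby
require 'stringio'

RULE = "----------------------" # the dashed line from the sample log

# Hypothetical helper: split an IO into raw record strings on `separator`.
def raw_records(io, separator)
  io.each_line(separator).map { |chunk| chunk.chomp(separator).chomp }
end

text = "requestID: abc\nuserAgent: Mozilla#{RULE}Firefox\n#{RULE}\nrequestID: def\n"

# Bare dashes as the separator also split inside the userAgent value.
p raw_records(StringIO.new(text), RULE)

# Newline-wrapped dashes only match a dashed line that stands on its own.
p raw_records(StringIO.new(text), "\n#{RULE}\n")
```

With the bare separator the value is cut in two, while the newline-wrapped separator leaves it intact.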

icktoofay · Accepted Answer · 2013-06-07 02:54:28Z

4

That looks like YAML, although it is not exactly YAML. (YAML separates documents with exactly three dashes, no more.) You might try to mangle your document somehow such that lines consisting of only hyphens are collapsed into three hyphens so it is valid YAML. After that, you can feed it into a YAML parser.
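
A minimal sketch of that idea, assuming the text is small enough to hold in memory (for a 500 MB file you would want to apply the same substitution record by record instead). `parse_dashed_log` is a hypothetical helper name; `YAML.load_stream` parses every document in the string.

```ruby
require 'yaml'

# Hypothetical helper: collapse dash-only lines to YAML's "---" document
# marker, then parse each document into a Hash.
def parse_dashed_log(text)
  normalized = text.gsub(/^-{4,}$/, '---')
  YAML.load_stream(normalized)
end

sample = <<~LOG
  requestID: saldksadk
  time: 92389389
  action: foobarr
  ----------------------
  requestID: 2393029
  time: 92389389
  action: helloworld
  source: email
LOG

parse_dashed_log(sample).each { |record| p record }
```

Note that YAML applies its own typing, so purely numeric values such as `time` come back as integers rather than strings.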

edited Jun 7, 2013 at 2:54

answered Jun 7, 2013 at 2:44


4 Comments

lc2817 Over a year ago

is ---------------------- valid in yaml?

icktoofay Over a year ago

@lc2817: Yup; it separates documents. See the spec.

icktoofay Over a year ago

@lc2817: Actually, pardon me; it appears that three and only three dashes are allowed. I had incorrectly assumed that more than three dashes were allowed. You are right; it is incorrect YAML.

ian Over a year ago

@icktoofay Good idea, +1. I've used this idea for my answer.

the Tin Man · Accepted Answer · 2013-06-07 03:24:13Z

I saved your sample text to a file called "test.txt". Opening it with:

File.foreach('test.txt').slice_before(/^---/).to_a

returns:

[
  ["requestID: saldksadk\n", "time: 92389389\n", "action: foobarr\n"], 
  ["----------------------\n", "requestID: 2393029\n", "time: 92389389\n", "action: helloworld\n", "source: email\n"], 
  ["----------------------\n", "requestID: skjflkjasf3\n", "time: 92389389\n", "userAgent: mobile browser\n"], 
  ["----------------------\n", "requestID: gdfgfdsdf\n", "time: 92389389\n", "action: randoms\n"]
]

By running each sub-array through a filter we can strip off the leading "---":

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  ary.map(&:chomp)
}

After running that, blocks is:

[
  ["requestID: saldksadk", "time: 92389389", "action: foobarr"],
  ["requestID: 2393029", "time: 92389389", "action: helloworld", "source: email"],
  ["requestID: skjflkjasf3", "time: 92389389", "userAgent: mobile browser"],
  ["requestID: gdfgfdsdf", "time: 92389389", "action: randoms"]
]

A bit more tweaking:

blocks = File.foreach('test.txt').slice_before(/^---/).map { |ary|
  ary.shift if ary.first[/^---/]
  Hash[ary.map{ |s| s.chomp.split(':') }]
}

and blocks will be:

[
  {"requestID"=>" saldksadk", "time"=>" 92389389", "action"=>" foobarr"},
  {"requestID"=>" 2393029", "time"=>" 92389389", "action"=>" helloworld", "source"=>" email"},
  {"requestID"=>" skjflkjasf3", "time"=>" 92389389", "userAgent"=>" mobile browser"},
  {"requestID"=>" gdfgfdsdf", "time"=>" 92389389", "action"=>" randoms"}
]
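
The snippets above collect every block into one array, which may be too much for a 500 MB file. As a rough sketch of a streaming variant, the enumerator returned by `slice_before` can be walked with `each`, so only one section is held at a time. `each_section` is a hypothetical helper name, and the sketch assumes every data line contains a colon; it also trims the leading space from values and splits only on the first colon.

```ruby
require 'stringio'

# Hypothetical helper: yields one Hash per section, keeping only the
# current section's lines in memory.
def each_section(io)
  return to_enum(:each_section, io) unless block_given?

  io.each_line.slice_before(/^-{3,}$/).each do |lines|
    # Keep only "key: value" lines; this drops separator and blank lines.
    pairs = lines.grep(/:/).map { |s| s.split(':', 2).map(&:strip) }
    yield pairs.to_h unless pairs.empty?
  end
end

log = StringIO.new(<<~LOG)
  requestID: saldksadk
  time: 92389389
  ----------------------
  requestID: 2393029
  source: email
LOG

each_section(log) { |section| p section }
```

For a real file, the same method could be called as `File.open('test.txt') { |f| each_section(f) { |section| ... } }`.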

Chris Heald · Accepted Answer · 2013-06-07 03:40:25Z

You can read through the file line-by-line. For each line, we'll check if it's a record separator or a key: value pair. If the former, we'll add the current record to the record list. If the latter, we'll add the k:v pair to the current record.

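records = []
record = {}
# File.foreach streams the file line by line and closes it when done.
File.foreach("data.txt") do |line|
  if line.start_with? "-"
    # Separator line: store the finished record and start a new one.
    records << record unless record.empty?
    record = {}
  else
    k, v = line.split(":", 2).map(&:strip)
    record[k] = v
  end
end
# The last record has no trailing separator, so add it here.
records << record unless record.empty?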

This produces something like:

[{"requestID"=>"saldksadk", "time"=>"92389389", "action"=>"foobarr"},
 {"requestID"=>"2393029", "time"=>"92389389", "action"=>"helloworld", "source"=>"email"},
 {"requestID"=>"skjflkjasf3", "time"=>"92389389", "userAgent"=>"mobile browser"}, 
 {"requestID"=>"gdfgfdsdf", "time"=>"92389389", "action"=>"randoms"}]

Chris Cherry · Accepted Answer · 2013-06-07 03:39:12Z

A very basic way to do it that keeps things simple and efficient:

blocks = []
current_block = {}

sep_range = 0..3
sep_value = "----"

split_pattern = /:\s*/

File.open("filename.txt", 'r') do |f|
  f.each_line do |line|
    if line[sep_range] == sep_value
      blocks << current_block unless current_block.empty?
      current_block = {}
    else
      key, value = line.chomp.split(split_pattern, 2)
      current_block[key] = value
    end
  end
end

blocks << current_block unless current_block.empty?

A key point is that we avoid creating unnecessary duplicate objects inside the loop (the range, the test string, and the split regex pattern) by defining them before the loop begins. This may save a little time and memory, which could add up on a 500 MB file.
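
How much hoisting a literal out of the loop saves depends on whether Ruby would otherwise build a new object on every iteration, and that can differ by literal type, Ruby version, and settings such as frozen string literals. A minimal sketch for checking this on your own interpreter follows; `literal_object_ids` is a hypothetical helper that counts how many distinct objects each literal produces across a few iterations.

```ruby
# Hypothetical helper: evaluate each literal several times and report how
# many distinct objects were seen (1 means the same object was reused).
def literal_object_ids(iterations = 3)
  ids = { string: [], regexp: [], range: [] }
  iterations.times do
    ids[:string] << "----".object_id
    ids[:regexp] << /:\s*/.object_id
    ids[:range]  << (0..3).object_id
  end
  ids.transform_values { |seen| seen.uniq.size }
end

literal_object_ids.each do |kind, distinct|
  puts "#{kind}: #{distinct} distinct object(s)"
end
```

A count of 1 suggests the interpreter already reuses that literal, so hoisting it mainly helps readability; a higher count means hoisting avoids an allocation per line.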
