Converting raw log file data into CSV file

Question

I have some raw data I scraped from a log file, which currently reads as:

"   80:  0.20%:  2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3", 
"   71:  0.16%:  14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3", 
"   67:  0.17%:  2/Jan/14 23:44: /site/podcasts/audio/podcast-3-21-2011.mp3", 
"   67:  0.15%:  15/Jan/14 09:25: /site/podcasts/audio/2013/podcast-08-05-2013.mp3", 
"   64:  0.12%:  2/Jan/14 07:40: /site/podcasts/audio/2013/podcast-11-04-2013-1.mp3",

I need to gather three pieces of information for an Excel spreadsheet -- the number before the initial colon, the date, and the URL. So if I converted it into CSV, it would read as:

80, 2/Jan/14, /site/podcasts/audio/2013/podcast-07-15-2013.mp3
71, 14/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013.mp3
67, 2/Jan/14, /site/podcasts/audio/podcast-3-21-2011.mp3

And so on. However, I'm having trouble figuring out how to do that. I wrote some regexes to capture the right data, but I'm not sure how to convert those regexes into what I need.

There's this regex to get the first number: ^"\s{3}(\d+)

And this regex could get the date: (\d+\/\w{3}\/14)

And this regex could get the URL: (\/site\/podcasts\/audio\/.*\.mp3)

However, I'm not sure how to take these regexes and convert them into the CSV I need. Any ideas?

Does your log file actually have those quotes and commas in it?
– JKillian, Commented Aug 21, 2014 at 14:06
Hi @sawa, yes, it's an array (I removed the brackets). I noticed that you had earlier posted what seemed like an elegant solution to this problem, but it's no longer there -- unfortunately, I haven't been able to test it before now. Is there a reason why it's been deleted?
– CodeBiker, Commented Aug 22, 2014 at 17:09
And @jkillian, no, the log file does not. The data above is what I scraped from the log file with my Ruby script.
– CodeBiker, Commented Aug 22, 2014 at 17:10

JKillian · Accepted Answer · 2014-08-22 17:57:25Z

1

I personally wouldn't use regular expressions:

output = ''
File.open("path/to/log", "r") do |f|
  f.each_line do |line|
    num, percent, date, time, url = line.split  # split with no argument ignores leading whitespace
    num = num[0..-2]  # removes the colon from the end of the number
    output << "#{num}, #{date}, #{url}\n"
  end
end

# do whatever you want with the result
puts output

And this prints:

80, 2/Jan/14, /site/podcasts/audio/2013/podcast-07-15-2013.mp3
71, 14/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013.mp3
67, 2/Jan/14, /site/podcasts/audio/podcast-3-21-2011.mp3
67, 15/Jan/14, /site/podcasts/audio/2013/podcast-08-05-2013.mp3
64, 2/Jan/14, /site/podcasts/audio/2013/podcast-11-04-2013-1.mp3

There are shorter, more clever ways to do this, but I like this way because it's readable and clear.
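Since the goal is a CSV for Excel, the same split-based idea could also be routed through Ruby's standard csv library, which handles quoting if a field ever contains a comma. This is only a minimal sketch: the sample lines are copied from the question (without the quotes and commas), and parse_line is a hypothetical helper name, not something from the answer.

```ruby
require "csv"

# Sample lines copied from the question, assumed to match the real log format.
log_lines = [
  "   80:  0.20%:  2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3",
  "   71:  0.16%:  14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3"
]

# Hypothetical helper: turns one log line into [count, date, url].
def parse_line(line)
  # No-argument split breaks on runs of whitespace and skips leading whitespace.
  num, _percent, date, _time, url = line.split
  # delete_suffix removes the trailing colon from the count (Ruby 2.5+).
  [num.delete_suffix(":"), date, url]
end

# CSV.generate builds the CSV text in memory, one row per log line.
csv_text = CSV.generate do |csv|
  log_lines.each { |line| csv << parse_line(line) }
end

puts csv_text
```

Note that the library writes fields without a space after each comma. To write a file instead of a string, CSV.open with a path of your choosing and "w" mode accepts rows the same way.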

edited Aug 22, 2014 at 17:57

answered Aug 21, 2014 at 14:14

JKillian


4 Comments

CodeBiker Over a year ago

Hi @jkillian, this seems like an excellent solution -- thank you. However, I am getting outputs in the format of , 0.20%:, 21:01: and so on -- so it's pulling the wrong pieces of data from each line. I'm trying to figure it out, but do you have an idea of why?

JKillian Over a year ago

@CodeBiker Your log format might be a little different than what I expected... Add puts line above the line with split and give me an example of what a line looks like.

CodeBiker Over a year ago

This seems to work, although I removed the part about the colon because sometimes that number has more than two digits:

File.open("log", "r") do |f|
  f.each_line do |line|
    quote, num, percent, date, time, url = line.split(/\s+/)
    output << "#{num}, #{date}, #{url}\n"
  end
end

JKillian Over a year ago

@CodeBiker Great, glad it worked for you! Actually though, it doesn't matter how many digits the number has: The 0..-2 range includes the characters from the first character (0) to the second to last character (-2) inclusive. So that line basically just trims off the last character no matter what.
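The two behaviours discussed in this thread can be checked in a few lines. The shifted output reported above is consistent with lines that begin with whitespace: splitting on the regex /\s+/ keeps an empty first field, which is probably why the extra leading variable in the comment's version lined things up. The sample line is copied from the question and is assumed to match the real log.

```ruby
line = "   80:  0.20%:  2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3"

# Splitting on a regex keeps an empty first field when the line starts with whitespace.
with_regex = line.split(/\s+/)

# Splitting with no argument skips leading whitespace, so the count comes first.
no_arg = line.split

# The 0..-2 range drops only the final character, whatever the string's length.
short = "80:"[0..-2]
long  = "12345:"[0..-2]

puts with_regex.first.inspect  # the empty leading field
puts no_arg.first              # the count, still with its colon
puts short, long               # counts with the colon trimmed
```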

asontu · Answer · 2014-08-21 14:09:47Z

1

This combines your three patterns into a single regex with capture groups that you can then handle in Ruby. I'm unfamiliar with Ruby, but I imagine you can concatenate the strings that the capture groups return.

^"\s{3}(\d+)(?:[\s:]|\d\.\d\d%)*(\d+\/\w{3}\/14)[\s\d:]*(\/site\/podcasts\/audio\/.*\.mp3)
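For readers wondering how the capture groups might be joined in Ruby, here is a minimal sketch. It assumes the scraped strings still carry the leading quote that the pattern expects; the variable names are illustrative only.

```ruby
# The pattern from this answer, written as a Ruby regex literal.
pattern = /^"\s{3}(\d+)(?:[\s:]|\d\.\d\d%)*(\d+\/\w{3}\/14)[\s\d:]*(\/site\/podcasts\/audio\/.*\.mp3)/

# Two scraped strings copied from the question, including the quotes and commas.
scraped = [
  '"   80:  0.20%:  2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3", ',
  '"   71:  0.16%:  14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3", '
]

rows = scraped.map do |line|
  match = pattern.match(line)
  # match is nil when a line does not fit the pattern; compact drops those.
  match && match.captures.join(", ")
end.compact

puts rows
```

MatchData#captures returns the three groups as an array of strings, so join is enough to build each CSV row.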


answered Aug 21, 2014 at 14:09

asontu


1 Comment

CodeBiker Over a year ago

I marked jkillian's answer above as correct because it got me to the right information. However, your regex is excellent -- the problem is that I'm not familiar enough with Ruby to know how to concatenate the strings from the capture groups.

vks · Answer · 2014-08-21 14:10:07Z

1

\s+(\d+):\s+.*?(\d+\/\w+\/\d+)\s+.*?(\/.*?)\".*

Try this. Please look at the demo.

http://regex101.com/r/cA4wE0/10
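As a rough sketch of how this pattern might be applied in Ruby, String#scan can run it over the whole scraped text at once and return one array of captures per match. This assumes each entry keeps its closing quote, which the pattern relies on; the sample text is copied from the question and the variable names are illustrative.

```ruby
# The pattern from this answer as a Ruby regex literal.
pattern = /\s+(\d+):\s+.*?(\d+\/\w+\/\d+)\s+.*?(\/.*?)\".*/

# Scraped entries from the question, joined into one multi-line string.
scraped_text = [
  '"   80:  0.20%:  2/Jan/14 21:01: /site/podcasts/audio/2013/podcast-07-15-2013.mp3", ',
  '"   71:  0.16%:  14/Jan/14 12:18: /site/podcasts/audio/2013/podcast-11-04-2013.mp3", '
].join("\n")

# scan yields [count, date, url] for every match; "." does not cross newlines
# by default, so each match stays within a single entry.
rows = scraped_text.scan(pattern).map do |count, date, url|
  "#{count}, #{date}, #{url}"
end

puts rows
```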

answered Aug 21, 2014 at 14:10

vks
