
I am trying to parse a raw email. The desired result is a hash of the lines that contain specific headers.

This is the Ruby file:

raw_email = File.open("sample-email.txt", "r")
parsed_email = Hash.new('')

raw_email.each do |line|
  puts line
  header = line.chomp(":")
  puts header
  if header == "Delivered-To"
    parsed_email[:to] = line
  elsif header == "From"
    parsed_email[:from] = line
  elsif header == "Date"
    parsed_email[:date] = line
  elsif header == "Subject"
    parsed_email[:subject] = line
  end
end

puts parsed_email

And this is the raw email:

Delivered-To: [email protected]
From: John Doe <[email protected]>
Date: Tue, 12 Dec 2017 13:30:14 -0500
Subject: Testing the parser
To: [email protected]
Content-Type: multipart/alternative; 
boundary="123456789abcdefghijklmnopqrs"

--123456789abcdefghijklmnopqrs
Content-Type: text/plain; charset="UTF-8"

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer nec 
odio. Praesent libero. Sed cursus ante dapibus diam. Sed nisi. Nulla 
quis sem at nibh elementum imperdiet. Duis sagittis ipsum.

--123456789abcdefghijklmnopqrs
Content-Type: text/html; charset="UTF-8"

<div dir="ltr">Lorem ipsum dolor sit amet, consectetur adipiscing 
elit. Integer nec odio. Praesent libero. Sed cursus ante dapibus diam. 
Sed nisi. Nulla quis sem at nibh elementum imperdiet. Duis sagittis 
ipsum.<br clear="all">
</div>

--089e082c24dc944a9f056028d791--

The puts statements are just for my own testing to see if data is being passed along.

What I am getting is each full line printed twice, followed by an empty hash at the end.

I have also tried changing different bits to strings or arrays, and I've tried using line.split(":", 1) instead of line.chomp(":").

Can someone please explain why this isn't working?

  • Chomp removes trailing characters (default is the newline). You want both: line.chomp.split(":") Commented Dec 13, 2017 at 13:28
  • I see. I was under the impression chomp would "split" at the last found supplied (or default) delimiter in a string and then drop everything after (including the delimiter). Commented Dec 13, 2017 at 14:05
  • BTW, your current approach is completely broken for folded header bodies. Commented Dec 13, 2017 at 16:42
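
To make the chomp/split difference from the comments concrete, here is a small sketch. The sample line and its address are made up for illustration; they are not taken from the email above.

```ruby
# Hypothetical sample line; the address is a placeholder.
line = "From: John Doe <john@example.com>\n"

# chomp(":") only removes a ":" if the string ENDS with it.
# This line ends with "\n", so nothing is removed.
chomped = line.chomp(":")

# chomp with no argument removes the trailing newline.
stripped = line.chomp

# A limit of 1 means "at most one piece", so the whole string
# comes back as a single-element array.
whole = line.split(":", 1)

# A limit of 2 cuts at the first colon only.
name, value = stripped.split(":", 2)

p chomped == line   # the header comparison never sees just "From"
p whole.length
p name
p value
```

This would explain both symptoms in the question: header is always the full line (so it is printed twice), and none of the comparisons ever match (so the hash stays empty).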

1 Answer

Try this:

parsed_email = {}

# File.foreach yields each line and closes the file when it is done.
File.foreach("sample-email.txt") do |line|
  # Everything before the first colon is the header name.
  case line.split(":")[0]
  when "Delivered-To"
    parsed_email[:to] = line
  when "From"
    parsed_email[:from] = line
  when "Date"
    parsed_email[:date] = line
  when "Subject"
    parsed_email[:subject] = line
  end
end

puts parsed_email
# => {:to=>"Delivered-To: [email protected]\n", :from=>"From: John Doe <[email protected]>\n", :date=>"Date: Tue, 12 Dec 2017 13:30:14 -0500\n", :subject=>"Subject: Testing the parser\n"}

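Explanation: chomp(":") only removes a ":" from the very end of the string, so header was always the whole line and none of the comparisons matched. Instead, split the line on ":" and take the first element, like this: line.split(":")[0]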
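
If you also want the value without the header name in front of it, one possible variation is to split with a limit of 2, so colons inside the value (as in the Date header) are left alone. This is only a sketch: WANTED and parse_header_line are names I made up for illustration, and the lookup matches header names by exact case, which real mail does not guarantee.

```ruby
# Hypothetical mapping from header names to the hash keys used above.
WANTED = {
  "Delivered-To" => :to,
  "From"         => :from,
  "Date"         => :date,
  "Subject"      => :subject
}.freeze

# Returns [key, value] for a wanted header line, or nil for anything else.
def parse_header_line(line)
  # Limit 2: cut at the first colon only, keep the rest of the line intact.
  name, value = line.chomp.split(":", 2)
  key = WANTED[name]
  return nil if key.nil? || value.nil?
  [key, value.strip]
end

p parse_header_line("Date: Tue, 12 Dec 2017 13:30:14 -0500\n")
p parse_header_line("Content-Type: text/plain\n")
```

Inside the loop you could then write something like `key, value = parse_header_line(line)` and store `parsed_email[key] = value` when key is not nil.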


4 Comments

Of course. I tried it exactly like that originally, but forgot that split returns an array so I had left off the [0]. That led me down a rabbit hole of confusion. Thank you so much. This is exactly what I was trying to do.
This being said, I recommend using a mail parsing gem and not doing it yourself.
LOL. I completely agree. Normally I would, but, I'm trying to learn how mail parsing works. ;)
Also: the scan should stop after the first blank line, which marks the end of the header block. This avoids scanning the full body (including any attachments), which can be large. The body could even contain matching fields, which would produce wrong results.
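
Putting the last comment together with the earlier one about folded headers, a rough sketch could look like the following. read_headers is a hypothetical helper, not a complete RFC 5322 parser: it assumes continuation lines start with a space or tab, keeps only the last value when a header repeats, and the sample message in the usage lines is made up.

```ruby
require "stringio"

# Reads only the header block from an IO-like object and returns
# a hash of header name => value. Hypothetical helper for illustration.
def read_headers(io)
  headers = {}
  current = nil
  io.each_line do |line|
    line = line.chomp
    break if line.empty?                  # first blank line ends the header block
    if line.start_with?(" ", "\t") && current
      # Folded header: append the continuation to the previous value.
      headers[current] << " " << line.strip
    else
      name, value = line.split(":", 2)
      next if value.nil?                  # no colon, not a header line
      current = name
      headers[current] = value.strip
    end
  end
  headers
end

# Made-up sample: a folded Subject, then a body line that looks like a header.
sample = StringIO.new(
  "Subject: Testing\r\n the parser\r\nFrom: a@example.com\r\n\r\nSubject: not a header\r\n"
)
p read_headers(sample)
```

Because the loop breaks at the blank line, the "Subject:" line in the body is never looked at, and the folded Subject comes back as a single value.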
