3

I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.

Here is the source text:

Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <[email protected]>
To: <[email protected]>, [email protected], [email protected]
Cc: <[email protected]>, [email protected]
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]

I'm looking to pull out the following:

<[email protected]>, [email protected], [email protected]

I've been struggling with regex all day without any luck.

3
  • 5
    Personally, I would recommend using a library dedicated to parsing MIME. Commented Apr 27, 2011 at 15:26
  • Brad, I do not have the entire message though, just the header string. I'm not sure MIME components will work with just this portion. Commented Apr 27, 2011 at 15:35
  • @Brad Christine, given the upvotes on your comment, you should post this as an answer ;) Commented Apr 27, 2011 at 17:52

5 Answers

6

Contrary to some of the posts here, I have to agree with mmutz: you cannot parse emails with a regex. See RFC 2822, section 3.4.1:

https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1

3.4.1. Addr-spec specification

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.

The idea of "locally interpreted" means that only the receiving server is expected to be able to parse it.

If I were going to try and solve this I would find the "To" line contents, break it apart and attempt to parse each segment with System.Net.Mail.MailAddress.

    // Requires: using System; using System.Text.RegularExpressions;
    static void Main()
    {
        string input = @"Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: ""Lastname, Firstname"" <[email protected]>
To: <[email protected]>, ""Yes, this is valid""@[emails are hard to parse!], [email protected], [email protected]
Cc: <[email protected]>, [email protected]
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]";

        Regex toline = new Regex(@"(?im-:^To\s*:\s*(?<to>.*)$)");
        string to = toline.Match(input).Groups["to"].Value;

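        int from = 0;   // where the next comma search starts
        int pos = 0;    // start of the candidate address currently being assembled
        int found;
        string test;

        while(from < to.Length)
        {
            // Find the next comma; if there is none, take the rest of the line.
            found = (found = to.IndexOf(',', from)) > 0 ? found : to.Length;
            from = found + 1;
            test = to.Substring(pos, found - pos);

            try
            {
                // Throws FormatException if the candidate is not a valid address,
                // e.g. when the split landed on a comma inside a quoted string.
                System.Net.Mail.MailAddress addy = new System.Net.Mail.MailAddress(test.Trim());
                Console.WriteLine(addy.Address);
                pos = found + 1;
            }
            catch (FormatException)
            {
                // Not valid yet: keep pos and extend the candidate to the next comma.
            }
        }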
    }

Output from the above program:

[email protected]
"Yes, this is valid"@[emails are hard to parse!]
[email protected]
[email protected]

4 Comments

this looks very promising...doing some unit testing right now.
@Blindy Yea, very "right-ISH" I agree. Without a library it's hopefully 'good-enough'.
Yep I think 'good enough' is the right term. I'm going to log every request, and mark any messages that don't parse so I can re-evaluate after some volume.
@csharptest.net I've been using this code since 2017 without problems, but all of a sudden my IDE started complaining about the regex: 'Option character' expected. The problem is the ?im-: part. All modes following the - sign are turned off, but there are none in your expression. IMO the only thing that makes sense here is ?im (ignore case, multi-line mode), since the C# Regex defaults are case-sensitive and not multi-line. You could also do new Regex(@"(^To\s*:\s*(?<to>.*)$)", RegexOptions.IgnoreCase | RegexOptions.Multiline)
2

The RFC 2822-compliant email regex is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Just run it over your text and you'll get the email addresses.

Of course, there's always the option of not using regex where regex isn't the best option. But up to you!

2 Comments

BTW, your 'RFC' regex for emails does not handle quoted-string properly; it fails to match: "Yes, this is valid"@domain.com
"almost" RFC-compliant then I guess. Just goes to show, regex isn't the best tool for this :)
0

You cannot use regular expressions to parse RFC2822 mails, because their grammar contains a recursive production (off the top of my head, it was for comments (a (nested) comment)) which makes the grammar non-regular. Regular expressions (as the name suggests) can only parse regular grammars.

See also RegEx match open tags except XHTML self-contained tags for more information.

1 Comment

While you are right in an academic context, any Perl-style regex engine (C#'s implementation included) is more than a plain old regular expression parser. It's closer to a context-free grammar parser, which can indeed parse recursive parentheses. This is a case of technology outgrowing the name of the construct.
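
To illustrate the point about nesting, here is a minimal sketch that uses .NET's balancing groups to match a parenthesized comment with nested comments inside it. The class and method names (`NestedComments`, `FirstComment`) are made up for illustration, and the sketch ignores backslash-escaped parentheses and quoted strings, so it is nowhere near a full RFC 2822 comment parser.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class NestedComments
{
    // Matches one parenthesized comment, including comments nested inside it.
    // The "depth" group is pushed on each inner open paren and popped on each inner close paren.
    private static readonly Regex Comment = new Regex(@"
        \(                   # outer opening parenthesis
        (?>
            [^()]+           # any run of non-parenthesis characters
          | \( (?<depth>)    # inner open paren: push onto the depth stack
          | \) (?<-depth>)   # inner close paren: pop, fails when the stack is empty
        )*
        (?(depth)(?!))       # fail if an inner open paren was never closed
        \)                   # outer closing parenthesis
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the first balanced comment in the text, or null if there is none.
    public static string FirstComment(string text)
    {
        Match m = Comment.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

Calling `NestedComments.FirstComment` on a string such as `Pete (A (nested) comment) <pete@example.com>` should return the whole outer comment, while an unbalanced input yields null.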
0

As Blindy suggests, sometimes you can just parse it out the old-fashioned way.

If you prefer to do that, here is a quick approach assuming the email header text is called 'header':

int start = header.IndexOf("To: ") + "To: ".Length; // skip the field name itself
int end = header.IndexOf("Cc: ", start);
string x = header.Substring(start, end - start).Trim(); // Trim drops the trailing line break

You can very easily test and modify this. Of course, you also have to be certain that the header always has a To: row with a Cc: row after it, or this won't work.
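
If the Cc: row cannot be relied on, one alternative is to walk the header line by line and pick the field by name. The following is only a sketch: `HeaderFields` and `GetValue` are hypothetical names, and it assumes each field sits on a single line (folded continuation lines are not handled).

```csharp
using System;

// Hypothetical helper for illustration only; not part of any library.
public static class HeaderFields
{
    // Returns the value of the first header line named `name`, or null if absent.
    // Assumes one field per line (no folded continuation lines).
    public static string GetValue(string header, string name)
    {
        string prefix = name + ":";
        foreach (string rawLine in header.Split('\n'))
        {
            // Tolerate both \r\n and \n line endings.
            string line = rawLine.TrimEnd('\r');
            if (line.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                return line.Substring(prefix.Length).Trim();
        }
        return null;
    }
}
```

With this, `HeaderFields.GetValue(header, "To")` gives the raw address list whether or not a Cc: row is present. Because the match is anchored at the start of a line, a field such as Thread-Topic: is not mistaken for To:.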

0

There's a breakdown of validating emails with regex here, which references a more practical implementation of RFC 2822 with:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

It also looks like you only want the email addresses out of the "To" field, and you've got the <> to worry about as well, so something like the following would likely work:

^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)

Again, as others have mentioned, you might not want to do this. But if you want regex that will turn that input into <[email protected]>, [email protected], [email protected], that'll do it.
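
For what it's worth, here is a minimal sketch of how the practical pattern above might be applied in C# to get the individual addresses from the To line. The names `ToLineAddresses` and `Extract` are made up for illustration, and the sketch inherits the pattern's limits: it will not match quoted local parts or domain literals.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class ToLineAddresses
{
    // Captures the rest of the line that starts with "To:".
    private static readonly Regex ToLine = new Regex(
        @"^To:[ \t]*(?<to>.*)$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // The "practical" address pattern quoted in this answer.
    private static readonly Regex Address = new Regex(
        @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?",
        RegexOptions.IgnoreCase);

    // Returns every address the pattern finds on the To line, in order.
    public static List<string> Extract(string header)
    {
        var result = new List<string>();
        Match line = ToLine.Match(header);
        if (!line.Success) return result;

        foreach (Match m in Address.Matches(line.Groups["to"].Value))
            result.Add(m.Value);
        return result;
    }
}
```

Restricting the search to the To line first keeps addresses from the From and Cc rows out of the result, and the angle brackets are dropped because they are not part of the address pattern.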
