I have a file that I'm parsing that ALWAYS includes an email address. The file is currently laid out with a leading space before the @ and we want to capture the domain.

foo @bar.com more data here
foo @foo.com more data here

We want to pull out @bar.com and @foo.com and I'm just starting to work with regex. I'm trying to pull the pattern "@ at the start of a word boundary inclusive of all following characters up until the next word boundary".

I've tried various iterations of the following, grouping things, square brackets for the @ literal... but nothing seems to work.

EDIT - actual code :

import java.util.regex.*;
import java.io.*;
import java.nio.file.*;
import java.lang.*;
//
public class eadd
{
    public static void main(String args[])
    {
        String inputLine = "foo foofoo foo foo @bar.com foofoofoo foo foo foo";
        String eDomain = "";
       // parse eadd
        Pattern p2 = Pattern.compile("(\\b@.*\\b)");
        Matcher m2 = p2.matcher(inputLine);
            if(m2.matches()) {
                eDomain = m2.group(1);
                } else {
                eDomain = "n/a";
            }
        System.out.println(p2+" "+m2+" "+eDomain);
    }
}

And the results when I run it.

(\b@.*\b) java.util.regex.Matcher[pattern=(\b@.*\b) region=0,49 lastmatch=] n/a

All of my problems seem to come down to whatever follows the @ being searched for as a literal instead of a pattern (e.g., looking for a literal .* rather than any and all characters). I can't find references to @ being a control character, so I don't think I need to escape it.

There are no similar examples in Oracle's Java tutorials or documentation, on SO, or in any of the online resources I checked; I've been unable to find other samples of how people have handled this. Like I said, I'm fairly new to regex, but this looks to me like it should work. What am I missing?

  • You must use "\\b", not "\b", which is a control character for backspace. The additional `\` is for escaping. Commented Aug 8, 2012 at 15:11
  • Veer is correct; you have to escape the escape character because both Java and regex use it to escape things. Commented Aug 8, 2012 at 15:11
  • @veer When I try that, it appears to be searching for a literal (\b@.*\b)...I'm printing p2 to the console to see the pattern & my matcher isn't getting any hits. Commented Aug 8, 2012 at 15:14
  • matches() tries to match the entire input against your regex - you're looking for a partial match and should use find(). Commented Aug 8, 2012 at 15:33
  • @JacobRaihle I must've missed that distinction in the tutorials. Thanks for that. Commented Aug 8, 2012 at 15:40
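The distinction raised in the comments between matches() and find() can be sketched as follows. This is a minimal illustration, not the asker's code: the class name MatchesVsFind, its helper methods, and the pattern "@\\S+" are hypothetical and chosen only for the demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the sample pattern are assumptions for illustration.
class MatchesVsFind {
    // matches() succeeds only when the ENTIRE input matches the pattern.
    static boolean wholeInputMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() scans the input for the first substring that matches; null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "foo @bar.com more data here";
        String regex = "@\\S+"; // '@' followed by one or more non-whitespace characters

        System.out.println(wholeInputMatches(regex, input)); // false: the input starts with "foo "
        System.out.println(firstMatch(regex, input));        // @bar.com
    }
}
```

With matches(), a pattern describing only part of the line fails unless it is padded to consume the whole line; find() is the usual choice when pulling one token out of a longer string.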

3 Answers

2

Java won't treat @ as a word character - thus there is no word boundary at the start of your address. You could replace the word boundary with a simple whitespace match:

"\s(@.+?)\b"

(Or "\\s(@.+?)\\b" since this is Java) should do the trick. It looks for whitespace followed by @ and matches until the next word boundary.

Edit: Oops, ., just like @, isn't a word character (duh). Use

"\\s(@.+?)(?:\\s|$)"

to match until the next whitespace or EOF. (?:\\s|$) is a non-capturing group that will match any whitespace or end of input.
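Putting that pattern together with find() and a capturing group might look like the sketch below. The class name DomainExtractor and the method extractDomain are hypothetical, and the "n/a" fallback simply mirrors the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; only the regex itself comes from the answer above.
class DomainExtractor {
    // Whitespace, then a captured lazy run starting with '@',
    // then a non-capturing group for whitespace or end of input.
    private static final Pattern DOMAIN = Pattern.compile("\\s(@.+?)(?:\\s|$)");

    // Returns the captured domain (group 1), or "n/a" when the line has none.
    static String extractDomain(String line) {
        Matcher m = DOMAIN.matcher(line);
        return m.find() ? m.group(1) : "n/a";
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("foo foofoo foo foo @bar.com foofoofoo foo foo foo")); // @bar.com
        System.out.println(extractDomain("foo @foo.com"));   // @foo.com (terminated by end of input)
        System.out.println(extractDomain("no domain here")); // n/a
    }
}
```

Because the surrounding whitespace sits outside group 1, group(1) returns just the domain without any trimming.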

4 Comments

for the first simple solution that fits the problem. if you KNOW that this is the format of emails you will be getting then why complicate?
@Eugene Right, domains are scary beasts but if they'll always be surrounded by whitespace it makes your life a lot easier :)
@Jacob OK. I get it. That makes a lot of sense! One quick clarification: Does regex (or java) require (@+wildcard) to be grouped separately from (wildcard-space at End) or is that just for better readability?
I separated the groups because you're not interested in the whitespace - this way only the domain itself is in a capturing group while the whitespace is ignored once the match has been confirmed. A capturing group is one you can access with Matcher#group(int) - putting ?: at the start of a group makes it non-capturing.
2
Pattern p = Pattern.compile("(@(?:[a-zA-Z0-9_]+)\\.(?:[a-zA-Z]+))");

This should work for you.

This regex starts by looking for the @. After that, it looks for any word followed by a ".", followed by another word. To become familiar with the syntax, you can take a look at this.
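One detail worth checking with a character-class approach: two adjacent classes such as [a-z][A-Z]+ mean "one lowercase letter, then uppercase letters", which is not the same as a single merged class [a-zA-Z]+. The sketch below compares the two. The class name CharClassDomain and both patterns are illustrative assumptions, and the merged variant also allows a hyphen in the first label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of merged vs. adjacent character classes.
class CharClassDomain {
    // Merged classes: letters, digits, underscore, hyphen; then a dot and a letters-only suffix.
    private static final Pattern MERGED = Pattern.compile("@[a-zA-Z0-9_-]+\\.[a-zA-Z]+");
    // Adjacent classes: exactly one lowercase letter, then one or more
    // uppercase letters, digits, or underscores (same shape after the dot).
    private static final Pattern ADJACENT = Pattern.compile("@[a-z][A-Z0-9_]+\\.[a-z][A-Z]+");

    private static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    static String mergedMatch(String input) { return firstMatch(MERGED, input); }

    static String adjacentMatch(String input) { return firstMatch(ADJACENT, input); }

    public static void main(String[] args) {
        String line = "foo @bar.com more data here";
        System.out.println(mergedMatch(line));                    // @bar.com
        System.out.println(adjacentMatch(line));                  // null: 'a' is not uppercase
        System.out.println(adjacentMatch("foo @bAR.cOM more"));   // @bAR.cOM
        System.out.println(mergedMatch("foo @my-host.org more")); // @my-host.org
    }
}
```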

6 Comments

Could you also explain what this regex is doing for the asker?
Don't you mean [a-z][A-Z] to match both upper and lower case? Also, what about the other legal characters permitted in domain names? Digits [0-9] ...
@HeatfanJohn Thanks, changed the pattern.
Personally I'd go with something less restrictive (as email addresses are notoriously loosely defined), more along the lines of \\b@([^\\b]+) (yes, I think you should even avoid requiring a full stop as I fully expect such email addresses to creep up with the new TLDs coming out).
hyphen is also permitted in urls
1

Try this: Pattern p = Pattern.compile("(?<=\\s)(@(?:bar|foo)\\.com\\b)");
or, as a general-purpose pattern: "(?<=\\s)(@\\w+(?:\\.\\w+)+\\b)"

Explanation:
(?<=\\s): lookbehind that requires whitespace immediately before the @
\\w: matches a letter, digit, or underscore
\\b: word boundary
@\\w+(?:\\.\\w+)+: matches @bar.com, @bar.com.au, @bar.com.xyz, @bar.foo.xx.yy.zz
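As a rough sketch of the general-purpose pattern in use, the following collects every domain in a line. The class name LookbehindDomains and the method findDomains are hypothetical. Note that \w does not include hyphens, so hyphenated domains would need a wider character class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the general-purpose pattern from this answer.
class LookbehindDomains {
    // The lookbehind requires whitespace before '@' without consuming it;
    // group 1 captures '@', a word, and one or more ".word" parts.
    private static final Pattern DOMAIN = Pattern.compile("(?<=\\s)(@\\w+(?:\\.\\w+)+\\b)");

    // Collects every domain found in the text, in order of appearance.
    static List<String> findDomains(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DOMAIN.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDomains("foo @bar.com more, baz @bar.com.au end")); // [@bar.com, @bar.com.au]
        System.out.println(findDomains("user@bar.com"));                          // []: no whitespace before '@'
    }
}
```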

1 Comment

There really is no need to use a look-behind; just let your desired result be a group and extract it that way.
