Java: Regex pattern not working

Question

I have a file that I'm parsing that ALWAYS includes an email address. The file is currently laid out with a leading space before the @ and we want to capture the domain.

foo @bar.com more data here
foo @foo.com more data here

We want to pull out @bar.com and @foo.com and I'm just starting to work with regex. I'm trying to pull the pattern "@ at the start of a word boundary inclusive of all following characters up until the next word boundary".

I've tried various iterations of the following, grouping things, square brackets for the @ literal...but nothing seems to work.

EDIT - actual code :

import java.util.regex.*;
import java.io.*;
import java.nio.file.*;
import java.lang.*;
//
public class eadd
{
    public static void main(String args[])
    {
        String inputLine = "foo foofoo foo foo @bar.com foofoofoo foo foo foo";
        String eDomain = "";
        // parse eadd
        Pattern p2 = Pattern.compile("(\\b@.*\\b)");
        Matcher m2 = p2.matcher(inputLine);
        if (m2.matches()) {
            eDomain = m2.group(1);
        } else {
            eDomain = "n/a";
        }
        System.out.println(p2+" "+m2+" "+eDomain);
    }
}

And the results when I run it.

(\b@.*\b) java.util.regex.Matcher[pattern=(\b@.*\b) region=0,49 lastmatch=] n/a

All of my problems have been related to whatever follows the @ being searched as a literal instead of a pattern (e.g., looking for a literal .* rather than any and all characters). I can't find references to @ being a control character, so I don't think I need to escape it.

There are no similar examples in Oracle's Java tutorials or documentation, on SO, or in any of the online resources I checked; I've been unable to find other samples of how people have handled this. Like I said, I'm fairly new to regex, but this looks to me like it should be working. What am I missing?

You must use "\\b" not "\b" -- which is a control character for backspace. The additional `\` is for escaping.
– obataku, Commented Aug 8, 2012 at 15:11
Veer is correct; you have to escape the escape character because both Java and regex use it to escape things.
– BlackVegetable, Commented Aug 8, 2012 at 15:11
@veer When I try that, it appears to be searching for a literal (\b@.*\b)...I'm printing p2 to the console to see the pattern & my matcher isn't getting any hits.
– dwwilson66, Commented Aug 8, 2012 at 15:14
matches() tries to match the entire input against your regex - you're looking for a partial match and should use find().
– Jacob is on Codidact, Commented Aug 8, 2012 at 15:33
@JacobRaihle I must've missed that distinction in the tutorials. Thanks for that.
– dwwilson66, Commented Aug 8, 2012 at 15:40
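
To make the matches() versus find() distinction from the comments concrete, here is a minimal sketch. The class name MatchVsFind and the simplified pattern @\\S+ (an @ followed by non-whitespace characters) are assumptions made for illustration; they are not taken from the asker's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchVsFind {
    // Hypothetical, simplified pattern: '@' followed by one or more non-whitespace characters.
    static final Pattern DOMAIN = Pattern.compile("@\\S+");

    // matches() succeeds only if the pattern consumes the entire input.
    static boolean matchesWholeInput(String input) {
        return DOMAIN.matcher(input).matches();
    }

    // find() scans the input for the first substring the pattern matches.
    static String findFirst(String input) {
        Matcher m = DOMAIN.matcher(input);
        return m.find() ? m.group() : "n/a";
    }

    public static void main(String[] args) {
        String line = "foo foofoo foo foo @bar.com foofoofoo foo foo foo";
        System.out.println(matchesWholeInput(line));       // false
        System.out.println(matchesWholeInput("@bar.com")); // true
        System.out.println(findFirst(line));               // @bar.com
    }
}
```

With matches(), the same pattern would have to be wrapped as something like .*(@\\S+).* to cover the whole line, which is why find() is the more natural fit for pulling one token out of a longer string.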

Jacob is on Codidact · Accepted Answer · 2012-08-08 15:30:57Z

2

Java doesn't treat @ as a word character, so there is no word boundary between the preceding space and the @ at the start of your address. You could replace the word boundary with a simple whitespace match:

"\s(@.+?)\b"

(Or "\\s(@.+?)\\b" since this is Java) should do the trick. It looks for whitespace followed by @ and matches until the next word boundary.

Edit: Oops, ., just like @, isn't a word character (duh), so the lazy match above stops at the word boundary before the first dot and captures only @bar. Use

"\\s(@.+?)(?:\\s|$)"

to match until the next whitespace or the end of input. (?:\\s|$) is a non-capturing group that will match any whitespace or end of input.
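
As a rough sketch of how the two patterns in this answer behave with find(), the following compares them on one of the sample lines. The class name DomainExtractor and its helper methods are hypothetical names introduced for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainExtractor {
    // First attempt: lazy match up to the next word boundary.
    static final Pattern UNTIL_BOUNDARY = Pattern.compile("\\s(@.+?)\\b");
    // Revised pattern: lazy match up to the next whitespace or end of input.
    static final Pattern UNTIL_SPACE = Pattern.compile("\\s(@.+?)(?:\\s|$)");

    // Returns capturing group 1 of the first match, or "n/a" if nothing matches.
    static String extract(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : "n/a";
    }

    // \b needs a word character on exactly one side, so it cannot match between ' ' and '@'.
    static boolean hasBoundaryBeforeAt(String line) {
        return Pattern.compile("\\b@").matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "foo @bar.com more data here";
        System.out.println(extract(UNTIL_BOUNDARY, line));        // @bar
        System.out.println(extract(UNTIL_SPACE, line));           // @bar.com
        System.out.println(extract(UNTIL_SPACE, "foo @foo.com")); // @foo.com
        System.out.println(hasBoundaryBeforeAt(line));            // false
        System.out.println(hasBoundaryBeforeAt("foo@bar.com"));   // true
    }
}
```

The last two lines show why the original \\b@ never matched: a boundary exists before @ only when the preceding character is a word character, as in foo@bar.com.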

edited Aug 8, 2012 at 15:30

answered Aug 8, 2012 at 15:23

Jacob is on Codidact


4 Comments

Eugene Over a year ago

for the first simple solution that fits the problem. if you KNOW that this is the format of emails you will be getting then why complicate?

Jacob is on Codidact Over a year ago

@Eugene Right, domains are scary beasts but if they'll always be surrounded by whitespace it makes your life a lot easier :)

dwwilson66 Over a year ago

@Jacob OK. I get it. That makes a lot of sense! One quick clarification: Does regex (or java) require (@+wildcard) to be grouped separately from (wildcard-space at End) or is that just for better readability?

Jacob is on Codidact Over a year ago

I separated the groups because you're not interested in the whitespace - this way only the domain itself is in a capturing group while the whitespace is ignored once the match has been confirmed. A capturing group is one you can access with Matcher#group(int) - putting ?: at the start of a group makes it non-capturing.
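
A small sketch of the capturing versus non-capturing difference described above. GroupDemo and its helpers are hypothetical names, and the second pattern (with a capturing trailing group) is included only for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Trailing group is non-capturing: only the domain is numbered.
    static final Pattern NON_CAPTURING = Pattern.compile("\\s(@.+?)(?:\\s|$)");
    // Same pattern with a capturing trailing group, for comparison only.
    static final Pattern CAPTURING = Pattern.compile("\\s(@.+?)(\\s|$)");

    // Number of capturing groups defined by the pattern (group 0 is not counted).
    static int groupCount(Pattern pattern) {
        return pattern.matcher("").groupCount();
    }

    // Returns the requested group of the first match, or null if there is no match.
    static String group(Pattern pattern, String line, int index) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String line = "foo @bar.com more data here";
        System.out.println(groupCount(NON_CAPTURING));                    // 1
        System.out.println(groupCount(CAPTURING));                        // 2
        System.out.println("[" + group(NON_CAPTURING, line, 0) + "]");    // [ @bar.com ]
        System.out.println("[" + group(NON_CAPTURING, line, 1) + "]");    // [@bar.com]
        System.out.println("[" + group(CAPTURING, line, 2) + "]");        // [ ]
    }
}
```

Group 0 is always the whole match, including the surrounding whitespace, so the parentheses around @.+? are what let you retrieve the domain on its own.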

oopbase · Answer · 2012-08-08 15:19:27Z

2

Pattern p = Pattern.compile("(@(?:[a-zA-Z0-9_]+)\\.(?:[a-zA-Z]+))");

This should work for you.

This regex starts by looking for the @. After that, it looks for any word followed by a ".", followed by another word. To get familiar with the syntax, you can take a look at this.
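
A hedged sketch of this approach, assuming the character classes are meant to be [a-zA-Z0-9_] before the dot and [a-zA-Z] after it, and that the pattern is applied with find(). SimpleDomainPattern is a hypothetical class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleDomainPattern {
    // Assumed character classes: letters, digits, underscore before the dot; letters after it.
    static final Pattern DOMAIN =
            Pattern.compile("(@(?:[a-zA-Z0-9_]+)\\.(?:[a-zA-Z]+))");

    // Returns the first "@word.word" found in the line, or "n/a" if there is none.
    static String extract(String line) {
        Matcher m = DOMAIN.matcher(line);
        return m.find() ? m.group(1) : "n/a";
    }

    public static void main(String[] args) {
        System.out.println(extract("foo @bar.com more data here")); // @bar.com
        System.out.println(extract("foo @bar.com.au more"));        // @bar.com
        System.out.println(extract("foo @my-site.com more"));       // n/a
    }
}
```

The last two calls show the limits of spelling out the allowed characters: only one dot-separated label after the first is captured, and a hyphenated domain is not matched at all.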

edited Aug 8, 2012 at 15:19

answered Aug 8, 2012 at 15:14

oopbase


6 Comments

BlackVegetable Over a year ago

Could you also explain what this regex is doing for the asker?

HeatfanJohn Over a year ago

Don't you mean [a-zA-Z] to match both upper and lower case? Also, what about the other legal characters permitted in domain names? Digits [0-9] ...

oopbase Over a year ago

@HeatfanJohn Thanks, changed the pattern.

Vala Over a year ago

Personally I'd go with something less restrictive (as email addresses are notoriously loosely defined), more along the lines of \\b@([^\\b]+) (yes, I think you should even avoid requiring a full stop as I fully expect such email addresses to creep up with the new TLDs coming out).

Manimaran Selvan Over a year ago

hyphen is also permitted in urls

godspeedlee · Answer · 2012-08-08 15:21:32Z

1

Try this: Pattern p = Pattern.compile("(?<=\\s)(@(?:bar|foo)\\.com\\b)");
or a general-purpose pattern: "(?<=\\s)(@\\w+(?:\\.\\w+)+\\b)"

Explanation:
(?<=\\s): lookbehind that requires a whitespace character immediately before the @
\\w: matches a letter, digit, or underscore
\\b: word boundary
@\\w+(?:\\.\\w+)+: matches @bar.com, @bar.com.au, @bar.com.xyz, @bar.foo.xx.yy.zz
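
The following sketch exercises the general-purpose pattern above with find(). LookbehindDemo and extract are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // General-purpose pattern from the answer: whitespace lookbehind, then @label(.label)+
    static final Pattern GENERAL = Pattern.compile("(?<=\\s)(@\\w+(?:\\.\\w+)+\\b)");

    // Returns the first match, or "n/a" if the pattern is not found.
    static String extract(String line) {
        Matcher m = GENERAL.matcher(line);
        return m.find() ? m.group(1) : "n/a";
    }

    public static void main(String[] args) {
        System.out.println(extract("foo @bar.com more data here"));   // @bar.com
        System.out.println(extract("foo @bar.foo.xx.yy.zz more"));    // @bar.foo.xx.yy.zz
        // The lookbehind needs a whitespace character before '@',
        // so an '@' at the very start of the input is not matched.
        System.out.println(extract("@bar.com more data here"));       // n/a
        System.out.println(extract("foo@bar.com more data here"));    // n/a
    }
}
```

Because the lookbehind consumes no characters, group(0) and group(1) are identical here; the leading whitespace is checked but never becomes part of the match.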

answered Aug 8, 2012 at 15:21

godspeedlee


1 Comment

Vala Over a year ago

There really is no need to use a look-behind, just let your desired result be a group and extract it that way.
