Java Pattern Regular Expression

Question

Sample code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. Lakshmi and you. Ram and you. You are Mike. ";
        Pattern pattern = Pattern.compile("(?<=\\.\\s)(.*?are.*?)(?=\\.\\s)");
        Matcher matcher = pattern.matcher(data);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

Desired output :

You are 2.3 km away from home

You are Mike

But the real output was

Shyam and you. You are 2.3 km away from home

Lakshmi and you. Ram and you. You are Mike

Please help.

Is there a reason that "You are 2.3 km away from home." occurs twice in the input and only once in the desired output?
– Thomas, Commented Aug 27, 2013 at 15:30
Tip: ^ and $ let you match the beginning and the end of a string.
– Arnaud Denoyelle, Commented Aug 27, 2013 at 15:31
@Thomas Oh sorry! That was my mistake. Thanks for correcting me :)
– user2722117, Commented Aug 27, 2013 at 15:38

Thomas · Accepted Answer · 2013-08-28 15:58:29Z

2

Your lookbehind is already satisfied right after the first dot, and .*? matches dots as well. Thus the match starts at "Shyam and you..." and runs on until it reaches "are". Try changing (.*?are.*?) to ([^\\.]*?are[^\\.]*?) so that the group matches any character except the dot.
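As a rough sketch of that suggestion (the class and method names here are made up for illustration), the following applies the dot-excluding character class to the question's sample string. Because the decimal point in 2.3 is also a dot, this version should find only the last sentence there, while a string without decimals behaves as intended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotFreeMatch {
    // Same lookarounds as in the question, but the lazy parts can no longer cross a dot.
    private static final Pattern SENTENCE =
            Pattern.compile("(?<=\\.\\s)([^.]*?are[^.]*?)(?=\\.\\s)");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> findSentences(String data) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(data);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. "
                + "Lakshmi and you. Ram and you. You are Mike. ";
        // "2.3" contains a dot, so that sentence is skipped by this pattern.
        System.out.println(findSentences(data)); // [You are Mike]

        String noDecimals = ". Shyam and you. You are far away from home. You are Mike. ";
        System.out.println(findSentences(noDecimals));
        // [You are far away from home, You are Mike]
    }
}
```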

Note that you could also simplify your expression to \s*([^\.]*are[^\.]*) (non-Java notation here). This would have the same result but would also match "You are Shyam. You are Mike.".

This expression matches any sequence of non-dot characters with an "are" somewhere in it, preceded by optional whitespace. Note that it also matches "are" on its own, so you might want to change [^\.]* to [^\.]+.
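To make the difference between the * and + variants concrete, here is a small sketch. The input string and the findAll helper are invented for this illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarVersusPlus {
    // Hypothetical helper: returns group 1 of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Stop. are. You are Mike.";

        // With *, a bare "are" between two dots is a valid match.
        System.out.println(findAll("\\s*([^.]*are[^.]*)", input)); // [are, You are Mike]

        // With +, at least one non-dot character is required on each side of "are".
        System.out.println(findAll("\\s*([^.]+are[^.]+)", input)); // [You are Mike]
    }
}
```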

Edit:

To account for your updated example, you could try this expression (a break down follows):

\s*((?:[^\.]|(?:\w+\.)+\w)*are.*?)(?:\.\s|\.$)

Input: I am here. You are almost 2.3 km away from home. You are Mike. You are 2. 2.3 percent of them are 2.3 percent of all. Sections 2.3.a to 2.3.c are 3 sections. This is garbage.

Output: You are almost 2.3 km away from home, You are Mike, You are 2, 2.3 percent of them are 2.3 percent of all, Sections 2.3.a to 2.3.c are 3 sections

A few notes: this requires each sentence to end with a dot (this could be changed by replacing \.\s|\.$ with [.!?]\s|[.!?]$) and each delimiting dot to be followed by either whitespace or the end of the input. It would not match "You are J. J. Abrams" or "2.a".

Note that in that case it is really hard for the computer to determine the end of the sentence, especially with "simple" regex.

Expression break down:

\s* leading whitespace, kept out of the captured group; otherwise this part is not needed
((?:[^\.]|(?:\w+\.)+\w)*are.*?) the captured group, containing the "are" and the additional text before and after it
- (?:[^\.]|(?:\w+\.)+\w) a non-capturing group matching either a non-dot character ([^\.], repeated by the * that follows the group) or (|) a sequence of word characters (\w as a shortcut for [a-zA-Z0-9_]) with single dots in between ((?:\w+\.)+\w, whose inner group is also non-capturing)
- .*? any sequence of characters, but with a lazy modifier to match the shortest possible sequence instead of the longest (without it, the next part wouldn't make much sense)
(?:\.\s|\.$) a non-capturing group that must follow the captured group; it must match either a dot followed by whitespace (\.\s) or (|) a dot at the end of the input (\.$)
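The expression broken down above can be tried in Java roughly as follows. This is only a sketch: the class name, the helper and the shortened input are assumptions made for illustration, and the backslashes are doubled because the pattern sits in a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecimalAwareMatch {
    // The expression from the answer, in Java string notation.
    private static final Pattern SENTENCE = Pattern.compile(
            "\\s*((?:[^\\.]|(?:\\w+\\.)+\\w)*are.*?)(?:\\.\\s|\\.$)");

    // Hypothetical helper: collects the captured sentence of every match.
    static List<String> findSentences(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "I am here. You are almost 2.3 km away from home. You are Mike.";
        // The dot in "2.3" is not followed by whitespace, so it does not end the sentence.
        for (String sentence : findSentences(input)) {
            System.out.println(sentence);
        }
        // You are almost 2.3 km away from home
        // You are Mike
    }
}
```

The alternation inside a repeated group is fine for short inputs like this one; on long inputs it may hit the overflow problem reported in the comments.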

Edit 2:

Here's a not thoroughly tested version without a (A|B)* group:

\s*([^.]*(?:(?:\w+\.)+\w+[^.]*)*are.*?)(?:[.!?]\s|[.!?]$)

Basically (?:[^\.]|(?:\w+\.)+\w)* has been replaced with [^.]*(?:(?:\w+\.)+\w+[^.]*)*, which means "any sequence of non-dot characters followed by any number of sequences consisting of dots surrounded by word characters and followed by any sequence of non-dot characters". ;)
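A minimal sketch of how this second version might be used from Java follows. The class name, helper and sample input are assumptions for illustration; the last sentence ends with a question mark to exercise the [.!?] terminator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoAlternationMatch {
    // The "Edit 2" expression in Java string notation (no (A|B)* group).
    private static final Pattern SENTENCE = Pattern.compile(
            "\\s*([^.]*(?:(?:\\w+\\.)+\\w+[^.]*)*are.*?)(?:[.!?]\\s|[.!?]$)");

    // Hypothetical helper: collects the captured sentence of every match.
    static List<String> findSentences(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "I am here. You are almost 2.3 km away from home. You are Mike?";
        System.out.println(findSentences(input));
        // [You are almost 2.3 km away from home, You are Mike]
    }
}
```

Since [^.] also matches "!" and "?", text before the "are" can still run across those two terminators; only dots are treated specially in that part of the pattern.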

edited Aug 28, 2013 at 15:58

answered Aug 27, 2013 at 15:16

Thomas


5 Comments

user2722117 Over a year ago

I have edited my question to change the example string. Please have a look into it. I think you can solve my problem :)

user2722117 Over a year ago

That's all I wanted :) Thanks Thomas :)

user2722117 Over a year ago

I tried your regex on my original data (approx. 800 sentences). It resulted in an overflow error. Searching for the cause, I learned that patterns like (A|B)* in a regex cause this error. Is there any way to write the regex without one?

Thomas Over a year ago

@user2722117 I'll think about it but there might not be an easy solution. However, you could try and split your input into individual sentences first and then apply the regex to each of them.

user2722117 Over a year ago

Thank you :) If there is no solution, I will move on with splitting the input string :)
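The splitting approach suggested in the comments above could look roughly like this. It is a sketch under assumptions, not the answerer's code: sentences are taken to end with ".", "!" or "?" followed by whitespace, "are" is checked as a whole word with \b, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitThenMatch {
    // Assumed sentence boundary: whitespace that directly follows ., ! or ?
    private static final Pattern SENTENCE_BREAK = Pattern.compile("(?<=[.!?])\\s+");
    // Assumed filter: the standalone word "are"
    private static final Pattern HAS_ARE = Pattern.compile("\\bare\\b");

    static List<String> sentencesWithAre(String text) {
        List<String> result = new ArrayList<>();
        for (String sentence : SENTENCE_BREAK.split(text.trim())) {
            if (HAS_ARE.matcher(sentence).find()) {
                // Drop the trailing terminator so only the sentence text remains.
                result.add(sentence.replaceAll("[.!?]+$", ""));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. "
                + "Lakshmi and you. Ram and you. You are Mike. ";
        for (String sentence : sentencesWithAre(data)) {
            System.out.println(sentence);
        }
        // You are 2.3 km away from home
        // You are Mike
    }
}
```

Each sentence is then matched with a very simple pattern, which avoids running one deeply backtracking expression over the whole input.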

loscuropresagio · Accepted Answer · 2013-08-27 15:19:15Z

0

Try this regex:

"[\\. ]([^\\. ]* are [^\\. ]*)[\\. ]"
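A quick check of this pattern against the question's sample string (class and helper names are made up for illustration) suggests that, because both character classes exclude spaces as well as dots, the group can hold only one word on each side of " are ".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleWordMatch {
    private static final Pattern PATTERN =
            Pattern.compile("[\\. ]([^\\. ]* are [^\\. ]*)[\\. ]");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> findAll(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. "
                + "Lakshmi and you. Ram and you. You are Mike. ";
        // The first match is cut off at the dot in "2.3".
        System.out.println(findAll(data)); // [You are 2, You are Mike]
    }
}
```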

answered Aug 27, 2013 at 15:19

loscuropresagio

3 Comments

user2722117 Over a year ago

E.g.: "You and jm. You are the 2.3 km away from home. You can do that." I want the regex to work with this example as well.

AJMansfield Over a year ago

@user2722117 It is really unclear what you are trying to say in your comment. Put stuff in quotes or code ticks to delimit your additional example.

user2722117 Over a year ago

@AJMansfield Yeah, I feel the same about my previous comment. Anyway, I have updated my question with that example :)

Community · Accepted Answer · 2017-02-08 14:43:57Z

0

You can try the regular expression:

You are (\d+(\.\d+)?|\w+| )*

Regular expression visualization

e.g.:

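import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YouAreDemo {
    // "You are " followed by any run of numbers (optionally decimal), words and spaces
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("You are (\\d+(\\.\\d+)?|\\w+| )*");

    public static void main(String[] args) {
        String input = ". Shyam and you. You are 2.3 km away from home. Lakshmi and you. Ram and you. You are Mike. ";

        Matcher matcher = REGEX_PATTERN.matcher(input);
        while (matcher.find()) {
            // group() is the whole match, which stops at the sentence-ending dot
            System.out.println(matcher.group());
        }
    }
}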

Output:

You are 2.3 km away from home
You are Mike

edited Feb 8, 2017 at 14:43

CommunityBot

answered Aug 27, 2013 at 15:40

Paul Vargas

1 Comment

dimo414 Over a year ago

Nice visualization. For the benefit of future readers, this came from Debuggex.com.
