Sample code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. Lakshmi and you. Ram and you. You are Mike. ";
        Pattern pattern = Pattern.compile("(?<=\\.\\s)(.*?are.*?)(?=\\.\\s)");
        Matcher matcher = pattern.matcher(data);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

Desired output:

You are 2.3 km away from home

You are Mike

But the actual output was:

Shyam and you. You are 2.3 km away from home

Lakshmi and you. Ram and you. You are Mike

Please help.

  • See: stackoverflow.com/questions/1232220/… Commented Aug 27, 2013 at 15:16
  • Is there a reason that You are 2.3 km away from home. occurs twice in the input and only once in the desired output? Commented Aug 27, 2013 at 15:30
  • Tip: ^ and $ let you match the beginning and the end of a String Commented Aug 27, 2013 at 15:31
  • @ArnaudDenoyelle I don't see how your tip would help here. Commented Aug 27, 2013 at 15:35
  • @Thomas Oh sorry! That was my mistake. Thanks for correcting me :) Commented Aug 27, 2013 at 15:38

3 Answers

The lookbehind is already satisfied right after the first ". " in the input, and .*? matches dots as well, so the match runs from there up to the first are. Thus you get Shyam and you... as a match. Try changing (.*?are.*?) to ([^\\.]*?are[^\\.]*?) so that the group matches any character except a dot.

Note that you could also simplify your expression to \s*([^\.]*are[^\.]*) (non-Java notation here). This would have the same result but would also match "You are Shyam. You are Mike.".

This expression matches any sequence of non-dot characters that contains "are", preceded by optional whitespace. Note that it would also match are on its own, so you might want to change [^\.]* to [^\.]+.
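To see the effect of the first suggestion, here is a minimal sketch that keeps the asker's lookarounds and only swaps .*? for [^.]*?. The class name DotFreeMatch, the helper extract and the shortened input are assumptions made for illustration; they are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotFreeMatch {
    // The asker's lookarounds, with .*? replaced by [^.]*? so the group cannot cross a dot.
    private static final Pattern SENTENCE =
            Pattern.compile("(?<=\\.\\s)([^.]*?are[^.]*?)(?=\\.\\s)");

    // Hypothetical helper: collects group 1 of every match.
    static List<String> extract(String data) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(data);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input without decimal numbers.
        System.out.println(extract(". Shyam and you. You are Mike. Ram and you. "));
        // prints [You are Mike]
    }
}
```

Because the group can no longer contain a dot, a sentence with a decimal number such as 2.3 is skipped entirely by this pattern.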

Edit:

To account for your updated example, you could try this expression (a breakdown follows):

\s*((?:[^\.]|(?:\w+\.)+\w)*are.*?)(?:\.\s|\.$)

Input: I am here. You are almost 2.3 km away from home. You are Mike. You are 2. 2.3 percent of them are 2.3 percent of all. Sections 2.3.a to 2.3.c are 3 sections. This is garbage.

Output: You are almost 2.3 km away from home, You are Mike, You are 2, 2.3 percent of them are 2.3 percent of all, Sections 2.3.a to 2.3.c are 3 sections

A few notes: this requires each sentence to end with a dot (this could be changed by replacing \.\s|\.$ with [.!?]\s|[.!?]$) and each delimiting dot to be followed by either whitespace or the end of the input. It would also not match "You are J. J. Abrams" or "2.a".

Note that in that case it is really hard for the computer to determine the end of the sentence, especially with "simple" regex.

Expression breakdown:

  • \s* consumes leading whitespace so that it does not become part of the group; otherwise it is not needed
  • ((?:[^\.]|(?:\w+\.)+\w)*are.*?) The captured group, containing the are and additional text before and after
    • (?:[^\.]|(?:\w+\.)+\w) a non-capturing group matching either a single non-dot character ([^\.]) or (|) a sequence of word characters (\w is a shortcut for [a-zA-Z0-9_]) with single dots in between ((?:\w+\.)+\w, also non-capturing); the * after the group repeats it any number of times
    • .*? any sequence of characters but with a lazy modifier to match the shortest possible sequence instead of the longest (without it, the next part wouldn't make much sense)
  • (?:\.\s|\.$) a non-capturing group that must follow the captured group, it must either match a dot followed by whitespace (\.\s) or (|) a dot at the end of the input (\.$)
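The expression broken down above can be tried out with a short Java sketch. The pattern and the input string are taken from this answer; the class name AreSentences and the helper extract are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AreSentences {
    // The expression from the edit, with backslashes doubled for a Java string literal.
    private static final Pattern SENTENCE =
            Pattern.compile("\\s*((?:[^\\.]|(?:\\w+\\.)+\\w)*are.*?)(?:\\.\\s|\\.$)");

    // Hypothetical helper: returns the captured sentence (group 1) of every match.
    static List<String> extract(String input) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(input);
        while (matcher.find()) {
            sentences.add(matcher.group(1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "I am here. You are almost 2.3 km away from home. You are Mike. "
                + "You are 2. 2.3 percent of them are 2.3 percent of all. "
                + "Sections 2.3.a to 2.3.c are 3 sections. This is garbage.";
        // One captured sentence per line.
        for (String sentence : extract(input)) {
            System.out.println(sentence);
        }
    }
}
```

On an input of this size the pattern runs without trouble; very long inputs are a different matter, as the comments under this answer report.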

Edit 2:

Here's a not thoroughly tested version without a (A|B)* group:

\s*([^.]*(?:(?:\w+\.)+\w+[^.]*)*are.*?)(?:[.!?]\s|[.!?]$)

Basically (?:[^\.]|(?:\w+\.)+\w)* has been replaced with [^.]*(?:(?:\w+\.)+\w+[^.]*)*, which means "any sequence of non-dot characters followed by any number of sequences consisting of dots surrounded by word characters and followed by any sequence of non-dot characters". ;)
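A minimal sketch for trying this second version, here against the input from the question. The class name AreSentencesNoAlternation and the helper extract are assumptions for illustration; the pattern is the one given above, and it remains only lightly tested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AreSentencesNoAlternation {
    // The "Edit 2" expression: no (A|B)* group in front of "are".
    private static final Pattern SENTENCE = Pattern.compile(
            "\\s*([^.]*(?:(?:\\w+\\.)+\\w+[^.]*)*are.*?)(?:[.!?]\\s|[.!?]$)");

    // Hypothetical helper: returns the captured sentence (group 1) of every match.
    static List<String> extract(String input) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(input);
        while (matcher.find()) {
            sentences.add(matcher.group(1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Input string from the question.
        String data = ". Shyam and you. You are 2.3 km away from home. "
                + "Lakshmi and you. Ram and you. You are Mike. ";
        for (String sentence : extract(data)) {
            System.out.println(sentence);
        }
    }
}
```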

5 Comments

  • I have edited my question to change the example string. Please have a look at it. I think you can solve my problem :)
  • That's all I wanted :) Thanks Thomas :)
  • I tried your regex on my original data (approx. 800 sentences). It resulted in an overflow error. After searching for it, I learned that patterns like (A|B)* in the regex cause the error. Is there any way to write the regex without one?
  • @user2722117 I'll think about it but there might not be an easy solution. However, you could try to split your input into individual sentences first and then apply the regex to each of them.
  • Thank you :) If there is no solution, I will move on with splitting the input string :)
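The splitting approach suggested in the comments could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class name SplitThenFilter, the rule that a sentence ends at a dot followed by whitespace or the end of the input, and the whole-word check \bare\b (the expressions above look for a plain are).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitThenFilter {
    // Assumed sentence delimiter: a dot followed by whitespace or by the end of the input.
    // The dot in "2.3" is followed by a digit, so it does not split the sentence.
    private static final Pattern SENTENCE_END = Pattern.compile("\\.(?:\\s+|$)");
    // Assumed filter: "are" as a whole word.
    private static final Pattern ARE = Pattern.compile("\\bare\\b");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        // Split first, then test each short sentence on its own.
        for (String sentence : SENTENCE_END.split(input)) {
            String trimmed = sentence.trim();
            if (ARE.matcher(trimmed).find()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String data = ". Shyam and you. You are 2.3 km away from home. "
                + "Lakshmi and you. Ram and you. You are Mike. ";
        for (String sentence : extract(data)) {
            System.out.println(sentence);
        }
    }
}
```

Each regex here only ever sees one delimiter or one short sentence, so no repeated group has to span the whole 800-sentence input.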

Try this regex:

"[\\. ]([^\\. ]* are [^\\. ]*)[\\. ]"

3 Comments

  • E.g.: "You and jm. You are the 2.3 km away from home. You can do that." I want the regex to work with this example as well.
  • @user2722117 It is really unclear what you are trying to say in your comment. Put stuff in quotes or code ticks to delimit your additional example.
  • @AJMansfield Yeah, I feel the same about my previous comment. Anyway, I have updated my question with that example :)

You can try the regular expression:

You are (\d+(\.\d+)?|\w+| )*

(Image: regular expression visualization)

e.g.:

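import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    // "You are " followed by any run of numbers (optionally decimal), words and spaces
    private static final Pattern REGEX_PATTERN =
            Pattern.compile("You are (\\d+(\\.\\d+)?|\\w+| )*");

    public static void main(String[] args) {
        String input = ". Shyam and you. You are 2.3 km away from home. Lakshmi and you. Ram and you. You are Mike. ";

        Matcher matcher = REGEX_PATTERN.matcher(input);
        while (matcher.find()) {
            // group() returns the whole match, not a single capture group
            System.out.println(matcher.group());
        }
    }
}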

Output:

You are 2.3 km away from home
You are Mike

1 Comment

Nice visualization. For the benefit of future readers, this came from Debuggex.com.
