java regex tricky pattern

Question

I've been stuck for a while on a regex that does the following:

Split my sentences with this: "[\W+]",
but if it finds a word like "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), that word isn't split; it is kept whole.

Basically, I want to split a sentence into words, while also treating "aaa-aa" as a single word. I have successfully done that by creating two separate functions, one for splitting with \W and another to find words like "aaa-aa". Finally, I combine both results and remove the parts of each compound word.

For example, the sentence:

"Hello my-name is Richard"

First I collect {Hello, my, name, is, Richard}, then I collect {my-name}, then I add {my-name} to {Hello, my, name, is, Richard}, and then I take {my} and {name} out of {Hello, my, name, is, Richard}. Result: {Hello, my-name, is, Richard}

This approach does what I need, but for parsing large files it becomes too heavy, because each sentence needs too many copies. So my question is: is there anything I can do to include everything in one pattern? Like:

"Split the text using the pattern "[\W+]", but if you find a word like "aaa-aa", consider it one word and not two."

Alan Burlison · Accepted Answer · 2011-10-02 13:50:32Z

0

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:

[\s-]{2,}|\s

To break that down, you first split on two or more whitespaces and/or hyphens, so a single '-' won't match and 'one-two' will be left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s).

Note that the order of the alternatives is important: RE subexpressions separated by '|' are evaluated left-to-right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
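
A minimal sketch of how that delimiter could be used with Pattern.split(). The class name HyphenSplitDemo and the helper words() are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class HyphenSplitDemo {
    // Delimiter from the answer: a run of two or more whitespace/hyphen
    // characters, or else a single whitespace character.
    private static final Pattern DELIMITER = Pattern.compile("[\\s-]{2,}|\\s");

    // Hypothetical helper: splits the text on the delimiter above.
    static List<String> words(String text) {
        return Arrays.asList(DELIMITER.split(text));
    }

    public static void main(String[] args) {
        System.out.println(words("Hello my-name is Richard")); // [Hello, my-name, is, Richard]
        System.out.println(words("one--two"));                 // [one, two]
        System.out.println(words("one - --- - two"));          // [one, two]
        System.out.println(words("one -two"));                 // [one, two]
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every sentence, which may matter when parsing large files.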

If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

edited Oct 2, 2011 at 13:50

answered Oct 2, 2011 at 13:41

Alan Burlison


2 Comments

recoInrelax Over a year ago

That works fine, but I want to exclude "word2car" as a word. Instead, "word" and "car" should be considered. What changes are necessary?

Alan Burlison Over a year ago

That depends - do you want all occurrences of '2' to be word separators, or just between specific words? If it is anywhere, you could change the RE to \B2\B|[\s-]{2,}|\s. The \B matches a non-word boundary, so it would split ' a2b ' but not ' 2nd '. However, that would also split numbers containing '2' into separate parts, which is probably not what you want. In that case, (?<=\p{L})2(?=\p{L})|[\s-]{2,}|\s is probably what you want - a '2' with an alphabetic, non-numeric character on either side of it.
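
The difference between the two variants can be sketched as follows. The class and method names (DigitSeparatorDemo, splitLoose, splitStrict) are hypothetical; the two regexes are the ones quoted in the comment above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class DigitSeparatorDemo {
    // Variant 1: '2' is a delimiter whenever it has word characters on both sides.
    private static final Pattern LOOSE =
            Pattern.compile("\\B2\\B|[\\s-]{2,}|\\s");

    // Variant 2: '2' is a delimiter only when a letter sits on both sides.
    private static final Pattern STRICT =
            Pattern.compile("(?<=\\p{L})2(?=\\p{L})|[\\s-]{2,}|\\s");

    static List<String> splitLoose(String text) {
        return Arrays.asList(LOOSE.split(text));
    }

    static List<String> splitStrict(String text) {
        return Arrays.asList(STRICT.split(text));
    }

    public static void main(String[] args) {
        System.out.println(splitLoose("word2car 123"));  // [word, car, 1, 3]
        System.out.println(splitStrict("word2car 123")); // [word, car, 123]
        System.out.println(splitStrict("the 2nd try"));  // [the, 2nd, try]
    }
}
```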

AlexR · Accepted Answer · 2011-10-02 11:47:31Z

0

Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.

answered Oct 2, 2011 at 11:47

AlexR

3 Comments

recoInrelax Over a year ago

I want, for example, "blue-sky" to be considered one word, and not two: {blue, sky}.

AlexR Over a year ago

OK, so splitting on spaces gives you what you need. Just try it.

recoInrelax Over a year ago

Actually, it doesn't :p I don't want to consider things like this: "--- --- --" or "aaaa--" or "aaaa--aaaa-aaa".

Zoltán · Accepted Answer · 2011-10-02 11:50:10Z

0

Your description isn't clear enough, but why not just split it up by spaces?

answered Oct 2, 2011 at 11:50

Zoltán

Alexey · Accepted Answer · 2011-10-02 12:37:12Z

0

I am not sure whether this pattern would work, because I don't have developer tools for Java, but you might try it. It uses character class subtraction, which is supported only in Java regex as far as I know:

[\W&&[^-]]+

It means: match characters that are both [\W] and [^-], that is, characters that are [\W] but not [-].
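
A small sketch for trying that pattern out (Java spells this as an intersection, &&, with a negated class). The class name IntersectionSplitDemo and the helper words() are assumptions for illustration. Because the hyphen is never a delimiter here, runs such as "aaa--aa" and a lone "-" come through unsplit, which may or may not be acceptable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class IntersectionSplitDemo {
    // One or more characters that are non-word characters AND not a hyphen.
    private static final Pattern DELIMITER = Pattern.compile("[\\W&&[^-]]+");

    // Hypothetical helper: splits the text on the delimiter above.
    static List<String> words(String text) {
        return Arrays.asList(DELIMITER.split(text));
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, my-name is Richard!")); // [Hello, my-name, is, Richard]
        System.out.println(words("aaa--aa - b"));                // [aaa--aa, -, b]
    }
}
```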

answered Oct 2, 2011 at 12:37

Alexey

Community · Accepted Answer · 2017-05-23 12:03:35Z

0

Almost the same regular expression as in your previous question:

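import java.util.regex.Matcher;
import java.util.regex.Pattern;

String sentence = "Hello my-name is Richard";
// A run of word characters, optionally followed by one hyphen and more word
// characters; the lookarounds keep a match from starting or ending mid-word.
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group()); // prints Hello, my-name, is, Richard (one per line)
}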

I just added the optional group (...)? so that non-hyphenated words are matched as well.

edited May 23, 2017 at 12:03

CommunityBot

answered Oct 2, 2011 at 11:49

Howard

3 Comments

recoInrelax Over a year ago

Worked marvelously. Thank you very much, this finally solved my nightmare.

recoInrelax Over a year ago

Could you do me a favour and update the code to use \\W instead of a-zA-Z? I also want to allow áíõ etc.

recoInrelax Over a year ago

I solved it by adding this: "À-ÿ", but it seems the pattern would run faster if we used a word class instead of listing all the letters. What do you think?
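
One possible way to handle accented letters without listing ranges by hand, sketched under the assumption that Java 7 or later is available: by default \w in Java covers only [a-zA-Z_0-9], but compiling with Pattern.UNICODE_CHARACTER_CLASS makes \w follow the Unicode definition of a word character. The class name UnicodeWordsDemo and the helper words() are hypothetical, and no claim is made here about relative speed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordsDemo {
    // Same pattern as in the answer, compiled so that \w also matches
    // accented letters (requires Java 7+).
    private static final Pattern WORD = Pattern.compile(
            "(?<!\\w)\\w+(-\\w+)?(?!\\w)", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: collects every match into a list.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only; the sentence
        // contains a-acute (u00e1) and e-acute (u00e9).
        String sentence = "Ol\u00e1 c\u00e9u-azul \u00e9 bonito";
        System.out.println(words(sentence)); // four words, with the hyphenated one kept whole
    }
}
```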
