I've been stuck for a while on a regex that should do the following:

  • split my sentences on this pattern: "[\W+]"
  • but if it finds a word like "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), keep the whole word instead of splitting it.

    Basically, I want to split a sentence into words while also treating "aaa-aa" as a single word. I have successfully done that with two separate functions: one that splits on \W, and another that finds words like "aaa-aa". Finally, I merge both results and remove the parts of each compound word.

    For example, the sentence:

    "Hello my-name is Richard"

    First I collect {Hello, my, name, is, Richard}, then I collect {my-name}. Next I add {my-name} to {Hello, my, name, is, Richard} and remove {my} and {name} from it. Result: {Hello, my-name, is, Richard}

    This approach does what I need, but it becomes too heavy when parsing large files, because each sentence requires too many copies. So my question is: is there anything I can do to put everything in one pattern? Something like:

    "Split the text using the pattern "[\W+]", but if you find a word like "aaa-aa", treat it as one word and not two."

5 Answers

0

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want:

[\s-]{2,}|\s

To break that down, you first split on two or more whitespaces and/or hyphens, so a single '-' won't match and 'one-two' will be left alone, but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace ('one two') unmatched, so we add an or ('|') followed by a single whitespace (\s).

Note that the order of the alternatives is important: RE subexpressions separated by '|' are evaluated left-to-right, so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
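To make the ordering point concrete, here is a minimal sketch that runs both orderings through String.split(). The class and constant names (SeparatorSplitDemo, HYPHEN_AWARE, REVERSED) are made up for illustration; only the two patterns come from the explanation above.

```java
import java.util.Arrays;

class SeparatorSplitDemo {
    // Runs of two or more whitespace/hyphen characters first, then a single whitespace.
    static final String HYPHEN_AWARE = "[\\s-]{2,}|\\s";
    // Same alternatives in the opposite order, only to show why the order matters.
    static final String REVERSED = "\\s|[\\s-]{2,}";

    // Hypothetical helper: thin wrapper so both patterns are called the same way.
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(split("Hello my-name is Richard", HYPHEN_AWARE)));
        // [one, two]
        System.out.println(Arrays.toString(split("one - --- - two", HYPHEN_AWARE)));
        // [one, two]
        System.out.println(Arrays.toString(split("one -two", HYPHEN_AWARE)));
        // [one, -two]  (the single whitespace wins before the longer run is tried)
        System.out.println(Arrays.toString(split("one -two", REVERSED)));
    }
}
```

Note that this only treats whitespace and hyphens as separators; other punctuation stays attached to the neighbouring word.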

If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

2 Comments

That works fine, but I want to exclude "word2car" as a word; "word" and "car" should be considered instead. What changes are necessary?
That depends: do you want all occurrences of '2' to be word separators, or just between specific words? If it is anywhere, you could change the RE to \B2\B|[\s-]{2,}|\s. The \B matches a non-word boundary, so it would split ' a2b ' but not ' 2nd '. However, that would also split numbers containing '2' into separate parts, which is probably not what you want. In that case, (?<=\p{L})2(?=\p{L})|[\s-]{2,}|\s is probably what you want: a '2' with an alphabetic, non-numeric character on either side of it.
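A quick way to see the difference between the two variants from this comment is to split one sample string with each. This is only a sketch: the class name, constant names and the sample input "word2car 2nd 123" are invented for the demonstration.

```java
import java.util.Arrays;

class DigitSeparatorDemo {
    // '2' is a separator whenever it sits between two word characters.
    static final String ANY_INNER_2 = "\\B2\\B|[\\s-]{2,}|\\s";
    // '2' is a separator only when it has a letter on each side.
    static final String BETWEEN_LETTERS = "(?<=\\p{L})2(?=\\p{L})|[\\s-]{2,}|\\s";

    // Hypothetical helper: thin wrapper around String.split().
    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String sample = "word2car 2nd 123";
        // [word, car, 2nd, 1, 3]  (the number 123 is broken apart)
        System.out.println(Arrays.toString(split(sample, ANY_INNER_2)));
        // [word, car, 2nd, 123]   (the number survives)
        System.out.println(Arrays.toString(split(sample, BETWEEN_LETTERS)));
    }
}
```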
0

Why not use the pattern \\s+? This does exactly what you want without any tricks: it splits the text into words separated by whitespace.

3 Comments

I want, for example, "blue-sky" to be considered one word, and not two: {blue, sky}.
OK, so splitting on spaces gives you what you need. Just try it.
Actually it doesn't :p I don't want to accept things like "--- --- --" or "aaaa--" or "aaaa--aaaa-aaa".
0

Your description isn't clear enough, but why not just split it up by spaces?

0

I am not sure whether this pattern would work, because I don't have developer tools for Java, but you might try it. It uses character class subtraction, which is supported only in Java regex as far as I know:

[\W&&[^-]]+

It means: match a character if it is in [\W] and also in [^-], that is, a non-word character that is not a hyphen.
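Since the pattern above was posted untested, here is a small sketch for trying it with String.split(). The class name and sample inputs are made up. It keeps "my-name" together, but a hyphen standing alone is never a separator here, so it comes back as its own token and would still need filtering.

```java
import java.util.Arrays;

class IntersectionSplitDemo {
    // One or more characters that are non-word characters AND not a hyphen.
    static final String NON_WORD_EXCEPT_HYPHEN = "[\\W&&[^-]]+";

    // Hypothetical helper: thin wrapper around String.split().
    static String[] split(String text) {
        return text.split(NON_WORD_EXCEPT_HYPHEN);
    }

    public static void main(String[] args) {
        // [Hello, my-name, is, Richard]
        System.out.println(Arrays.toString(split("Hello my-name is Richard")));
        // [aaa, -, aa]  (the lone hyphen is kept as a token)
        System.out.println(Arrays.toString(split("aaa - aa")));
    }
}
```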

0

Almost the same regular expression as in your previous question:

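import java.util.regex.Matcher;
import java.util.regex.Pattern;

String sentence = "Hello my-name is Richard";
// A run of word characters, optionally followed by one hyphen and a second run,
// with no word character directly before or after the match.
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group()); // prints Hello, my-name, is, Richard (one per line)
}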

I just added the optional group (...)? so that it also matches non-hyphenated words.

3 Comments

Worked marvelously. Thank you very much, this finally solved my nightmare.
Could you do me a favour and update the code to use \\w instead of a-zA-Z? I also want to allow áíõ etc.
I solved it by adding "À-ÿ", but it seems the pattern would run faster if we used a word class instead of listing all the letters. What do you think?
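On the accented-letter question, one alternative to listing "À-ÿ" by hand is to compile the same pattern with Java's Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w match Unicode letters as well (and still digits and underscore). This is a sketch rather than what the answer used, and it makes no claim about speed; the class name, method name and sample sentence are invented, and the accented characters are written as \u escapes so the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordsDemo {
    // Same pattern as in the answer, but \w now follows Unicode rules.
    static final Pattern WORD = Pattern.compile(
            "(?<!\\w)\\w+(-\\w+)?(?!\\w)", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: collects every match into a list.
    static List<String> words(String sentence) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(sentence);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text "Olá bem-vindo à água" yields four tokens,
        // with "bem-vindo" kept as one word.
        System.out.println(words("Ol\u00e1 bem-vindo \u00e0 \u00e1gua"));
    }
}
```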
