How can I get this pattern to work:

Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");

Basically, this splits my sentence on any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
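To illustrate the problem, here is a minimal sketch (the class name `NaiveSplitDemo` and the sample input are made up for illustration). Because the hyphen is itself a punctuation character, splitting on this character class also breaks hyphenated words apart:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class NaiveSplitDemo {
    // The pattern from the question: any punctuation or separator character.
    static final Pattern DELIMITER = Pattern.compile("[\\p{P}\\p{Z}]");

    static String[] naiveSplit(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        // Prints [aaa, bb, , cc]: the hyphenated word is split, and the
        // comma followed by a space leaves an empty string in between.
        System.out.println(Arrays.toString(naiveSplit("aaa-bb, cc")));
    }
}
```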

(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])

pattern explained here: Java regex patterns

That is, hyphenated words like these: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
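For reference, a small sketch of what the hyphenated-word pattern matches on its own, using `Matcher.find()` (the class name `HyphenatedWordDemo` and the sample input are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenatedWordDemo {
    // Letters joined by single hyphens, not touching any other letter or hyphen.
    static final Pattern HYPHENATED = Pattern.compile(
            "(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])");

    static List<String> findHyphenated(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = HYPHENATED.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [aaa-bb, aaa-bb-c-dd]; "--aa", "bb--c" and "plain" are skipped.
        System.out.println(findHyphenated("aaa-bb --aa bb--c aaa-bb-c-dd plain"));
    }
}
```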

  • [\\p{P}\\p{Z}^-] would be my guess Commented Oct 5, 2011 at 18:19
  • I don't want to allow this: "--aa", or "bb--c", etc. These two patterns work, I just need to combine the two. Commented Oct 5, 2011 at 18:24
  • Well, then... perhaps [\\p{P}\\p{Z}^(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])] Commented Oct 5, 2011 at 18:26
  • Sorry, your question is not very clear. You want to match any split between words that is not a hyphen? Commented Oct 5, 2011 at 18:44

1 Answer

Unfortunately it seems like you can't merge both expressions, at least as far as I know.

However, maybe you can reformulate your problem.

If, for example, you want to split between words (which can contain hyphens), try this expression:

(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)

This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by at least one non-letter character, or a minus that is the first or last character in the input.

Using this expression for a split should result in this:

input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
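A minimal Java sketch of that split, assuming `Pattern.split` is used (the class name `FirstSplitDemo` and method name are made up for illustration). Note the doubled backslashes needed in a Java string literal:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class FirstSplitDemo {
    // Delimiter: non-letter, non-minus runs; a minus followed by non-letters;
    // or a minus at the very start or end of the input.
    static final Pattern DELIMITER =
            Pattern.compile("(?>[^\\p{L}-]+|-[^\\p{L}]+|^-|-$)");

    static String[] splitWords(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        // Prints [aaa-bb, aaa-bb-cc, aaa-bb-c-dd, no, match, , foo]
        System.out.println(Arrays.toString(splitWords(input)));
    }
}
```

The empty string appears because ",--" is consumed as two adjacent delimiters: first "," and then "--".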

The regex might need some additional optimization but it is a start.

Edit: This expression should get rid of the empty string in the split:

(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)

The first part would now read as "any non-letter character that is not a minus, followed by any number of non-letter characters", so a sequence such as ".--" is matched as a single delimiter as well.
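A sketch of the revised expression applied to the same sample input (again, the class name `SecondSplitDemo` is hypothetical):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SecondSplitDemo {
    // First alternative now swallows any trailing non-letters, hyphens included.
    static final Pattern DELIMITER =
            Pattern.compile("(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)");

    static String[] splitWords(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        // Prints [aaa-bb, aaa-bb-cc, aaa-bb-c-dd, no, match, foo]
        System.out.println(Arrays.toString(splitWords(input)));
    }
}
```

Here ",--" is consumed as one delimiter, so no empty string is produced between "match" and "foo".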

Edit: in case you want to match words that could potentially contain hyphens, try this expression:

(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)

This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$))".
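A sketch of using this expression to extract words with `Matcher.find()` rather than splitting (the class name `WordMatchDemo` and the helper method are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatchDemo {
    // A word: letters, optionally joined by single hyphens, with no
    // letter or minus directly before or after it.
    static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        // Prints [aaa-bb, aaa-bb-cc, aaa-bb-c-dd]
        System.out.println(findWords(input));
    }
}
```

Unlike the split variants, this skips "no", "match" and "foo" entirely, because each of them directly touches a stray minus.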

2 Comments

Hm, thanks, that works. But I need to exclude the cases in which "-" comes first, like "-a", "-aa-bb", etc.
@user974594 so "-aa-bb" should not be split into "" and "aa-bb"? Could you then provide an example input and expected output? It is not so clear what you're actually trying to achieve.
