In the following POS-tagged sentence (and similar sentences), what regular expression can I use to capture only two-word noun-noun compounds (i.e. \p{Alnum}+_NN[PS]? \p{Alnum}+_NN[PS]?) while avoiding two-word matches that are part of larger phrases?

I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.

In particular, I would like to capture family_NN members_NNS but not sun_NN devil_NN or devil_NN auto_NN.

Currently I use the following regex with positive lookahead:

"(?=\\b([\\p{Alnum}]+)_(NN[SP]?)\\s([\\p{Alnum}]+)_(NN[SP]?)\\b)."

The problem is that, in addition to family_NN members_NNS, it also captures sun_NN devil_NN and devil_NN auto_NN.

1 Answer

You need both a lookahead and a lookbehind here.

Basically, you want, for some pattern P, that PP is matched if and only if there is not a P before or after it.

A crude way to do this, using the lookbehind and lookahead operators:

(?<!P)PP(?!P)

The (?<!...) and (?!...) constructs are, respectively, the negative lookbehind and negative lookahead assertions, where ... stands for a regex. They check the surrounding text without consuming it.
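To make the (?<!P)PP(?!P) shape concrete, here is a minimal sketch with a toy stand-in for P: the single letter "a", so that PP is "aa". The class and method names are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsolatedPairDemo {
    // Toy stand-in for P: the letter "a". PP is "aa", and the two negative
    // assertions reject any "aa" that has another "a" right before or after it.
    static final Pattern ISOLATED_PAIR = Pattern.compile("(?<!a)aa(?!a)");

    // Returns the start index of every "aa" that is not part of a longer run.
    static List<Integer> starts(String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = ISOLATED_PAIR.matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // "aaa" in the middle yields no match: each "aa" inside it touches a third "a".
        System.out.println(starts("xaax aaa aa")); // prints [1, 9]
    }
}
```

Because the lookbehind here is a single character, it has a fixed length, which sidesteps any question about variable-length lookbehind support.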

If we take P to be:

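[\p{Alnum}]+_NN[PS]?

then one sketch of a solution, allowing for whitespace between tokens, would look like:

import java.util.regex.Pattern;

// One noun token, e.g. "family_NN" or "members_NNS".
// Note the spelling: Java's POSIX class is \p{Alnum}, not \p{AlNum}.
private static final String P = "[\\p{Alnum}]+_NN[PS]?";
// Two noun tokens with no noun token directly before or after them.
private static final String RE = "(?<!" + P + ")"
    + "\\s+(" + P + "\\s+" + P + ")\\s+(?!" + P + ")";
private static final Pattern PATTERN = Pattern.compile(RE);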

This is only a sketch, however.

Given the complexity of the input, you will probably want to do more, so I am not sure that regexes are the tool you are really looking for in the end.
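One possible refinement of the sketch, under some assumptions: tokens are separated by a single whitespace character, and noun tags are limited to NN, NNS and NNP as in the question. Instead of repeating all of P inside the lookbehind, it checks only the tag suffix of the previous token (_NN, _NNS or _NNP followed by one whitespace character), so the lookbehind has a bounded length. Token boundaries are expressed with (?<!\S) and (?!\S) rather than mandatory \s+, which lets a pair at the very start or end of the sentence match. The names NounPairFinder and findPairs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NounPairFinder {
    // One noun token, e.g. "family_NN" or "members_NNS".
    private static final String P = "\\p{Alnum}+_NN[PS]?";

    private static final Pattern PAIR = Pattern.compile(
        "(?<!\\S)"                      // start at a token boundary or start of input
        + "(?<!_NN[PS]?\\s)"            // previous token must not end in a noun tag
        + "(" + P + "\\s+" + P + ")"    // the two-noun compound itself
        + "(?!\\S)"                     // end at a token boundary or end of input
        + "(?!\\s+" + P + "(?!\\S))");  // next token must not be a noun

    // Returns every two-token noun compound that is not part of a longer noun run.
    static List<String> findPairs(String tagged) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(tagged);
        while (m.find()) {
            pairs.add(m.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String sentence = "I_PRP will_MD never_RB go_VB to_IN sun_NN devil_NN auto_NN "
            + "again_RB but_CC my_PRP$ family_NN members_NNS will_MD ._.";
        System.out.println(findPairs(sentence)); // prints [family_NN members_NNS]
    }
}
```

If the input can contain several spaces between tokens, the \s in the lookbehind would need a bounded repetition such as \s{1,3}; that bound is a guess to adjust for your data.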

5 Comments

However, Java does not allow the + quantifier in a lookbehind (variable length), but it does allow finite repetition, e.g. (?<!\s\p{Alnum}{1,20}_NN[SP]?\s), if this would be sufficient.
@fge Thanks for your response. What if PP is in the beginning or at the end of the sentence?
@bobblebubble Are you sure? + seems to be working in the regex that I mentioned in my question.
@Meghi AFAIK, + or * won't work within a lookbehind in Java regex.
@fge RE doesn't capture PPs that occur at the beginning or at the end of a sentence because of \\s+. I tried to fix this by changing \\s+ to \\s*, but when I do so, sun_NN devil_NN is captured. How can I fix this?