
I have this string that I want to process:

rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/.

I want to extract the words di/IN jogja/NNP buat/VBT malioboro/NNP from that sentence. This is my code so far:

import re

def entityExtractPreposition(text):
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    return text

text = "rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/."
prepo = entityExtractPreposition(text)
print(prepo)

The result extracts too many words:

di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP

My expected result is:

di/IN jogja/NNP buat/VBT malioboro/NNP

I read in some references that there are quantifiers to limit repetition (in my case, of /NNP), such as *, + and ?. What is the best way to set or limit the number of repetitions in a regex?

  • What's the rule for extraction? Is it everything after the last word/IN item or... Commented Aug 19, 2017 at 6:30
  • @JonClements The rule is to take every word from the word/IN onwards until two word/NNP tokens have been matched. Commented Aug 19, 2017 at 6:31
  • So... the first/IN up to and including the second/NNP ? What if there's no NNP/not a second NNP? Commented Aug 19, 2017 at 6:35
  • @JonClements Yes, the first/IN up to and including the second/NNP. If there is no second NNP, the regex should stop at the first NNP. It is like setting a maximum number of NNPs to extract: if there is only one NNP, it just takes that one. Commented Aug 19, 2017 at 6:39
  • Okay - you don't want a regex for this. Just need to get the rules right... so if there's an IN and nothing after it is an NNP then what? And if there's only one NNP but other stuff after that that isn't an NNP is it in the final output or not? Commented Aug 19, 2017 at 6:40

2 Answers


You have to do this in two passes. First find a block running from the first /IN to the last /NNP, then search within that block and keep only the part up to the second (or nth) /NNP, e.g.:

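import re

def extract(text, n=2):
    try:
        # Pass 1: greedy block from the first word/IN to the last word/NNP
        match = re.search(r'\w+/IN.*\w+/NNP', text).group()
        # Pass 2: keep everything up to the n-th word/NNP (or the last one if fewer)
        last_match = list(re.finditer(r'\w+/NNP', match))[:n][-1]
        return match[:last_match.end()]
    except AttributeError:
        # re.search returned None: there is no IN ... NNP block in the text
        return ''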

Example use and output:

In [36]: extract(text, 1)
Out[36]: 'di/IN jogja/NNP'

In [37]: extract(text, 2)
Out[37]: 'di/IN jogja/NNP buat/VBT malioboro/NNP'

In [38]: extract(text, 3)
Out[38]: 'di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP'

In [39]: extract('nothing to see here')
Out[39]: ''
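As the comments hint, the same rule can also be written without a regex by working on the tokens directly. Below is a minimal sketch, assuming the tokens are separated by whitespace and each tag follows the last slash of its token. The name extract_by_tokens and its max_nnp parameter are made up for this illustration, and max_nnp is assumed to be at least 1.

```python
def extract_by_tokens(text, max_nnp=2):
    # Hypothetical helper, not part of the answer above.
    tokens = text.split()
    # Index of the first token tagged IN, or None if there is none
    start = next((i for i, tok in enumerate(tokens) if tok.endswith('/IN')), None)
    if start is None:
        return ''
    last_nnp = None
    count = 0
    # Walk forward, remembering the most recent NNP until max_nnp are seen
    for i in range(start + 1, len(tokens)):
        if tokens[i].endswith('/NNP'):
            last_nnp = i
            count += 1
            if count == max_nnp:
                break
    if last_nnp is None:
        # An IN was found but no NNP follows it
        return ''
    return ' '.join(tokens[start:last_nnp + 1])
```

Note that joining with a single space normalizes any irregular whitespace in the input, which the regex version does not do.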

The first/IN up to and including the second/NNP

A pattern to implement the rule:

^.*?\b(\w+\/IN(?:.*?\w+\/NNP\b){2})

^.*?      # Starting from the beginning, thus match only first
\b        # A word boundary
(         # Captured group
\w+\/IN   # One or more word chars, then a slash, then 'IN'
(?:       # A non-captured group
.*?\w+    # Anything, lazily matched, followed by one or more word chars
\/NNP\b   # A slash, then 'NNP', then a word boundary
){2}      # Exactly twice
)         # End of captured group
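Since {2} means exactly twice, the pattern above finds nothing when only one /NNP follows the /IN. Here is a rough Python sketch of one way to relax that, assuming a {1,n} quantifier is acceptable for the "stop at the first NNP if there is no second" case from the comments. The function name extract_in_to_nnp and the max_nnp parameter are invented for this example, max_nnp is assumed to be at least 1, and re.search is used in place of the leading ^.*? anchor.

```python
import re

def extract_in_to_nnp(text, max_nnp=2):
    # Hypothetical helper: first word/IN up to at most max_nnp word/NNP tokens.
    # {1,n} is greedy, so it takes n NNPs when available and fewer otherwise.
    pattern = r'\b(\w+/IN\b(?:.*?\w+/NNP\b){1,%d})' % max_nnp
    match = re.search(pattern, text)
    return match.group(1) if match else ''

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")
print(extract_in_to_nnp(text))     # di/IN jogja/NNP buat/VBT malioboro/NNP
print(extract_in_to_nnp(text, 1))  # di/IN jogja/NNP
```

The lazy .*? inside the group is what keeps each repetition from running past the nearest /NNP.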

