I have this string that I want to process:
rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/.
I want to extract the di/IN jogja/NNP buat/VBT malioboro/NNP tokens from that sentence. This is my code so far:
import re

def entityExtractPreposition(text):
    # Find every span that starts at a word/IN token and ends at a /NNP tag
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    return text

text = "rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/."
prepo = entityExtractPreposition(text)
print(prepo)
The result captures too many words:
di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP
My expected result is:
di/IN jogja/NNP buat/VBT malioboro/NNP
Some references I read say there are quantifiers for limiting repetition (in my case, of /NNP), such as *, +, and ?. What is the best way to set or limit the number of repetitions in a regex?
word/IN item or...word/IN until 2 words of word/NNP
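For reference, here is a rough sketch of the kind of thing I am after, using a bounded quantifier {m,n} instead of an unbounded *. It assumes every token has the form word/TAG with a single slash and tokens are separated by whitespace. The function name extract_preposition_phrase and the max_nnp parameter are made up for illustration. The group "any non-NNP tokens followed by one NNP token" is allowed to repeat at most max_nnp times, so the match cannot run on to a third /NNP the way the greedy (?:...)* in my code does.

```python
import re

def extract_preposition_phrase(text, max_nnp=2):
    # Hypothetical helper: names and defaults are illustrative only.
    # One "word/TAG" token whose tag is anything except NNP.
    other = r'\s+[^\s/]+/(?!NNP\b)\S+'
    # One "word/NNP" token.
    nnp = r'\s+[^\s/]+/NNP\b'
    # A word/IN token, then between 1 and max_nnp repetitions of
    # "zero or more non-NNP tokens followed by exactly one NNP token".
    pattern = r'[^\s/]+/IN\b(?:(?:%s)*%s){1,%d}' % (other, nnp, max_nnp)
    return re.findall(pattern, text)

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")
print(extract_preposition_phrase(text))
```

With max_nnp=2 this stops at the second /NNP token. Note that this sketch does not stop at a second /IN token the way my original pattern tries to, so that would need an extra exclusion if it matters.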