How to specify repetitions in a regex in Python

Question

I have this string that I want to process:

rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/.

I want to extract the di/IN jogja/NNP buat/VBT malioboro/NNP words from that sentence. This is my code so far:

import re

def entityExtractPreposition(text):
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    return text

text = "rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/."
prepo = entityExtractPreposition(text)
print(prepo)

The result extracts too many words:

di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP

My expected result is:

di/IN jogja/NNP buat/VBT malioboro/NNP

I read in some references that there are quantifiers such as *, + and ? for controlling repetition (in my case, of /NNP). What is the best way to specify or limit the number of repetitions in a regex?

What's the rule for extraction? Is it everything after the last word/IN item or...
– Jon Clements, Commented Aug 19, 2017 at 6:30
@JonClements the rule is to take every word from the word/IN up to the second word/NNP
– ytomo, Commented Aug 19, 2017 at 6:31
So... the first/IN up to and including the second/NNP? What if there's no NNP, or no second NNP?
– Jon Clements, Commented Aug 19, 2017 at 6:35
@JonClements yes, the first/IN up to and including the second/NNP. If there is no second NNP, the regex stops at the first NNP. It is like setting a maximum number of NNPs to extract; if there is only one NNP, it takes just that one.
– ytomo, Commented Aug 19, 2017 at 6:39
Okay - you don't want a regex for this. Just need to get the rules right... so if there's an IN and nothing after it is an NNP, then what? And if there's only one NNP but other stuff after it that isn't an NNP, is that in the final output or not?
– Jon Clements, Commented Aug 19, 2017 at 6:40

Jon Clements · Accepted Answer · 2017-08-19 07:42:16Z

1

You have to do this in two passes. First find a block running from /IN to /NNP, then search within that block and take only up to the second (or n-th) /NNP, e.g.:

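import re

def extract(text, n=2):
    try:
        # Pass 1: greedy match from the first word/IN to the last word/NNP.
        match = re.search(r'\w+/IN.*\w+/NNP', text).group()
        # Pass 2: cut the block after its n-th word/NNP (or its last one, if fewer).
        last_match = list(re.finditer(r'\w+/NNP', match))[:n][-1]
        return match[:last_match.end()]
    except AttributeError:
        # re.search returned None: the text has no IN ... NNP block.
        return ''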

Example use and output:

In [36]: extract(text, 1)
Out[36]: 'di/IN jogja/NNP'

In [37]: extract(text, 2)
Out[37]: 'di/IN jogja/NNP buat/VBT malioboro/NNP'

In [38]: extract(text, 3)
Out[38]: 'di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP'

In [39]: extract('nothing to see here')
Out[39]: ''
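Since the rule is really about whole tokens, the same result can also be had without a regex. The following is only a minimal sketch under some assumptions: tokens are separated by whitespace, a tag is whatever follows the last slash, and n is at least 1. The name extract_tokens is made up for illustration.

```python
def extract_tokens(text, n=2):
    """Return tokens from the first */IN through the n-th */NNP after it."""
    tokens = text.split()
    # Locate the first token tagged IN; give up if there is none.
    start = next((i for i, tok in enumerate(tokens) if tok.endswith('/IN')), None)
    if start is None:
        return ''
    end = None   # index of the last NNP token accepted so far
    seen = 0     # how many NNP tokens have been accepted
    for i in range(start + 1, len(tokens)):
        if tokens[i].endswith('/NNP'):
            seen += 1
            end = i
            if seen == n:
                break
    if end is None:
        # An IN was found, but no NNP follows it.
        return ''
    return ' '.join(tokens[start:end + 1])


text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")
print(extract_tokens(text, 2))
```

Unlike the regex version, this rejoins tokens with single spaces, so runs of whitespace in the input are normalised.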

edited Aug 19, 2017 at 7:42

answered Aug 19, 2017 at 7:33

Jon Clements


linden2015 · Answer · 2017-08-19 11:50:28Z

0

The first/IN up to and including the second/NNP

A pattern to implement the rule:

^.*?\b(\w+\/IN(?:.*?\w+\/NNP\b){2})

^.*?      # Start from the beginning, so only the first occurrence is matched
\b        # A word boundary
(         # Start of the capturing group
\w+\/IN   # One or more word chars, then a slash, then 'IN'
(?:       # Start of a non-capturing group
.*?\w+    # Anything, lazily matched, followed by one or more word chars
\/NNP\b   # A slash, then 'NNP', then a word boundary
){2}      # Exactly twice
)         # End of the capturing group
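In Python, this pattern can be used with re.search. The sketch below is an adaptation, not the pattern exactly as written above: it uses {1,n} instead of {2} so that it falls back to a single /NNP when there is no second one (as asked for in the question's comments), drops the leading ^.*? because re.search already returns the leftmost match, and leaves the slashes unescaped since Python does not require escaping them. The function name extract_single_pass is made up for illustration.

```python
import re

def extract_single_pass(text, n=2):
    # {1,n} is greedy: it takes up to n word/NNP tokens and backtracks
    # to fewer if the text does not contain that many after the word/IN.
    pattern = r'\b(\w+/IN\b(?:.*?\w+/NNP\b){1,%d})' % n
    match = re.search(pattern, text)
    return match.group(1) if match else ''


text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")
print(extract_single_pass(text, 2))
```

The quantifier {m,n} means "at least m and at most n repetitions", which is the general way to put an upper bound on a repeated group.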


edited Aug 19, 2017 at 11:50

answered Aug 19, 2017 at 11:44

linden2015
