Python regex findall

Question

I am trying to extract all occurrences of tagged words from a string using a regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags. Here is my attempt:

import re

regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?

unutbu · Accepted Answer · 2011-10-13 10:32:59Z

73

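import re

# Escape the literal brackets; the parentheses capture only the text between the tags.
# (ur"" is Python 2 syntax; on Python 3 use r"" instead.)
regex = ur"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)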

yields

['Barack Obama', 'Bill Gates']

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
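As a quick check of that equivalence, the sketch below compares the escaped and the literal spelling. It is written for Python 3 and uses a normal (non-raw) string, since the ur"" prefix from the question is not valid syntax there; the variable names are only illustrative.

```python
# \u005B is "[", \u005D is "]" and \u002F is "/", so both spellings
# produce the same pattern text.
escaped = "[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
literal = "[[1P].+?[/P]]+?"

same = (escaped == literal)
print(same)  # True
```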

The first bracketed group [[1P] is a character class: it tells re to match any one of the characters '[', '1', or 'P'. Likewise, the second bracketed group [/P] matches either '/' or 'P', and the trailing ]+? matches one or more literal ] characters. That's not what you want at all. So,

Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
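The steps above can be sketched as follows. This is a minimal illustration, written with r"" strings so that it runs on Python 3; the variable names are made up for the example.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# Unescaped, [P] is a character class that matches the single character "P".
as_class = re.findall(r"[P]", line)

# Escaped, \[P\] matches the literal three-character tag "[P]".
as_literal = re.findall(r"\[P\]", line)

# With the brackets escaped and a group around .+?, findall returns the names.
names = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(as_class)    # ['P', 'P', 'P', 'P', 'P']
print(as_literal)  # ['[P]', '[P]']
print(names)       # ['Barack Obama', 'Bill Gates']
```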

edited Oct 13, 2011 at 10:32

answered Oct 13, 2011 at 10:20


FailedDev · Accepted Answer · 2011-10-13 10:21:12Z

16

Try this:

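import re

# subject is the string to search, e.g. the line from the question
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    start = match.start()   # match start
    end = match.end()       # match end (exclusive)
    text = match.group()    # whole matched text, tags included
    inner = match.group(1)  # captured text between the tags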

answered Oct 13, 2011 at 10:21


1 Comment

kkron Over a year ago

I really like this answer. If you want to process only the matches, this does it without extra steps such as 1) saving the list and 2) processing the list. Isn't that equivalent to the following?

str = 'purple [email protected], blah monkey [email protected] blah dishwasher'
## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)  ## ['[email protected]', '[email protected]']
for email in emails:
    # do something with each found email string
    print email
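To make the comparison in this comment concrete, here is a small sketch that runs both approaches over the line from the question. It assumes Python 3 and the pattern from the answer above; subject and the list names are illustrative.

```python
import re

subject = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
pattern = r"\[P[^\]]*\](.*?)\[/P\]"

# findall builds the complete list of captured strings up front.
found = re.findall(pattern, subject)

# finditer yields match objects one at a time, so positions are available too.
spans = []
inner = []
for match in re.finditer(pattern, subject):
    spans.append(match.span())
    inner.append(match.group(1))

print(found == inner)                    # True
print([text.strip() for text in inner])  # ['Barack Obama', 'Bill Gates']
```

Both give the same captured text; finditer is mainly useful when the match positions or several groups are needed, or when the list should not be built in one go.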

Blair · Accepted Answer · 2011-10-13 10:24:22Z

4

Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:

>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[/P\]', line)
['Barack Obama', 'Bill Gates']
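The optional \s? matters when the tags are not padded with spaces. A small sketch, assuming Python 3; the unpadded input is a hypothetical variation, not something from the question.

```python
import re

strict = r"\[P\] (.+?) \[/P\]"        # requires a space after [P] and before [/P]
lenient = r"\[P\]\s?(.+?)\s?\[/P\]"   # the surrounding whitespace is optional

spaced = "[P] Barack Obama [/P]"
tight = "[P]Barack Obama[/P]"         # hypothetical input without padding

strict_spaced = re.findall(strict, spaced)
strict_tight = re.findall(strict, tight)
lenient_spaced = re.findall(lenient, spaced)
lenient_tight = re.findall(lenient, tight)

print(strict_spaced, strict_tight)    # ['Barack Obama'] []
print(lenient_spaced, lenient_tight)  # ['Barack Obama'] ['Barack Obama']
```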

answered Oct 13, 2011 at 10:24


Chris Morgan · Accepted Answer · 2011-10-13 12:41:41Z

2

You can replace your pattern with:

regex = ur"\[P\]([\w\s]+)\[\/P\]"

edited Oct 13, 2011 at 12:41

Chris Morgan


answered Oct 13, 2011 at 10:31

pram


4 Comments

Chris Morgan Over a year ago

Take care with your formatting; use the preview region. Because you didn't format it properly, the backslashes were guzzled (markdown is poor like that).

Chris Morgan Over a year ago

Why do you do [\w\s]+ rather than .*? which is what he used? Seems to me .*? is more likely to be what he wants, anyway. [\w\s] is horribly limiting.

pram Over a year ago

The limitation is intentional. I use [\w\s]+ because apparently the asker wants to extract names, which rarely contain numbers. Also note that the asker wanted to extract words, not numbers. Just my opinion though, cmiiw

Chris Morgan Over a year ago

What about names with such interesting features as accents? not re.match('\w', u'é'). If the names are arbitrary, you should not discount the possibility of non-Latin names.
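The point about accents can be checked with a short sketch. It assumes Python 3, where \w is Unicode-aware for str patterns by default; the re.ASCII flag is used here to stand in for the Python 2 behaviour described in the comment, where \w without re.UNICODE covers only [a-zA-Z0-9_]. The sample name is hypothetical.

```python
import re

name = "[P]Jos\u00e9 N\u00fa\u00f1ez[/P]"   # "José Núñez", a hypothetical input

# ASCII-only \w: the accented letters are not word characters, so nothing matches.
ascii_only = re.findall(r"\[P\]([\w\s]+)\[/P\]", name, re.ASCII)

# Unicode-aware \w (the Python 3 default for str patterns) accepts them.
unicode_aware = re.findall(r"\[P\]([\w\s]+)\[/P\]", name)

# .+? places no restriction on the characters between the tags.
anything = re.findall(r"\[P\](.+?)\[/P\]", name)

print(ascii_only)                 # []
print(unicode_aware == anything)  # True
```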

Sohn · Accepted Answer · 2016-07-18 06:16:44Z

2

Use this pattern:

pattern = r'\[P\].+?\[/P\]'


answered Jul 18, 2016 at 6:16


1 Comment

LightCC Over a year ago

This is a duplicate answer (it adds nothing to the current top answer) and it is also incorrect. It will match but not capture anything (there is no capture group), so it doesn't answer the question, which is how to use re.findall to get the matched text.
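For reference, a minimal sketch of how re.findall treats capture groups: with no group it returns each whole match, tags included; with one group it returns only the group's text. These correspond to the two outputs listed in the question. It assumes Python 3 and uses illustrative variable names.

```python
import re

line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."

# No capture group: findall returns the whole matched text.
whole = re.findall(r"\[P\].+?\[/P\]", line)

# One capture group: findall returns only what the group matched.
inner = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(whole)  # ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
print(inner)  # ['Barack Obama', 'Bill Gates']
```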
