48

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags. Here is my attempt:

regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)

Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']

What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?

5 Answers

73
import re
regex = ur"\[P\] (.+?) \[/P\]"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)

yields

['Barack Obama', 'Bill Gates']

The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode string as u'[[1P].+?[/P]]+?', only harder to read.

The first bracketed group [[1P] tells re to match any one of the characters in the list ['[', '1', 'P'], and similarly with the second bracketed group [/P]]. That's not what you want at all. So,

  • Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
  • To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
  • To return only the words inside the tags, place grouping parentheses around .+?.
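
The three fixes above can be sketched as follows. This is only an illustration: it uses plain raw strings instead of the ur prefix so that it runs on both Python 2.7 and Python 3, and the variable names (class_hits, with_tags, names_only) are made up for the example. It also shows that dropping the parentheses makes findall return the whole match, tags included, which is the other output the question asks for.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Unescaped, [P] is a character class: it matches the single letter "P"
# wherever it occurs, not the literal tag.
class_hits = re.findall(r"[P]", line)

# Escaped brackets match the literal tags. With no capturing group,
# findall returns each whole match, tags included.
with_tags = re.findall(r"\[P\] .+? \[/P\]", line)

# With one capturing group, findall returns only the group's text.
names_only = re.findall(r"\[P\] (.+?) \[/P\]", line)

print(class_hits)
print(with_tags)
print(names_only)
```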

16

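Try this:

    import re

    subject = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
    for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
        # match.start() and match.end() give the span (end is exclusive)
        # match.group() is the whole match; match.group(1) is the text between the tags
        print(match.group(1).strip())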

1 Comment

I really like this answer. If you want to process only matches then this does it without any extra statements like 1) save the list, 2) process the list. Isn't that equivalent to:

    str = 'purple [email protected], blah monkey [email protected] blah dishwasher'
    ## Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)  ## ['[email protected]', '[email protected]']
    for email in emails:
        # do something with each found email string
        print email
4

Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:

>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
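
Note that \s? tolerates at most one whitespace character on each side of the name. If the padding inside the tags can vary, \s* is one way to absorb it. The sketch below assumes such input; the sample string is invented for illustration and is not from the question.

```python
import re

# Hypothetical input: padding around the names differs from tag to tag.
sample = ("[P]Barack Obama[/P] met [P]   Bill Gates [/P] "
          "and [P] Ada Lovelace   [/P].")

# \s* swallows any run of whitespace (including none) next to each tag,
# so the captured group holds only the name itself.
names = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", sample)
print(names)
```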


2

You can replace your pattern with

regex = ur"\[P\]([\w\s]+)\[\/P\]"

4 Comments

Take care with your formatting; use the preview region. Because you didn't format it properly, the backslashes were guzzled (markdown is poor like that).
Why do you do [\w\s]+ rather than .*? which is what he used? Seems to me .*? is more likely to be what he wants, anyway. [\w\s] is horribly limiting.
The limitation is intentional. I use [\w\s]+ because apparently the asker wants to extract names, which rarely contain numbers. Also note that the asker wanted to extract words, not numbers. Just my opinion though, correct me if I'm wrong.
What about names with such interesting features as accents? not re.match('\w', u'é'). If the names are arbitrary, you should not discount the possibility of non-Latin names.
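
The point about accents can be checked with a small sketch. It assumes Python 3, where re.ASCII exists and reproduces the ASCII-only behaviour of \w that Python 2 shows by default; on Python 2 the fix would be to pass re.UNICODE with unicode strings. The tagged name is a made-up example.

```python
import re

# Hypothetical tagged name containing an accented letter (e with acute).
text = u"[P] Jos\u00e9 [/P]"
pattern = r"\[P\]([\w\s]+)\[/P\]"

# ASCII-only \w stops at the accented letter, so nothing matches.
ascii_only = re.findall(pattern, text, re.ASCII)

# Unicode-aware \w accepts the accented letter; strip the padding
# that [\w\s]+ captures along with the name.
unicode_aware = [m.strip() for m in re.findall(pattern, text, re.UNICODE)]

print(ascii_only)
print(unicode_aware)
```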
2

Use this pattern,

pattern = r'\[P\].+?\[\/P\]'

1 Comment

This is a duplicate answer (it adds nothing to the current top answer), and it is also incorrect. It will match but not capture anything (there is no capture group), so it doesn't answer the question, which is to use re.findall to get the matched text.
