Python XML regular expression matching issue

Question

I have been trying to match tag names only (without the < and > signs) in the case of regular tags:

<w:tag w:attrib1="http://url" w:attrib2="anyValue">

without matching solo tags (opening-closing tags):

<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />

(Please note that the URLs in the attributes contain forward slashes (/).)

but could not manage to get to it with:

regex = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')

print(regex.findall(string))

getting this:

['w:tag','w:tag2']

expecting this:

['w:tag']

any thoughts?

Cheers.

Community · Accepted Answer · 2017-05-23 12:27:05Z

1

1) Go easy on the lookahead/lookbehind; they're hard to control and you rarely need them. Use capturing groups to extract part of the matched string. Use negated character classes and non-greedy search (if needed) to avoid matching too much:

re.findall(r'<\s*(w:\w+)[^>]*(?<!/)>', string)

Easier to read, isn't it? However,

2) Don't do this at all! Don't rely on regular expressions to parse XML or HTML; you're just asking for heartbreak. See https://stackoverflow.com/a/1732454/699305 for the details. :-) Get familiar with Python's xml.etree.ElementTree and its XPath expressions instead. It'll take some getting used to, but it will be time well spent; you won't regret it.
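Both suggestions can be tried side by side. The following is only a minimal sketch: it uses the two sample tags from the question, and for the ElementTree half it assumes a made-up namespace URI, a hypothetical wrapper element w:root, and some text inside w:tag so that the fragment is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# The two sample tags from the question, as raw (possibly broken) text.
string = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">\n'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />\n'
)

# Suggestion 1: capture the name, skip to the first '>', and reject the
# match when that '>' is directly preceded by '/'.
open_tag = re.compile(r'<\s*(w:\w+)[^>]*(?<!/)>')
regex_names = open_tag.findall(string)
print(regex_names)

# Suggestion 2: parse well-formed XML instead. The namespace URI and the
# w:root wrapper are assumptions made up for this example.
NS = "http://example.com/w"
doc = (
    f'<w:root xmlns:w="{NS}">'
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">text</w:tag>'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
    '</w:root>'
)
root = ET.fromstring(doc)

# ElementTree does not record whether an element was written as <x/> or
# <x></x>, so "not a solo tag" is approximated as "has text or children".
parsed_names = [
    el.tag.replace('{' + NS + '}', 'w:')
    for el in root
    if len(el) > 0 or (el.text and el.text.strip())
]
print(parsed_names)
```

Note that the two halves answer slightly different questions: the regex looks at how a tag is written, while the parser only sees the resulting element, so an empty `<w:tag></w:tag>` would be treated like a solo tag by the second approach.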

edited May 23, 2017 at 12:27

CommunityBot


answered Oct 27, 2012 at 21:00

alexis


1 Comment

devdc Over a year ago

I know XML and lxml well and love them, but this time I need to handle some broken input. Thanks for your detailed answer. It works like a charm and indeed looks better than what I came up with.

Emil · Accepted Answer · 2012-10-27 18:21:00Z

0

Found it:

regex = re.compile(r'(?<=<)w:\w+(?=>)|(?<=<)w:\w+(?=[\s\w+:\w+="[\w/:.-]+"]{0,10}>)')

edited Oct 27, 2012 at 18:21

Emil


answered Oct 27, 2012 at 18:00

devdc

