I have been trying to match tag names only (without the < and > signs) in the case of regular opening tags:

<w:tag w:attrib1="http://url" w:attrib2="anyValue">

without matching solo (self-closing) tags:

<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />

(please pay attention to the URLs in the attributes as they contain forward slashes (/))

but I could not get there with:

regex = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')

print(regex.findall(string))

getting this:

['w:tag','w:tag2']

expecting this:

['w:tag']

any thoughts?

Cheers.

2 Answers

1) Go easy on the lookaheads/lookbehinds; they're hard to control and you rarely need them. Use capturing groups to extract part of the matched string. Use negated character classes and non-greedy matching (if needed) to avoid matching too much:

re.findall(r'<\s*(w:\w+)[^>]*(?<!/)>', string)

Easier to read, isn't it? However,

2) Don't do this at all! Don't rely on REs to match XML or HTML, you're just asking for heartbreak. See https://stackoverflow.com/a/1732454/699305 for the details. :-) Get familiar with using Python's xml.etree.ElementTree with XPath expressions instead. It'll take some getting used to, but it will be time well spent; you won't regret it.
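To make both points concrete, here is a minimal sketch rather than a definitive implementation. It runs the question's pattern and the pattern from point 1 against the two sample tags, then does a rough parser-based equivalent. The wrapper element `w:root` and the namespace URI `http://example.com/w` are placeholders I made up so the sample parses as well-formed XML. A parser does not record whether an empty element was written as a solo tag, so the sketch assumes "regular tag" can be read as "element that has children".

```python
import re
import xml.etree.ElementTree as ET

# The two sample tags from the question.
opening = '<w:tag w:attrib1="http://url" w:attrib2="anyValue">'
solo = '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
text = opening + solo

# The question's pattern: (?!\s/>) is tested right after the tag name,
# where the attributes begin, so it never sees the trailing " />".
original = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')
original_names = original.findall(text)

# Point 1: capture the name, let [^>]* consume the attributes, and use
# (?<!/) to reject tags whose last character before '>' is '/'.
pattern = re.compile(r'<\s*(w:\w+)[^>]*(?<!/)>')
regex_names = pattern.findall(text)

# Point 2: a parser-based sketch on a well-formed document.
W = 'http://example.com/w'  # placeholder namespace URI (assumption)
doc = (
    f'<w:root xmlns:w="{W}">'  # hypothetical wrapper element
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
    '</w:tag>'
    '</w:root>'
)
root = ET.fromstring(doc)

# ElementTree reports names as '{uri}local'; rebuild the 'w:' form.
# './/*' selects every descendant of the root; keep those with children.
with_children = [
    'w:' + el.tag.split('}', 1)[1]
    for el in root.findall('.//*')
    if len(el) > 0
]

print(original_names)
print(regex_names)
print(with_children)
```

Of course the parser route only helps when the input is well-formed; for broken markup the regex from point 1 is the pragmatic fallback.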


1 Comment

I know XML and lxml well and love them, but this time I need to handle some broken stuff... Thanks for your detailed answer. It works like a charm and indeed looks better than what I came up with.

Found it:

regex = re.compile(r'(?<=<)w:\w+(?=>)|(?<=<)w:\w+(?=[\s\w+:\w+="[\w/:.-]+"]{0,10}>)')
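A quick check of this pattern, as a sketch only. One thing to be aware of: the square brackets open a character class that ends at the first `]`, so the lookahead is not read as a repeated `name="value"` sequence. It effectively requires a run of allowed characters ending in `">`. That works on the two sample tags, but an attribute value containing a character outside the class (the `tricky` string below is a made-up example) is silently skipped, whereas the capturing-group pattern from the other answer still matches it.

```python
import re

# The pattern from this answer, written as a raw string.
own = re.compile(
    r'(?<=<)w:\w+(?=>)|(?<=<)w:\w+(?=[\s\w+:\w+="[\w/:.-]+"]{0,10}>)'
)
# The capturing-group pattern from the other answer, for comparison.
other = re.compile(r'<\s*(w:\w+)[^>]*(?<!/)>')

# The two sample tags from the question.
sample = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
)
# Hypothetical input: the comma is not in the character class.
tricky = '<w:tag w:attrib1="a,b">'

own_sample = own.findall(sample)
own_tricky = own.findall(tricky)
other_tricky = other.findall(tricky)

print(own_sample)
print(own_tricky)
print(other_tricky)
```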
