Python Regex Question

Question

I have an end tag followed by a carriage return line feed (x0Dx0A) followd by one or more tabs (x09) followed by a new start tag .

Something like this:

</tag1>x0Dx0Ax09x09x09<tag2> or </tag1>x0Dx0Ax09x09x09x09x09<tag2>

What Python regex should I use to replace it with something like this:

</tag1><tag3>content</tag3><tag2>

Thanks in advance.

Parsing XML yourself? Not a good idea. IT seams that you will have additional problems porting your code to Python 3. How about trying to use an existing xml parsing solutions instead? — sorin
– sorin, Commented Jul 23, 2010 at 21:25

John Machin · Accepted Answer · 2010-07-24 01:26:29Z

1

Here is code for something like what you say that you need:

>>> import re
>>> sample = '</tag1>\r\n\t\t\t\t<tag2>'
>>> sample
'</tag1>\r\n\t\t\t\t<tag2>'
>>> pattern = '(</tag1>)\r\n\t+(<tag2>)'
>>> replacement = r'\1<tag3>content</tag3>\2'
>>> re.sub(pattern, replacement, sample)
'</tag1><tag3>content</tag3><tag2>'
>>>

Note that \r\n\t+ may be a bit too specific, especially if production of your input is not under your control. It may be better to adopt the much more general \s* (zero or more whitespace characters).

Using regexes to parse XML and HTML is not a good idea in general ... while it's hard to see a failure mode here (apart from elementary errors in getting the pattern correct), you might like to tell us what the underlying problem is, in case some other solution is better.

answered Jul 24, 2010 at 1:26

John Machin

83.2k12 gold badges147 silver badges193 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Python Regex Question

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related