Process xml input with python regex

Question

I need to clean up an xml file before I can process it. The file has junk at the start and end, and then junk in between elements. Here is an example file:

junkjunkjunkjunk<root>
\par junkjunkjunkjunkjunk<level1>useful info to keep</level1>
</root>
junkjunkjunkjunk

How do I use regex to cut out (with replace?) the start and end junk, and then the middle junk? The middle junk always starts with "\par ...".

related: Can I get lxml to ignore non-XML content before and after the root tag? — jfs
– jfs, Commented Jun 24, 2013 at 21:14

robjohncox · Accepted Answer · 2013-06-24 21:21:48Z

2

The following statements should remove the junk (assuming your document is stored in a variable called xml):

import re

xml = re.sub(r'.*<root>', '<root>', xml, flags=re.DOTALL)    # Remove leading junk
xml = re.sub(r'\\par[^<]*<', '<', xml)                       # Middle junk
xml = re.sub(r'</root>.*', '</root>', xml, flags=re.DOTALL)  # Trailing junk

Note that this assumes you know the name of the root element (and in this case, it's called root), otherwise you may need to adjust this slightly.

edited Jun 24, 2013 at 21:21

answered Jun 24, 2013 at 15:08

robjohncox

3,6733 gold badges27 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

jfs Over a year ago

re.DOTALL flag is missing.

Collectives™ on Stack Overflow

Process xml input with python regex

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related