0

I need to clean up an xml file before I can process it. The file has junk at the start and end, and then junk in between elements. Here is an example file:

junkjunkjunkjunk<root>
\par junkjunkjunkjunkjunk<level1>useful info to keep</level1>
</root>
junkjunkjunkjunk

How do I use regex to cut out (with replace?) the start and end junk, and then the middle junk? The middle junk always starts with "\par ...".

2

1 Answer 1

2

The following statements should remove the junk (assuming your document is stored in a variable called xml):

import re

xml = re.sub(r'.*<root>', '<root>', xml, flags=re.DOTALL)    # Remove leading junk
xml = re.sub(r'\\par[^<]*<', '<', xml)                       # Middle junk
xml = re.sub(r'</root>.*', '</root>', xml, flags=re.DOTALL)  # Trailing junk

Note that this assumes you know the name of the root element (and in this case, it's called root), otherwise you may need to adjust this slightly.

Sign up to request clarification or add additional context in comments.

1 Comment

re.DOTALL flag is missing.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.