
I use an API to get some XML files, but some of them contain unescaped HTML tags, for example <br> or <b></b>.

I use the code below to read them, but the files that contain HTML raise an error. I can't edit all the files by hand. Is there any way to parse them without losing the HTML tags?

from xml.dom.minidom import parse, parseString

xml = ...  # the API call that returns the XML file goes here
dom = parse(xml)
strings = dom.getElementsByTagName("string")
2
  • How about replacing <br> with <br /> before parsing the XML? I don't see what's wrong with <b></b>, though. Also, consider using ElementTree instead of minidom; minidom can cause memory leaks. Commented Jan 29, 2015 at 10:43
  • The xml variable is a file path. So how can I replace the tag before parsing? Could you post an example with ElementTree and the replacement as an answer, so I can check that it works and accept your solution? Commented Jan 29, 2015 at 10:49

2 Answers

2

Read the XML file as a string and fix the malformed tags before parsing it:

import xml.etree.ElementTree as ET

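with open(xml) as xml_file:  # xml is the path to the file
    text = xml_file.read()  # read the whole file into a string
text = text.replace('<br>', '<br />')  # self-close the unclosed <br> tags
document = ET.fromstring(text)  # parse the repaired string
# findall('string') only matches direct children of the root;
# iter() matches at any depth, like minidom's getElementsByTagName
strings = list(document.iter('string'))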
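
If other void HTML tags besides <br> show up, or the tags carry attributes, the same idea can probably be generalized with a regular expression. This is only a minimal sketch, not a full HTML repair: the tag list VOID_TAGS, the helper name close_void_tags, and the sample document are assumptions made up for illustration, and it will not help with markup that is broken in other ways (for example an unclosed <b>).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the only unclosed tags in the files are void HTML tags like these.
VOID_TAGS = ("br", "hr")

# Matches <br>, <br/>, <br /> and <br class="x">, but not </br>, <b> or <break>.
_VOID_RE = re.compile(r"<(%s)\b([^<>]*?)/?>" % "|".join(VOID_TAGS), re.IGNORECASE)


def close_void_tags(text):
    """Rewrite every listed void tag as a self-closing tag, keeping attributes."""
    return _VOID_RE.sub(r"<\1\2/>", text)


# Hypothetical sample standing in for a file returned by the API.
sample = "<resources><string>first<br>second <b>bold</b></string></resources>"

root = ET.fromstring(close_void_tags(sample))
strings = list(root.iter("string"))  # <string> elements at any depth
```

Tags that are already self-closed pass through unchanged, so running the helper on a well-formed file should be harmless.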

1 Comment

For a reason I cannot understand, if I use text = text.replace('<br>', '<br />'), the text after the tag disappears. If I use text = text.replace('<br>', ""), the whole string is there, but obviously without the line break.
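
A likely explanation for the disappearing text (an assumption, since the comment doesn't show how the text was read): ElementTree stores text that follows a child element in that child's tail attribute, not in the parent's text. Once <br /> is a real element, the parent's text holds only the part before it. A small sketch with a made-up sample string; the helper name inner_markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# Made-up sample: a <string> element with a <br /> in the middle.
elem = ET.fromstring("<string>first line<br />second line</string>")
br = elem.find("br")

before = elem.text  # only the text before the first child element
after = br.tail     # the text after <br /> lives on the <br /> element itself

# All text in document order, with the tags dropped.
full_text = "".join(elem.itertext())


def inner_markup(element):
    """Return the element's content with its child tags kept as markup."""
    parts = [element.text or ""]
    # ET.tostring() on a child includes that child's tail text.
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)


markup = inner_markup(elem)
```

So the text is not lost, it is just attached to a different node; itertext() collects all of it, and serializing the children keeps the HTML tags too.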
2

If you can use third-party libraries, I suggest Beautiful Soup. It handles XML as well as HTML, parses broken markup, and provides an easy-to-use API.
