Python parse XML files with HTML content

Question

I use an API to get some XML files but some of them contain HTML tags without escaping them. For example,   or 

I use this code to read them, but the files with the HTML raise an error. I don't have access to change manually all the files. Is there any way to parse the file without losing the HTML tags?

from xml.dom.minidom import parse, parseString

xml = ...#here is the api to receive the xml file
dom = parse(xml)
strings = dom.getElementsByTagName("string")

How about replacing   with   before parsing the xml? And I don't see what's wrong with ? Also, consider using ElementTree instead of minidom; minidom can cause memory leaks. — Aran-Fey
– Aran-Fey, Commented Jan 29, 2015 at 10:43
The xml variable is a file_path. So, how can I replace the tag before parsing? Can you give an example of this code with ElementTree and replacing as an answer, to see if it works and accept your solution? — Tasos
– Tasos, Commented Jan 29, 2015 at 10:49

Aran-Fey · Accepted Answer · 2015-01-29 11:06:48Z

2

Read the xml file as a string, and fix the malformed tags before you parse it:

import xml.etree.ElementTree as ET

with open(xml) as xml_file: # open the xml file for reading
    text= xml_file.read() # read its contents
text= text.replace('<br>', '<br />') # fix malformed tags
document= ET.fromstring(text) # parse the string
strings= document.findall('string') # find all string elements

answered Jan 29, 2015 at 11:06

Aran-Fey

44k13 gold badges113 silver badges161 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Tasos Over a year ago

For a reason I cannot understand, if I use text = text.replace(' ', ' ') the string after the tag disappear. If I use text = text.replace(' ', "") all the string is there, but obviously without the new line

sepulchered · Accepted Answer · 2015-01-29 11:22:16Z

2

If you can use third-party libs I suggest you to use Beautiful Soup it can handle xml as well as html and also it parses broken markup, also providing easy to use api.

answered Jan 29, 2015 at 11:22

sepulchered

8148 silver badges18 bronze badges

Collectives™ on Stack Overflow

Python parse XML files with HTML content

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related