python parse XML with lxml giving the right parser parameters

Question

I am parsing xml data with lxml in python

The data looks like this:

string='''<?xml version="1.0" encoding="UTF-8"?>/n
    <div type="request" xml:base="/k-api/7728" xml:lang="en" >
    <div n="" type="request" xml:id="_54f59d0003">
        <p xml:id="_54f59d0004"/>
        <p xml:id="_54f59d0005">Requests </p>
    </div>
    <div n="0001" type="request" xml:id="_54f59d0006">
        <p xml:id="_54f59d0007">1.  First request.
        </p>
    </div>
    <div n="0002" type="claim" xml:id="_54f59d0008">
         <p xml:id="_54f59d0009">2. Second request.
         </p>
    </div>
    <div n="0003" type="request" xml:id="_54f59d0010">
         <p xml:id="_54f59d0011">3. Thrid requests.
         </p>
    </div>
    <div n="0004" type="request" xml:id="_54f59d0012">
        <p xml:id="_54f59d0013">4. request.
        </p>
    </div>
</div>'''


import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(xml_string,parser=parser)

This does not work because several reasons a) the line break \n: I can solve that by

xml_string = ''.join(string.splitlines())

but I am wondering if there is a way to tell in the parser that lxml should not take care of line breaks b) Utf-8 first line in the string. I can also take care of it by:

xml_string = xml_string.replace('<?xml version="1.0" encoding="UTF-8"?>','')

before parsing, but is there a way to do it all inside the lxml parser?, i.e telling the parser to remove line breaks and to forget about the encoding (note: encoding="UTF-8" or encoding=None will not solve the problem)

Thanks

EDIT 1: The rror that I get when not removing the encoding bit is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Your code works fine for me, using triple quotes around string and XML_tree = etree.fromstring(string.encode('utf-8'), parser=parser) — Maurice Meyer
– Maurice Meyer, Commented Mar 23, 2021 at 11:27
Have a look at stackoverflow.com/questions/28534460/… re the encoding. — yvesonline
– yvesonline, Commented Mar 23, 2021 at 11:46

CodeMonkey · Accepted Answer · 2021-05-31 19:55:19Z

1

etree.fromstring() function should have the XML string input encoded as bytes to parse correctly if the XML fragment includes the XML declaration.

Alternatively, can use ElementTree.fromstring() function.

import xml.etree.ElementTree as ET
from lxml import etree

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div...>
</div>'''

parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

# Option 1
root = etree.fromstring(xml_string.encode('utf-8'), parser)

# Option 2
root = ET.fromstring(xml_string, parser)

# do something with the parsed XML

pretty_xml = etree.tostring(root, pretty_print=True, encoding=str)
print(pretty_xml)

edited May 31, 2021 at 19:55

answered May 31, 2021 at 17:02

CodeMonkey

23.9k4 gold badges37 silver badges79 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

python parse XML with lxml giving the right parser parameters

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related