I am parsing XML data with lxml in Python.

The data looks like this:

string='''<?xml version="1.0" encoding="UTF-8"?>/n
    <div type="request" xml:base="/k-api/7728" xml:lang="en" >
    <div n="" type="request" xml:id="_54f59d0003">
        <p xml:id="_54f59d0004"/>
        <p xml:id="_54f59d0005">Requests </p>
    </div>
    <div n="0001" type="request" xml:id="_54f59d0006">
        <p xml:id="_54f59d0007">1.  First request.
        </p>
    </div>
    <div n="0002" type="claim" xml:id="_54f59d0008">
         <p xml:id="_54f59d0009">2. Second request.
         </p>
    </div>
    <div n="0003" type="request" xml:id="_54f59d0010">
         <p xml:id="_54f59d0011">3. Thrid requests.
         </p>
    </div>
    <div n="0004" type="request" xml:id="_54f59d0012">
        <p xml:id="_54f59d0013">4. request.
        </p>
    </div>
</div>'''


import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
XML_tree = etree.fromstring(xml_string,parser=parser)

This does not work, for several reasons. a) The line break \n: I can solve that with

xml_string = ''.join(string.splitlines())

but I am wondering whether there is a way to tell the parser that lxml should ignore line breaks. b) The UTF-8 declaration on the first line of the string. I can also take care of it with:

xml_string = xml_string.replace('<?xml version="1.0" encoding="UTF-8"?>','')

before parsing, but is there a way to do it all inside the lxml parser, i.e. to tell the parser to remove line breaks and to ignore the encoding declaration? (Note: encoding="UTF-8" or encoding=None does not solve the problem.)

Thanks

EDIT 1: The error that I get when I do not remove the encoding declaration is: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

  • Your code works fine for me, using triple quotes around string and XML_tree = etree.fromstring(string.encode('utf-8'), parser=parser) Commented Mar 23, 2021 at 11:27
  • Have a look at stackoverflow.com/questions/28534460/… re the encoding. Commented Mar 23, 2021 at 11:46

1 Answer

lxml's etree.fromstring() needs the XML input as bytes rather than str when the string includes an XML declaration with an encoding.

Alternatively, you can use the standard library's ElementTree.fromstring() function, which accepts a str that includes the declaration.

import xml.etree.ElementTree as ET
from lxml import etree

xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div...>
</div>'''

parser = etree.XMLParser(encoding="UTF-8", resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

# Option 1
root = etree.fromstring(xml_string.encode('utf-8'), parser)

# Option 2
root = ET.fromstring(xml_string, parser)

# do something with the parsed XML

pretty_xml = etree.tostring(root, pretty_print=True, encoding=str)
print(pretty_xml)
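
To make the str versus bytes difference concrete, here is a minimal sketch. The shortened sample document and the variable names are illustrative assumptions, not the asker's full data. It also suggests that line breaks need no preprocessing, and that lxml's remove_blank_text parser option only drops whitespace-only text between elements; it does not alter line breaks inside text content.

```python
from lxml import etree

# Hypothetical sample, shortened from the data in the question.
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<div type="request" xml:lang="en">
    <div n="0001" type="request" xml:id="_54f59d0006">
        <p xml:id="_54f59d0007">1.  First request.
        </p>
    </div>
</div>'''

# A str that carries an encoding declaration is rejected by lxml.
try:
    etree.fromstring(xml_string)
    str_input_failed = False
except ValueError:
    str_input_failed = True

# The same text as bytes parses; the line breaks are left in place.
root = etree.fromstring(xml_string.encode("utf-8"))
first_p = root.find(".//p")

# remove_blank_text=True discards whitespace-only text between elements.
# Text that contains other characters, such as the <p> content, is kept.
compact_parser = etree.XMLParser(remove_blank_text=True)
compact_root = etree.fromstring(xml_string.encode("utf-8"), compact_parser)

print(str_input_failed)
print(repr(first_p.text))
print(repr(compact_root.text))
```

If the line breaks inside the text are unwanted, it is probably simpler to normalize them after parsing (for example with " ".join(first_p.text.split())) than to edit the raw string beforehand.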