I'm getting an XML response from a website that looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>

I need to parse the table rows inside the CDATA section. I tried passing the response to lxml.html.fromstring(), but the output ignores the CDATA content. Is there a way to get everything inside the CDATA using lxml or another Python library?

1 Answer

Use BeautifulSoup. CData is a subclass of NavigableString, so CDATA sections show up among the document's strings and can be picked out with isinstance().

import bs4

data = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" class="ui-widget-content ui-datatable-odd" role="row"><td role="gridcell" style="word-break:break-all;"><span style="font-size:7pt;text-align: center;" title="XPT">08454.8100</span></td><td role="gridcell"><span style="font-size:7pt;" title="tDFvo">ARÁ</span></td><td role="gridcell"><span style="font-size:7pt;" title="PDSDo">TA15A</span></td><td role="gridcell"><span style="font-size:7pt;" title="P125ão">MINIRAL</span></td><td role="gridcell"><span style="font-size:7pt;" title="A12o">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="O4545ão">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="A45So">- </span></td><td role="gridcell"><span style="font-size:7pt;" title="ASD1vo">-</span></td><td role="gridcell"><span style="font-size:7pt;" title="D45el">18/02/2021 04:35:30</span></td><td role="gridcell"><span style="font-size:7pt;" title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
        <extension ln="primefaces" type="args">{"totalRecords":1}</extension>
    </changes>
</partial-response>"""

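# With html.parser, CDATA sections come through as bs4.CData strings
soup = bs4.BeautifulSoup(data, 'html.parser')

# find_all(string=True) yields every string node, including CData
for cd in soup.find_all(string=True):
    if isinstance(cd, bs4.CData):
        print('CData contents: %r' % cd)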

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
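
If installing a third-party package is not an option, something similar can probably be done with the standard library alone. The sketch below is an alternative I am assuming fits the question, not part of the approach above. It uses a shortened stand-in for the response (three cells instead of ten), reads the update element's text with xml.etree.ElementTree (which returns CDATA content as ordinary element text), and feeds that fragment to a small html.parser.HTMLParser subclass. CellCollector, rows, and view_state are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened stand-in for the response in the question (assumption: same shape,
# fewer cells and attributes).
data = """<?xml version='1.0' encoding='UTF-8'?>
<partial-response id="j_id1">
    <changes>
        <update id="loginForm:tabelaProcessos">
            <![CDATA[<tr data-ri="5" role="row"><td role="gridcell"><span title="XPT">08454.8100</span></td><td role="gridcell"><span title="tDFvo">ARÁ</span></td><td role="gridcell"><span title="Idto">405833357</span></td></tr>]]>
        </update>
        <update id="j_id1:javax.faces.ViewState:0">
            <![CDATA[-8530455S7417:3382887371AS10732]]>
        </update>
    </changes>
</partial-response>"""


class CellCollector(HTMLParser):
    """Hypothetical helper: collect {span title: span text} for each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._title = None  # title of the <span> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "span" and self.rows:
            self._title = dict(attrs).get("title")
            if self._title is not None:
                self.rows[-1][self._title] = ""

    def handle_data(self, data):
        if self._title is not None:
            self.rows[-1][self._title] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._title = None


# Parse the outer document as XML; bytes keep the encoding declaration valid.
root = ET.fromstring(data.encode("utf-8"))

# The CDATA payload is exposed as the element's text (plus surrounding whitespace).
table_update = root.find(".//update[@id='loginForm:tabelaProcessos']")
fragment = table_update.text.strip()

# Parse the inner HTML fragment and collect one dict per table row.
collector = CellCollector()
collector.feed(fragment)
collector.close()
rows = [{title: text.strip() for title, text in row.items()} for row in collector.rows]

# The second CDATA section is plain text and needs no further parsing.
view_state = root.find(".//update[@id='j_id1:javax.faces.ViewState:0']").text.strip()

print(rows)
print(view_state)
```

Keying each cell by its span title is just one convenient choice; if the titles are not unique in the real table, collecting the cell texts into a list per row would be safer.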
