I've been trying to get data from a specific XML file into a SQLite3 database using ElementTree. The XML has the following structure:

<?xml version="1.0" encoding="UTF-8" ?>
<chat xmlns="http://test.org/net/1.3">
    <event sender="Frank" time="2016-02-03T22:58:19+01:00" />
    <message sender="Karen" time="2016-02-03T22:58:19+01:00">
        <div>
            <span>Hello Frank</span>
        </div>
    </message>
    <message sender="Frank" time="2016-02-03T22:58:39+01:00">
        <div>
            <span>Hi there Karen</span>
        </div>
        <div>
            <span>I'm back from New York</span>
        </div>
    </message>
    <message sender="Karen" time="2016-02-03T22:58:56+01:00">
        <div>
            <span>How are you doing?</span>
            <span>Everything OK?</span>
        </div>
    </message>
</chat>

For each message or event I create a record in the database with the following columns: sender, time, message. The following code is used to process the XML:

import xml.etree.ElementTree as ET
import sqlite3 as lite

con = None
con = lite.connect('dbtest.db')
cur = con.cursor()

xmlfile = 'test.xml'

tree = ET.parse(xmlfile)
root = tree.getroot()

for m in root.findall('./*'):
    msg = m.find('.')
    msg.tag = 'div'

    sender = str(m.get('sender'))
    time = m.get('time')
    message = str(ET.tostring(msg))

    print('Sender: ' + sender)
    print('Time: ' + time)
    print('HTML: ' + message)
    print()

    query = ("INSERT INTO chat('time', 'sender', 'message') VALUES(?,?,?)")
    values = (time, sender, message)

    with con:
        cur = con.cursor()
        cur.execute(query, values)

if con:
    con.close()

This results in several problems.

First of all I don't get the result I want. The "message" should be what's inside the message tag, not including the enclosing message tag, now renamed to div. This is what I should get:

<div>
    <span>Hi there Karen</span>
</div>
<div>
    <span>I'm back from New York</span>
</div>

Or maybe this:

<div><span>Hi there Karen</span></div><div><span>I'm back from New York</span></div>

Instead I get this:

b'<div xmlns:ns0="http://test.org/net/1.3" sender="Karen" time="2016-02-03T22:58:19+01:00">\n\t\t<ns0:div>\n\t\t\t<ns0:span>Hello Frank</ns0:span>\n\t\t</ns0:div>\n\t</div>\n\t'

So I've been trying to "fix" this by stripping the b' prefix and so on, but I hope there is a better method. Stripping the leading b' works, but I can't get rid of the \t and \n using string replace.

Question

How can I get proper XML data into the table without all those escape characters?

  • What do you mean by remove the b? Decoding the bytestring into a normal python string? Commented Sep 20, 2018 at 21:14
  • Well maybe I'm misunderstanding things. The string starts with b' and ends with '. When I put this into the database, I don't see the use of this. It's plain text in the database AFAIK. Commented Sep 20, 2018 at 21:28
  • Yeah, you're not understanding something. I'll put up an answer with some explanation. Commented Sep 20, 2018 at 21:49

1 Answer

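ElementTree.tostring() returns a bytes object by default, not a string. Your code then wraps that result in str(), which gives you the bytes object's printable representation: the b'...' wrapper, with newlines and tabs written out as the two-character sequences \n and \t. That literal text is what ends up in the database, and it is probably why your string replace never matches: the string contains a backslash followed by a letter, not a real newline or tab. I didn't look into it, but I suspect that the sqlite binding inserts bytes objects as BLOB and strings as TEXT values, and that byte by byte they end up being identical.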
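To make the difference concrete, here is a minimal sketch comparing the three values involved. The one-element document, the in-memory database, and the untyped `demo` table are assumptions made up for this illustration; they are not from the question.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in element (assumption: not the real chat file).
elem = ET.fromstring("<div>\n\t<span>Hello Frank</span>\n</div>")

as_bytes = ET.tostring(elem)                     # bytes, the default return type
as_repr = str(as_bytes)                          # repr text: starts with b' and spells out \n, \t
as_text = ET.tostring(elem, encoding="unicode")  # a real str with real newlines and tabs

# Store all three in an untyped column and ask SQLite what it kept.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (payload)")
con.executemany(
    "INSERT INTO demo VALUES (?)",
    [(as_bytes,), (as_repr,), (as_text,)],
)
stored_types = [
    row[0] for row in con.execute("SELECT typeof(payload) FROM demo ORDER BY rowid")
]
con.close()

print(as_repr)
print(as_text)
print(stored_types)
```

The second value is what the code in the question inserts. Decoding the bytes (or asking tostring() for a str directly) avoids the wrapper entirely, so there is nothing to strip afterwards.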

Anyways, to print out the xml in a more human readable form like you want:

import xml.etree.ElementTree as ET

rawxml='''<?xml version="1.0" encoding="UTF-8" ?>
<chat xmlns="http://test.org/net/1.3">
    <event sender="Frank" time="2016-02-03T22:58:19+01:00" />
    <message sender="Karen" time="2016-02-03T22:58:19+01:00">
        <div>
            <span>Hello Frank</span>
        </div>
    </message>
    <message sender="Frank" time="2016-02-03T22:58:39+01:00">
        <div>
            <span>Hi there Karen</span>
        </div>
        <div>
            <span>I'm back from New York</span>
        </div>
    </message>
    <message sender="Karen" time="2016-02-03T22:58:56+01:00">
        <div>
            <span>How are you doing?</span>
            <span>Everything OK?</span>
        </div>
    </message>
</chat>'''

ns={'msg' : "http://test.org/net/1.3"}
xml = ET.fromstring(rawxml)

for msg in xml.findall("msg:message", ns):
    print("Sender: " + msg.get("sender"))
    print("Time: " + msg.get("time"))
    body=""
    for d in msg.findall("msg:div", ns):
        body = body + ET.tostring(d, encoding="unicode")
    print("Content: " + body)

Note the encoding="unicode" argument to tostring(), which makes it return a str instead of bytes. The xmlns:ns0 attribute and the ns0: prefixes in the output are just how ElementTree serializes elements that belong to a namespace.
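If the namespace prefixes are unwanted in the stored markup, one option is to drop the namespace from the tags before serializing and then insert the resulting str. This is only a sketch under some assumptions: `strip_namespace` and `compact` are hypothetical helpers written for this example (they are not part of ElementTree), the XML is a shortened copy of the sample, and the `chat` table is created in an in-memory database purely for demonstration.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = "http://test.org/net/1.3"
rawxml = '''<chat xmlns="http://test.org/net/1.3">
    <event sender="Frank" time="2016-02-03T22:58:19+01:00" />
    <message sender="Frank" time="2016-02-03T22:58:39+01:00">
        <div>
            <span>Hi there Karen</span>
        </div>
        <div>
            <span>I'm back from New York</span>
        </div>
    </message>
</chat>'''


def strip_namespace(elem):
    """Hypothetical helper: remove the '{uri}' prefix from every tag, in place."""
    prefix = "{" + NS + "}"
    for node in elem.iter():
        if isinstance(node.tag, str) and node.tag.startswith(prefix):
            node.tag = node.tag[len(prefix):]


def compact(elem):
    """Hypothetical helper: drop whitespace-only text and tails (indentation)."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


root = ET.fromstring(rawxml)
strip_namespace(root)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chat (time TEXT, sender TEXT, message TEXT)")
for child in root:  # events and messages alike
    compact(child)
    # Serialize only the children, so the enclosing <message> tag is left out.
    body = "".join(ET.tostring(d, encoding="unicode") for d in child)
    con.execute(
        "INSERT INTO chat (time, sender, message) VALUES (?, ?, ?)",
        (child.get("time"), child.get("sender"), body),
    )
con.commit()
rows = con.execute("SELECT sender, message FROM chat ORDER BY rowid").fetchall()
con.close()

for sender, message in rows:
    print(sender, message)
```

An event has no children, so its message column ends up as an empty string. Skip the compact() call if the original indentation should be preserved in the stored markup.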

1 Comment

Thanks! Excellent answer!
