I've been trying to get data from a specific XML file into a SQLite3 database using ElementTree. The XML has the following structure:
<?xml version="1.0" encoding="UTF-8" ?>
<chat xmlns="http://test.org/net/1.3">
<event sender="Frank" time="2016-02-03T22:58:19+01:00" />
<message sender="Karen" time="2016-02-03T22:58:19+01:00">
<div>
<span>Hello Frank</span>
</div>
</message>
<message sender="Frank" time="2016-02-03T22:58:39+01:00">
<div>
<span>Hi there Karen</span>
</div>
<div>
<span>I'm back from New York</span>
</div>
</message>
<message sender="Karen" time="2016-02-03T22:58:56+01:00">
<div>
<span>How are you doing?</span>
<span>Everything OK?</span>
</div>
</message>
</chat>
For each message or event I create a record in the database with the following columns: sender, time, message. The following code is used to process the XML:
import xml.etree.ElementTree as ET
import sqlite3 as lite
con = None
con = lite.connect('dbtest.db')
cur = con.cursor()
xmlfile = 'test.xml'
tree = ET.parse(xmlfile)
root = tree.getroot()
for m in root.findall('./*'):
    msg = m.find('.')
    msg.tag = 'div'
    sender = str(m.get('sender'))
    time = m.get('time')
    message = str(ET.tostring(msg))
    print('Sender: ' + sender)
    print('Time: ' + time)
    print('HTML: ' + message)
    print()
    query = ("INSERT INTO chat('time', 'sender', 'message') VALUES(?,?,?)")
    values = (time, sender, message)
    with con:
        cur = con.cursor()
        cur.execute(query, values)
if con:
    con.close()
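The code above assumes that a chat table already exists in the database. For completeness, here is a minimal sketch of a matching table. The TEXT column types and the in-memory database are assumptions made for illustration only; the real script uses dbtest.db.

```python
import sqlite3

# Assumption: an in-memory database for illustration; the script above uses 'dbtest.db'.
con = sqlite3.connect(":memory:")

# Assumption: all three columns are plain TEXT.
con.execute("CREATE TABLE IF NOT EXISTS chat (time TEXT, sender TEXT, message TEXT)")

# Using the connection as a context manager commits on success.
with con:
    con.execute(
        "INSERT INTO chat (time, sender, message) VALUES (?, ?, ?)",
        ("2016-02-03T22:58:19+01:00", "Karen", "<div><span>Hello Frank</span></div>"),
    )

rows = con.execute("SELECT sender, time, message FROM chat").fetchall()
con.close()
```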
This results in several problems.
First of all I don't get the result I want. The "message" should be what's inside the message tag, not including the enclosing message tag, now renamed to div. This is what I should get:
<div>
<span>Hi there Karen</span>
</div>
<div>
<span>I'm back from New York</span>
</div>
Or maybe this:
<div><span>Hi there Karen</span></div><div><span>I'm back from New York</span></div>
Instead I get this:
b'<div xmlns:ns0="http://test.org/net/1.3" sender="Karen" time="2016-02-03T22:58:19+01:00">\n\t\t<ns0:div>\n\t\t\t<ns0:span>Hello Frank</ns0:span>\n\t\t</ns0:div>\n\t</div>\n\t'
So I've been trying to "fix" this by stripping the leading b' and so on, but I hope there is a better method. Removing the leading b' works, but I can't get rid of the \t and \n using string replace.
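As far as I can tell, the b' prefix and the literal \n and \t come from calling str() on a bytes object, which yields its printable representation rather than decoded text. A small sketch of the difference, using a made-up element instead of my real file:

```python
import xml.etree.ElementTree as ET

# Made-up element containing a real newline and tab, as in the file.
elem = ET.fromstring("<div>\n\t<span>Hello Frank</span>\n</div>")

as_bytes = ET.tostring(elem)                     # bytes object
as_repr = str(as_bytes)                          # repr of the bytes: starts with b' and shows backslash escapes
as_text = ET.tostring(elem, encoding="unicode")  # str containing the actual newline and tab characters

print(as_repr)
print(as_text)
```

In as_repr the \n is a backslash followed by the letter n (two characters), which would explain why replacing the newline character itself has no effect.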
Question
How can I get proper XML data into the table without all those escape characters?
Also, the result starts with b' and ends with '. I don't see the use of this when I put it into the database; it's plain text in the database AFAIK.
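For reference, this is a rough sketch of the kind of result I'm after, not a confirmed solution. The names NS and inner_xml are my own (hypothetical), the namespace URI is the one from the sample file, and note that the function modifies the parsed tree in place:

```python
import xml.etree.ElementTree as ET

NS = "{http://test.org/net/1.3}"  # namespace from the sample XML, in ElementTree's {uri} form


def inner_xml(element):
    """Serialize the children of element without namespace prefixes or indentation."""
    parts = []
    for child in element:
        for node in child.iter():
            # Drop the namespace part of the tag so no ns0: prefix is written.
            if node.tag.startswith(NS):
                node.tag = node.tag[len(NS):]
            # Drop whitespace-only text and tails (the \n and \t indentation).
            if node.text is not None and not node.text.strip():
                node.text = None
            if node.tail is not None and not node.tail.strip():
                node.tail = None
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


sample = """<chat xmlns="http://test.org/net/1.3">
<message sender="Frank" time="2016-02-03T22:58:39+01:00">
<div>
<span>Hi there Karen</span>
</div>
<div>
<span>I'm back from New York</span>
</div>
</message>
</chat>"""

root = ET.fromstring(sample)
result = inner_xml(root.find(NS + "message"))
print(result)
```

An element without children, such as the event element, would give an empty string here.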