Parsing Data in XML and Storing to DB in Python

Question

Hi guys, I have a problem parsing an XML file and inserting the data into SQLite. The format is shown below; I need to store the text inside each TOKEN element, i.e. 111, AAA, BBB, etc.

<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>

<BLOCK id="p1_b1">

<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>

<BLOCK id="p1_b3">

<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"     italic="yes">AAA</TOKEN>
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
<TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
</TEXT>
</BLOCK>

<BLOCK id="p1_b4">

<TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
<TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
<TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
</TEXT>

<TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
<TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
</TEXT>

<TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
<TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
<TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>

In .NET this is done with three nested foreach loops: (1) over "DOCUMENT/PAGE/BLOCK", (2) over "TEXT", and (3) over "TOKEN", and then each value is inserted into the DB. I don't get how to do it in Python; I am trying it with the lxml module.

Do you mean you need to get all the token values? As a flat list like ['111', 'BBB', 'EEE'], or grouped like [['111'], ['BBB', 'EEE']]?
– virhilo, Commented Jan 9, 2011 at 10:40

virhilo · Accepted Answer · 2011-01-09 10:57:09Z

Do you mean this?

>>> xml = """<DOCUMENT>
... <PAGE width="544.252" height="634.961" number="1" id="p1">
... <MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>
... 
... <BLOCK id="p1_b1">
... 
... <TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
... <TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
... </TEXT>
... </BLOCK>
... 
... <BLOCK id="p1_b3">
... 
... <TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
... <TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"     italic="yes">AAA</TOKEN>
... <TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
... <TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
... </TEXT>
... </BLOCK>
... 
... <BLOCK id="p1_b4">
... 
... <TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
... <TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
... <TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
... </TEXT>
... 
... <TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
... <TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
... </TEXT>
... 
... <TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
... <TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
... <TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
... </TEXT>
... </BLOCK>
... </PAGE>
... </DOCUMENT>"""
>>> from lxml import etree
>>> parsed = etree.fromstring(xml)
>>> tokens = parsed.xpath('//TOKEN/text()')
>>> tokens
['111', 'AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH']
>>>

Or this?

>>> parsed = etree.fromstring(xml)
>>> for text in parsed.xpath('//PAGE/BLOCK/TEXT'):
...     print(text.xpath('./TOKEN/text()'))
... 
['111']
['AAA', 'BBB', 'CCC']
['DDD', 'EEE']
['FFF']
['GGG', 'HHH']
>>>
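
To get from the grouped lists to rows in SQLite, the three nested loops from the question can be written almost literally. This is only a minimal sketch: the table name `tokens`, its columns, the helper name `store_tokens`, and the shortened sample XML are assumptions made for illustration, not part of the original question. It uses `findall()` instead of `xpath()` so that the same code also runs on the standard library's `xml.etree.ElementTree` if lxml is not installed.

```python
import sqlite3

try:
    from lxml import etree  # the module used in the question
except ImportError:
    # Fallback: the standard library exposes the same findall()/get() API.
    import xml.etree.ElementTree as etree

# Shortened version of the question's document (attributes trimmed).
XML = """<DOCUMENT>
<PAGE number="1" id="p1">
<BLOCK id="p1_b1">
<TEXT id="p1_t1">
<TOKEN id="p1_w1">111</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT id="p1_t6">
<TOKEN id="p1_w22">AAA</TOKEN>
<TOKEN id="p1_w23">BBB</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""


def store_tokens(xml_text, conn):
    """Insert one row per TOKEN, keeping the ids of its BLOCK and TEXT parents."""
    # Hypothetical schema; adjust the columns to whatever you need to keep.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "block_id TEXT, text_id TEXT, token_id TEXT, value TEXT)"
    )
    root = etree.fromstring(xml_text)  # root is the DOCUMENT element
    rows = []
    for block in root.findall("PAGE/BLOCK"):        # loop 1: blocks
        for text in block.findall("TEXT"):          # loop 2: text lines
            for token in text.findall("TOKEN"):     # loop 3: tokens
                rows.append(
                    (block.get("id"), text.get("id"), token.get("id"), token.text)
                )
    # Parameterized insert: values are never formatted into the SQL string.
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
inserted = store_tokens(XML, conn)
values = [row[0] for row in conn.execute("SELECT value FROM tokens ORDER BY rowid")]
print(inserted, values)
```

Collecting the rows first and calling `executemany()` once keeps the parsing and the database write separate; inserting row by row inside the innermost loop would work as well.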

edited Jan 9, 2011 at 10:57

answered Jan 9, 2011 at 10:50


2 Comments

Rakesh Over a year ago

I tried the same method but got an empty list, because I had not put the "." in front of "/TOKEN/text()". Why do you add the dot, and what does it do? Anyway, thanks a lot, dude.

virhilo Over a year ago

The dot means a path relative to 'here', where 'here' is the current DOCUMENT/PAGE/BLOCK/TEXT element. A path that starts with '/' and has no dot starts from the document root. Of course, you can drop the whole './' prefix (i.e. write 'TOKEN/text()') and you'll get the same result ;) Google XPath for more of its powerful syntax, and please mark my answer as accepted if it's what you meant :)
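
The difference described in this comment can be checked directly. A small sketch, assuming lxml is installed; the XML is a trimmed-down version of the question's last block and the variable names are made up for illustration:

```python
from lxml import etree

XML = """<DOCUMENT><PAGE id="p1"><BLOCK id="p1_b4">
<TEXT id="p1_t7"><TOKEN>DDD</TOKEN><TOKEN>EEE</TOKEN></TEXT>
<TEXT id="p1_t8"><TOKEN>FFF</TOKEN></TEXT>
</BLOCK></PAGE></DOCUMENT>"""

root = etree.fromstring(XML)
first_text = root.xpath("//PAGE/BLOCK/TEXT")[0]  # the TEXT with id p1_t7

# Relative paths: evaluated from the TEXT element itself.
relative_dot = first_text.xpath("./TOKEN/text()")
relative_bare = first_text.xpath("TOKEN/text()")

# Absolute path: starts at the document root, whose top element is
# DOCUMENT rather than TOKEN, so nothing matches.
absolute = first_text.xpath("/TOKEN/text()")

# '//' also starts at the document root, so it finds the tokens of
# every TEXT element, not just the one it was called on.
descendants = first_text.xpath("//TOKEN/text()")

print(relative_dot)   # ['DDD', 'EEE']
print(relative_bare)  # ['DDD', 'EEE']
print(absolute)       # []
print(descendants)    # ['DDD', 'EEE', 'FFF']
```

So the empty list comes from the leading '/', not from the missing dot as such: './TOKEN/text()' and 'TOKEN/text()' are both relative, while '/TOKEN/text()' is absolute.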
