Hi guys, I have a problem parsing an XML file and entering the data into SQLite. The format is shown below; I need to store the characters that appear before each closing TOKEN tag, such as 111, AAA, BBB, etc.

<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>

<BLOCK id="p1_b1">

<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>

<BLOCK id="p1_b3">

<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"     italic="yes">AAA</TOKEN>
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
<TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
</TEXT>
</BLOCK>

<BLOCK id="p1_b4">

<TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
<TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
<TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
</TEXT>

<TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
<TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
</TEXT>

<TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
<TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
<TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>

In .NET this is done with three foreach loops: (1) over "DOCUMENT/PAGE/BLOCK", (2) over "TEXT", and (3) over "TOKEN", and then each value is entered into the DB. I don't get how to do it in Python; I am trying it with the lxml module.

  • Do you mean you need to get all the token values? Like ['111', 'BBB', 'EEE'] or [['111'], ['BBB', 'EEE']]? Commented Jan 9, 2011 at 10:40

1 Answer

Do you mean this?

>>> xml = """<DOCUMENT>
... <PAGE width="544.252" height="634.961" number="1" id="p1">
... <MEDIABOX x1="0" y1="0" x2="544.252" y2="634.961"/>
... 
... <BLOCK id="p1_b1">
... 
... <TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
... <TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
... </TEXT>
... </BLOCK>
... 
... <BLOCK id="p1_b3">
... 
... <TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
... <TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes"     italic="yes">AAA</TOKEN>
... <TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
... <TOKEN sid="p1_s37" id="p1_w24" font-name="verdanae" bold="yes" italic="no">CCC</TOKEN>
... </TEXT>
... </BLOCK>
... 
... <BLOCK id="p1_b4">
... 
... <TEXT width="82.72" height="26" id="p1_t7" x="55.426" y="138.026">
... <TOKEN sid="p1_s42" id="p1_w29" font-name="verdanae" bold="yes" italic="no">DDD</TOKEN>
... <TOKEN sid="p1_s43" id="p1_w30" font-name="verdanae" bold="yes" italic="no">EEE</TOKEN>
... </TEXT>
... 
... <TEXT width="101.74" height="26" id="p1_t8" x="55.406" y="162.026">
... <TOKEN sid="p1_s45" id="p1_w31" font-name="verdanae" bold="yes" italic="no">FFF</TOKEN>
... </TEXT>
... 
... <TEXT width="152.96" height="26" id="p1_t9" x="55.406" y="186.026">
... <TOKEN sid="p1_s47" id="p1_w32" font-name="verdanae" bold="yes" italic="no">GGG</TOKEN>
... <TOKEN sid="p1_s48" id="p1_w33" font-name="verdanae" bold="yes" italic="no">HHH</TOKEN>
... </TEXT>
... </BLOCK>
... </PAGE>
... </DOCUMENT>"""
>>> from lxml import etree
>>> parsed = etree.fromstring(xml)
>>> tokens = parsed.xpath('//TOKEN/text()')
>>> tokens
['111', 'AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH']
>>> 

Or this?

>>> parsed = etree.fromstring(xml)
>>> for block in parsed.xpath('//PAGE/BLOCK/TEXT'):
...     print(block.xpath('./TOKEN/text()'))
... 
['111']
['AAA', 'BBB', 'CCC']
['DDD', 'EEE']
['FFF']
['GGG', 'HHH']
>>> 
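
If the goal is to mirror the three .NET foreach loops and store the result in SQLite, a minimal sketch could look like the following. The table name `tokens`, its columns, the helper `load_tokens`, and the in-memory database are all assumptions made for illustration; the question does not say what the real schema is. The loops use `findall`, which lxml shares with the standard library's ElementTree, so the sketch falls back to the stdlib parser if lxml is not installed (also an assumption, for portability). The XML is a shortened copy of the question's document.

```python
import sqlite3

try:
    from lxml import etree  # the module used in the question
except ImportError:  # assumption: fall back to the stdlib parser if lxml is absent
    import xml.etree.ElementTree as etree

# Shortened version of the question's document.
XML = """<DOCUMENT>
<PAGE width="544.252" height="634.961" number="1" id="p1">
<BLOCK id="p1_b1">
<TEXT width="37.7" height="74.124" id="p1_t1" x="51.1" y="20.8652">
<TOKEN sid="p1_s11" id="p1_w1" font-name="Verdanae" bold="yes" italic="no">111</TOKEN>
</TEXT>
</BLOCK>
<BLOCK id="p1_b3">
<TEXT width="151.267" height="10.725" id="p1_t6" x="24.099" y="572.096">
<TOKEN sid="p1_s35" id="p1_w22" font-name="Verdanae" bold="yes" italic="yes">AAA</TOKEN>
<TOKEN sid="p1_s36" id="p1_w23" font-name="verdanae" bold="yes" italic="no">BBB</TOKEN>
</TEXT>
</BLOCK>
</PAGE>
</DOCUMENT>"""


def load_tokens(xml_text, conn):
    """Insert one row per TOKEN, keeping the ids of its BLOCK and TEXT parents.

    Hypothetical helper; the schema below is illustrative only.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens "
        "(block_id TEXT, text_id TEXT, token_id TEXT, value TEXT)"
    )
    root = etree.fromstring(xml_text)  # root is the DOCUMENT element
    rows = []
    for block in root.findall("PAGE/BLOCK"):      # loop 1: DOCUMENT/PAGE/BLOCK
        for text in block.findall("TEXT"):        # loop 2: TEXT
            for token in text.findall("TOKEN"):   # loop 3: TOKEN
                rows.append(
                    (block.get("id"), text.get("id"), token.get("id"), token.text)
                )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_tokens(XML, conn)
values = [row[0] for row in conn.execute("SELECT value FROM tokens ORDER BY rowid")]
```

Collecting the rows first and writing them with a single `executemany` keeps the parsing and the database write separate; inserting inside the innermost loop would work too.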

2 Comments

I tried the same method but got an empty list, because the . was not added to "/TOKEN/text()". Why do you add the dot, and what does it do? Anyway, thanks a lot, dude.
The dot means a path relative to 'here', where 'here' is the current DOCUMENT/PAGE/BLOCK/TEXT element; a path that starts with '/' and no dot starts from the document root. Of course, you can remove the './' part (leaving 'TOKEN/text()') and you'll get the same result ;) Google XPath for more of its powerful syntax, and please mark my answer as accepted if it's what you meant :)
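
To make the difference concrete, here is a small sketch that evaluates several variants of the expression from the same TEXT element. It assumes lxml is installed and uses a trimmed copy of the question's document with most attributes dropped; the variable names are only for illustration.

```python
from lxml import etree

# Trimmed copy of the question's document (most attributes dropped).
XML = """<DOCUMENT><PAGE id="p1">
<BLOCK id="p1_b1"><TEXT id="p1_t1"><TOKEN id="p1_w1">111</TOKEN></TEXT></BLOCK>
<BLOCK id="p1_b3"><TEXT id="p1_t6"><TOKEN id="p1_w22">AAA</TOKEN><TOKEN id="p1_w23">BBB</TOKEN></TEXT></BLOCK>
</PAGE></DOCUMENT>"""

root = etree.fromstring(XML)
first_text = root.xpath("//PAGE/BLOCK/TEXT")[0]  # the TEXT element holding '111'

# Relative paths: evaluated from first_text itself.
relative_dot = first_text.xpath("./TOKEN/text()")   # ['111']
relative_bare = first_text.xpath("TOKEN/text()")    # ['111'], same as with './'

# Absolute path: starts at the document root, whose root element is
# DOCUMENT rather than TOKEN, so nothing matches.
absolute = first_text.xpath("/TOKEN/text()")        # []

# '//' without a dot searches the whole document, with a dot only below first_text.
whole_doc = first_text.xpath("//TOKEN/text()")      # ['111', 'AAA', 'BBB']
descendants = first_text.xpath(".//TOKEN/text()")   # ['111']
```

The empty list in the comment above is the `absolute` case: the leading '/' makes the expression ignore the element it is called on.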
