XML parsing from URL with Python

Question

I am trying to parse XML from a URL.

Originally, my code looked like this:

from xml.dom import minidom                                          
xmldoc = minidom.parse('all.xml')  

Sensor0Elm = xmldoc.getElementsByTagName('t0')
Sensor1Elm = xmldoc.getElementsByTagName('t1')
Sensor2Elm = xmldoc.getElementsByTagName('t2')

Sensor0Elm = Sensor0Elm[0]
Sensor1Elm = Sensor1Elm[0]
Sensor2Elm = Sensor2Elm[0]

Sensor0 = Sensor0Elm.childNodes[0].data
Sensor1 = Sensor1Elm.childNodes[0].data
Sensor2 = Sensor2Elm.childNodes[0].data

Sensor0 = float(Sensor0)
Sensor1 = float(Sensor1)
Sensor2 = float(Sensor2)

In this case the XML I wanted to parse was on my local hard drive, and it worked perfectly.

The next step was to parse XML from a URL. A sensor meter from Allnet continuously publishes XML data on the network, and it is accessible in a browser at the following URL: 192.168.60.242/xml

This is the embedded XML:

<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.27</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.53</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>03:27:49</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18844128</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>

So I changed the code into this:

import urllib
import time

while True:

    ### XML Extraction ###
    from xml.dom import minidom

    allxml = urllib.urlopen("http://192.168.60.242/xml")
    allxml_string = allxml.read()
    allxml.close()
    print allxml_string

    xmldoc = minidom.parseString(allxml_string)



    Sensor0Elm = xmldoc.getElementsByTagName('t0')
    Sensor1Elm = xmldoc.getElementsByTagName('t1')
    Sensor2Elm = xmldoc.getElementsByTagName('t2')

    Sensor0Elm = Sensor0Elm[0]
    Sensor1Elm = Sensor1Elm[0]
    Sensor2Elm = Sensor2Elm[0]

    Sensor0 = Sensor0Elm.childNodes[0].data
    Sensor1 = Sensor1Elm.childNodes[0].data
    Sensor2 = Sensor2Elm.childNodes[0].data

    Sensor0 = float(Sensor0)
    Sensor1 = float(Sensor1)
    Sensor2 = float(Sensor2)

Unfortunately, it does not work. When I run it, the output below is returned. (As the print statement shows, the XML is read into the program correctly; the only problem seems to be the subsequent processing by the parse function.) Please look at the error message at the bottom.

    Python 2.7.3 (default, Mar 18 2014, 05:13:23)
    [GCC 4.6.3] on linux2
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>>
    <HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
    <devicename>ALL4000</devicename>
    <n0>0</n0><t0> 1.09</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
    <n1>1</n1><t1> 2.52</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
    <n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
    <n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
    <n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
    <n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
    <n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
    <n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
    <n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
    <n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
    <n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
    <n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
    <n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
    <n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
    <n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
    <n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
    <fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
    <fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
    <fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
    <fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
    <fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
    <fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
    <fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
    <fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
    <fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
    <fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
    <fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
    <fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
    <fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
    <fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
    <fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
    <fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
    <rn0>0</rn0><rt0>0</rt0>
    <rn1>1</rn1><rt1>0</rt1>
    <rn2>2</rn2><rt2>0</rt2>
    <rn3>3</rn3><rt3>0</rt3>
    <it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
    <date>06.08.2006</date><time>06:45:46</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
    <sys>18856004</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
    <sensorx>5</sensorx><sensory>3</sensory>
    </data></xml>
    </TEXTAREA></FORM></BODY></HTML>

    Traceback (most recent call last):
      File "/home/pi/Desktop/sig_v3.py", line 14, in <module>
        xmldoc = minidom.parseString(allxml_string)
      File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString
        return expatbuilder.parseString(string)
      File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
        return builder.parseString(string)
      File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
        parser.Parse(string, True)
    ExpatError: mismatched tag: line 1, column 86

I hope someone can help me out on this.

Thanks

See how the <meta> tag isn't closed? That's your mismatched tag. If that URL always gives you malformed XML, you'll have to edit it (manually add the text </meta> in the right place) before it will parse.
– rmunn, Commented Nov 6, 2015 at 6:57
... or change the meta phrase to look like "<meta .... />". I suspect this is what the XML author intended.
– Ira Baxter, Commented Nov 6, 2015 at 7:31
Any suggestions on how to accomplish that? It's my third day with Python. Some code would be nice.
– Dr. Brackish Okun, Commented Nov 6, 2015 at 7:40

Joe Young · Accepted Answer · 2015-11-05 18:07:32Z

1

Instead of

xmldoc = minidom.parse(allxml_string)

use

xmldoc = minidom.parseString(allxml_string)

The parse() function can take either a filename or an open file object.

If you have XML in a string, you can use the parseString() function instead:

Source: https://docs.python.org/2/library/xml.dom.minidom.html
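A minimal sketch of the difference between the two functions, written for Python 3 (the question uses Python 2.7). The sample document, the temporary file, and the helper name first_text are made up for illustration and are not part of the original code:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample data, not the real sensor output.
xml_text = "<data><t0> 1.27</t0></data>"

# parseString() takes the XML text itself.
doc_from_string = minidom.parseString(xml_text)

# parse() takes a filename (or an open file object),
# so the text is written to a temporary file first.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(xml_text)
    path = handle.name
try:
    doc_from_file = minidom.parse(path)
finally:
    os.remove(path)


def first_text(doc, tag):
    """Return the text of the first <tag> element in doc (illustrative helper)."""
    return doc.getElementsByTagName(tag)[0].childNodes[0].data


value_from_string = float(first_text(doc_from_string, "t0"))
value_from_file = float(first_text(doc_from_file, "t0"))
print(value_from_string, value_from_file)
```

Passing the XML text to parse() would make minidom treat the whole string as a filename, which is why the string variant is needed for data read from a URL.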


1 Comment

Dr. Brackish Okun Over a year ago

Thank you, Mr. Young. The error messages are getting fewer. I have edited the question above.

rmunn · Accepted Answer · 2015-11-06 08:43:46Z

After you've read the XML into xml_string, you'll want to do something to the received text to close that unclosed <meta> tag you're seeing. While it's normally not a good idea to try parsing HTML with regular expressions, in this case a regular expression is probably your simplest solution. At the top of your code, add import re somewhere, then do the following:

fixed_xml_string = re.sub("<meta(.*?)>", "<meta\\1></meta>", xml_string)

Then try parsing fixed_xml_string.
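As a rough illustration of that substitution (Python 3 here, while the question uses 2.7), the sketch below applies it to a shortened, made-up sample that contains only the unclosed <meta> problem. The real page also has unquoted attributes such as COLS=132, which a strict parser would most likely reject as well, so this step alone may not be enough for the actual device output.

```python
import re
from xml.dom import minidom

# Shortened, hypothetical sample: only the <meta> tag is malformed.
xml_string = (
    '<HTML><HEAD><meta http-equiv="content-type" content="text/html">'
    '</HEAD><BODY><data><t0> 1.27</t0></data></BODY></HTML>'
)

# Append a closing </meta> after every opening <meta ...> tag.
fixed_xml_string = re.sub(r"<meta(.*?)>", r"<meta\1></meta>", xml_string)

# The repaired text is now well-formed, so the strict parser accepts it.
xmldoc = minidom.parseString(fixed_xml_string)
sensor0 = float(xmldoc.getElementsByTagName("t0")[0].childNodes[0].data)
print(sensor0)
```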

If that works, great. If it doesn't work and there are more errors, then instead of fixing them one at a time, you'll be better off using a "generous" XML parser like BeautifulSoup instead of minidom. BeautifulSoup tries very hard not to give you errors when it encounters bad XML (or HTML), and instead figure out what the XML's author intended and give you that. It might guess wrong, which is why using a "strict" parser is better if you can -- but if you're dealing with malformed XML and just want to get your work done rather than fix someone else's broken code, that's exactly what BeautifulSoup was written for.

I won't spell out how to use BeautifulSoup, since its own documentation is pretty good. But if you use it and run into trouble, come back to StackOverflow and ask a second question.

Dr. Brackish Okun · Accepted Answer · 2015-11-06 11:57:02Z

-1

I managed to solve the issue by extracting the "xml only" out of the html/xml mixture.

re_xml = raw_xml[137:2938]

thanks for your input
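A slightly more general variant of the same idea is to locate the <xml> and </xml> markers instead of relying on fixed offsets, which would shift whenever the length of the response changes. The following is only a sketch (Python 3, a shortened made-up sample, and a hypothetical helper named extract_xml), assuming the page contains exactly one <xml>...</xml> block:

```python
from xml.dom import minidom


def extract_xml(raw_page, start_marker="<xml>", end_marker="</xml>"):
    """Return the text from start_marker through end_marker (hypothetical helper)."""
    start = raw_page.find(start_marker)
    end = raw_page.find(end_marker, start)
    if start == -1 or end == -1:
        raise ValueError("no <xml>...</xml> block found")
    return raw_page[start:end + len(end_marker)]


# Shortened, made-up stand-in for the page returned by the device.
raw_xml = (
    '<HTML><HEAD><meta http-equiv="content-type" content="text/html">'
    '</HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>\n'
    '<t0> 1.27</t0><t1> 2.53</t1><t2> 2.45</t2>\n'
    '</data></xml>\n</TEXTAREA></FORM></BODY></HTML>'
)

# Keep only the XML part and hand that to the strict parser.
re_xml = extract_xml(raw_xml)
xmldoc = minidom.parseString(re_xml)
sensors = [
    float(xmldoc.getElementsByTagName(tag)[0].childNodes[0].data)
    for tag in ("t0", "t1", "t2")
]
print(sensors)
```

Because the surrounding HTML is discarded before parsing, neither the unclosed <meta> tag nor the unquoted TEXTAREA attributes reach the parser.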


1 Comment

Ira Baxter Over a year ago

That may work in exactly your one circumstance, but that means it is utterly useless as a general or good answer. @rmunn's answer is far better.
