0

I am trying to parse XML from a URL.

Originally my code looked like this:

from xml.dom import minidom                                          
xmldoc = minidom.parse('all.xml')  

Sensor0Elm = xmldoc.getElementsByTagName('t0')
Sensor1Elm = xmldoc.getElementsByTagName('t1')
Sensor2Elm = xmldoc.getElementsByTagName('t2')

Sensor0Elm = Sensor0Elm[0]
Sensor1Elm = Sensor1Elm[0]
Sensor2Elm = Sensor2Elm[0]

Sensor0 = Sensor0Elm.childNodes[0].data
Sensor1 = Sensor1Elm.childNodes[0].data
Sensor2 = Sensor2Elm.childNodes[0].data

Sensor0 = float(Sensor0)
Sensor1 = float(Sensor1)
Sensor2 = float(Sensor2)

In this case the XML I wanted to parse was on my local hard drive, and it worked perfectly!

The next step was to parse XML from a URL. An Allnet sensor meter continuously publishes XML data on the network, and it is accessible with a browser at the following URL: 192.168.60.242/xml

This is the embedded XML:

<HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
<devicename>ALL4000</devicename>
<n0>0</n0><t0> 1.27</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
<n1>1</n1><t1> 2.53</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
<n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
<n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
<n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
<n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
<n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
<n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
<n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
<n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
<n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
<n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
<n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
<n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
<n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
<n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
<fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
<fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
<fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
<fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
<fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
<fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
<fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
<fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
<fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
<fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
<fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
<fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
<fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
<fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
<fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
<fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
<rn0>0</rn0><rt0>0</rt0>
<rn1>1</rn1><rt1>0</rt1>
<rn2>2</rn2><rt2>0</rt2>
<rn3>3</rn3><rt3>0</rt3>
<it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
<date>06.08.2006</date><time>03:27:49</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
<sys>18844128</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
<sensorx>5</sensorx><sensory>3</sensory>
</data></xml>
</TEXTAREA></FORM></BODY></HTML>

So I changed the code into this:

import urllib
import time

while True:

    ### XML Extraction ###
    from xml.dom import minidom

    allxml = urllib.urlopen("http://192.168.60.242/xml")
    allxml_string = allxml.read()
    allxml.close()
    print allxml_string

    xmldoc = minidom.parseString(allxml_string)



    Sensor0Elm = xmldoc.getElementsByTagName('t0')
    Sensor1Elm = xmldoc.getElementsByTagName('t1')
    Sensor2Elm = xmldoc.getElementsByTagName('t2')

    Sensor0Elm = Sensor0Elm[0]
    Sensor1Elm = Sensor1Elm[0]
    Sensor2Elm = Sensor2Elm[0]

    Sensor0 = Sensor0Elm.childNodes[0].data
    Sensor1 = Sensor1Elm.childNodes[0].data
    Sensor2 = Sensor2Elm.childNodes[0].data

    Sensor0 = float(Sensor0)
    Sensor1 = float(Sensor1)
    Sensor2 = float(Sensor2)

Unfortunately, it does not work. When I run it, this is what gets returned. (The print statement shows that the XML is read into the program correctly; the only problem seems to be the further processing by the parse function.) Please look at the error message at the bottom.

    Python 2.7.3 (default, Mar 18 2014, 05:13:23)
    [GCC 4.6.3] on linux2
    Type "copyright", "credits" or "license()" for more information.
    >>> ================================ RESTART ================================
    >>>
    <HTML><HEAD><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></HEAD><BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>
    <devicename>ALL4000</devicename>
    <n0>0</n0><t0> 1.09</t0><min0> 0.00</min0><max0> 2.55</max0><l0>-55</l0><h0>150</h0><s0>102</s0>
    <n1>1</n1><t1> 2.52</t1><min1> 2.32</min1><max1> 10487.04</max1><l1>-55</l1><h1>150</h1><s1>102</s1>
    <n2>2</n2><t2> 2.45</t2><min2> 0.00</min2><max2> 2.55</max2><l2>-55</l2><h2>150</h2><s2>102</s2>
    <n3>3</n3><t3>-20480.00</t3><min3> 0.00</min3><max3> 5580.80</max3><l3>-55</l3><h3>150</h3><s3>0</s3>
    <n4>4</n4><t4>-20480.00</t4><min4> 40.96</min4><max4> 41943.04</max4><l4>-55</l4><h4>150</h4><s4>0</s4>
    <n5>5</n5><t5>-20480.00</t5><min5> 10.24</min5><max5> 0.08</max5><l5>-55</l5><h5>150</h5><s5>0</s5>
    <n6>6</n6><t6>-20480.00</t6><min6> 0.00</min6><max6>-20480.00</max6><l6>-55</l6><h6>150</h6><s6>0</s6>
    <n7>7</n7><t7>-20480.00</t7><min7> 0.00</min7><max7> 0.00</max7><l7>-55</l7><h7>150</h7><s7>0</s7>
    <n8>8</n8><t8>-20480.00</t8><min8> 336855.04</min8><max8> 1342177.28</max8><l8>-55</l8><h8>150</h8><s8>0</s8>
    <n9>9</n9><t9>-20480.00</t9><min9> 0.00</min9><max9> 0.00</max9><l9>-55</l9><h9>150</h9><s9>0</s9>
    <n10>10</n10><t10>-20480.00</t10><min10> 0.00</min10><max10> 0.00</max10><l10>-55</l10><h10>150</h10><s10>0</s10>
    <n11>11</n11><t11>-20480.00</t11><min11> 0.00</min11><max11> 0.00</max11><l11>-55</l11><h11>150</h11><s11>0</s11>
    <n12>12</n12><t12>-20480.00</t12><min12> 0.00</min12><max12> 0.00</max12><l12>-55</l12><h12>150</h12><s12>0</s12>
    <n13>13</n13><t13>-20480.00</t13><min13> 0.00</min13><max13> 0.00</max13><l13>-55</l13><h13>150</h13><s13>0</s13>
    <n14>14</n14><t14>-20480.00</t14><min14> 0.00</min14><max14> 0.00</max14><l14>-55</l14><h14>150</h14><s14>0</s14>
    <n15>15</n15><t15>-20480.00</t15><min15> 0.00</min15><max15> 0.00</max15><l15>-55</l15><h15>150</h15><s15>0</s15>
    <fn0>1</fn0><ft0>0</ft0><fs0>0</fs0>
    <fn1>2</fn1><ft1>0</ft1><fs1>0</fs1>
    <fn2>3</fn2><ft2>0</ft2><fs2>0</fs2>
    <fn3>4</fn3><ft3>0</ft3><fs3>0</fs3>
    <fn4>5</fn4><ft4>0</ft4><fs4>0</fs4>
    <fn5>6</fn5><ft5>0</ft5><fs5>0</fs5>
    <fn6>7</fn6><ft6>0</ft6><fs6>0</fs6>
    <fn7>8</fn7><ft7>0</ft7><fs7>0</fs7>
    <fn8>9</fn8><ft8>0</ft8><fs8>0</fs8>
    <fn9>10</fn9><ft9>0</ft9><fs9>0</fs9>
    <fn10>11</fn10><ft10>0</ft10><fs10>0</fs10>
    <fn11>12</fn11><ft11>0</ft11><fs11>0</fs11>
    <fn12>13</fn12><ft12>0</ft12><fs12>0</fs12>
    <fn13>14</fn13><ft13>0</ft13><fs13>0</fs13>
    <fn14>15</fn14><ft14>0</ft14><fs14>0</fs14>
    <fn15>16</fn15><ft15>0</ft15><fs15>0</fs15>
    <rn0>0</rn0><rt0>0</rt0>
    <rn1>1</rn1><rt1>0</rt1>
    <rn2>2</rn2><rt2>0</rt2>
    <rn3>3</rn3><rt3>0</rt3>
    <it0>248</it0><it1>254</it1><it2>255</it2><it3>255</it3><it4>128</it4><it5>1</it5><it6>255</it6><it7>255</it7>
    <date>06.08.2006</date><time>06:45:46</time><ad>1</ad><ntpsync>-1</ntpsync><i>10</i><f>0</f>
    <sys>18856004</sys><mem>25048</mem><fw>2.89</fw><dev>ALL4000</dev>
    <sensorx>5</sensorx><sensory>3</sensory>
    </data></xml>
    </TEXTAREA></FORM></BODY></HTML>

    Traceback (most recent call last):
      File "/home/pi/Desktop/sig_v3.py", line 14, in <module>
        xmldoc = minidom.parseString(allxml_string)
      File "/usr/lib/python2.7/xml/dom/minidom.py", line 1930, in parseString
        return expatbuilder.parseString(string)
      File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
        return builder.parseString(string)
      File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
        parser.Parse(string, True)
    ExpatError: mismatched tag: line 1, column 86

I hope someone can help me out on this.

thanks

3
  • 1
    See how the <meta> tag isn't closed? That's your mismatched tag. If that URL always gives you malformed XML, you'll have to edit it (manually add the text </meta> in the right place) before it will parse. Commented Nov 6, 2015 at 6:57
  • ... or change the meta phrase to look like "<meta .... />". I suspect this is what the XML author intended. Commented Nov 6, 2015 at 7:31
  • Any suggestions on how to accomplish that? It's my third day with Python. Some code would be nice. Commented Nov 6, 2015 at 7:40

3 Answers

1

Instead of

xmldoc = minidom.parse(allxml_string)

use

xmldoc = minidom.parseString(allxml_string)

The parse() function can take either a filename or an open file object.

If you have XML in a string, you can use the parseString() function instead.

Source: https://docs.python.org/2/library/xml.dom.minidom.html
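To make the difference concrete, here is a minimal sketch that feeds the same XML to both functions. The sample text, the temporary file, and the helper name first_text are assumptions made up for illustration; the code is written so that it should also run on Python 3.

```python
import os
import tempfile
from xml.dom import minidom

# Tiny stand-in document (hypothetical sample, not the full device output).
xml_text = "<data><t0> 1.27</t0></data>"

# parseString() takes the XML itself as a string.
doc_from_string = minidom.parseString(xml_text)

# parse() takes a filename (or an open file object), so the same
# text is written to a temporary file first.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(xml_text)
    path = handle.name
try:
    doc_from_file = minidom.parse(path)
finally:
    os.remove(path)


def first_text(doc, tag):
    """Return the text of the first <tag> element, converted to float."""
    return float(doc.getElementsByTagName(tag)[0].childNodes[0].data)


print(first_text(doc_from_string, "t0"))
print(first_text(doc_from_file, "t0"))
```

Passing the XML text itself to parse() would make minidom treat it as a filename, which is why the string variant exists.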


1 Comment

Thank you, me Young. The error messages are getting fewer. I edited the question above.
1

After you've read the XML into xml_string (allxml_string in your code), you'll want to do something to the received text to close that unclosed <meta> tag you're seeing. While it's normally not a good idea to try parsing HTML with regular expressions, in this case a regular expression is probably your simplest solution. Add import re at the top of your code, then do the following:

fixed_xml_string = re.sub("<meta(.*?)>", "<meta\\1></meta>", xml_string)

Then try parsing fixed_xml_string.
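As a rough illustration of what the substitution does, the sketch below applies it to a cut-down stand-in for the device page. The sample string is an assumption: it keeps the unclosed <meta> tag but leaves out the <TEXTAREA COLS=132 ROWS=50> wrapper, because unquoted attribute values are also not well-formed XML and a strict parser will likely reject them on the real page even after the <meta> fix.

```python
import re
from xml.dom import minidom

# Cut-down stand-in for the page (hypothetical): one unclosed <meta> tag,
# quoted attributes only, and a single sensor value.
xml_string = (
    '<HTML><HEAD><meta http-equiv="content-type" '
    'content="text/html; charset=ISO-8859-1"></HEAD>'
    '<BODY><xml><data><t0> 1.27</t0></data></xml></BODY></HTML>'
)

# Same substitution as above: append </meta> after every <meta ...> tag.
fixed_xml_string = re.sub("<meta(.*?)>", "<meta\\1></meta>", xml_string)

# With the tag closed, the strict parser accepts this sample.
xmldoc = minidom.parseString(fixed_xml_string)
sensor0 = float(xmldoc.getElementsByTagName('t0')[0].childNodes[0].data)
print(sensor0)
```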

If that works, great. If it doesn't work and there are more errors, then instead of fixing them one at a time, you'll be better off using a "generous" XML parser like BeautifulSoup instead of minidom. BeautifulSoup tries very hard not to give you errors when it encounters bad XML (or HTML), and instead figure out what the XML's author intended and give you that. It might guess wrong, which is why using a "strict" parser is better if you can -- but if you're dealing with malformed XML and just want to get your work done rather than fix someone else's broken code, that's exactly what BeautifulSoup was written for.

I won't spell out how to use BeautifulSoup, since its own documentation is pretty good. But if you use it and run into trouble, come back to StackOverflow and ask a second question.


-1

I managed to solve the issue by extracting the "xml only" out of the html/xml mixture.

re_xml = raw_xml[137:2938]

thanks for your input

1 Comment

That may work in exactly your one circumstance, but that means it is utterly useless as a general or good answer. @rmunn's answer is far better.
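A less position-dependent variant of the same "XML only" idea is to search for the <xml>...</xml> block instead of slicing at fixed offsets. This is only a sketch: the helper names extract_xml and read_sensor and the shortened sample page are assumptions for illustration, and on the device the text would come from the URL rather than from a literal.

```python
import re
from xml.dom import minidom


def extract_xml(page_text):
    """Return the <xml>...</xml> portion of the page, or raise ValueError."""
    match = re.search(r"<xml>.*?</xml>", page_text, re.DOTALL)
    if match is None:
        raise ValueError("no <xml>...</xml> block found in page")
    return match.group(0)


def read_sensor(xmldoc, tag):
    """Return the text of the first <tag> element, converted to float."""
    return float(xmldoc.getElementsByTagName(tag)[0].childNodes[0].data)


# Shortened stand-in for the page served by the device (hypothetical sample).
raw_xml = (
    '<HTML><HEAD><meta http-equiv="content-type" '
    'content="text/html; charset=ISO-8859-1"></HEAD>'
    '<BODY><FORM><TEXTAREA COLS=132 ROWS=50><xml><data>\n'
    '<devicename>ALL4000</devicename>\n'
    '<n0>0</n0><t0> 1.27</t0>\n'
    '<n1>1</n1><t1> 2.53</t1>\n'
    '<n2>2</n2><t2> 2.45</t2>\n'
    '</data></xml>\n'
    '</TEXTAREA></FORM></BODY></HTML>'
)

# Only the extracted block is handed to the strict parser, so the
# surrounding HTML wrapper never reaches minidom.
xmldoc = minidom.parseString(extract_xml(raw_xml))
sensors = [read_sensor(xmldoc, tag) for tag in ("t0", "t1", "t2")]
print(sensors)
```

Because the block is located by its tags, the extraction keeps working if the length of the page changes, for example when a sensor value gains or loses a digit.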
