2

I am trying to parse values from HTML using Python with lxml and XPath.

Here is my HTML data:

<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>

<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
      <td class="u">
       <select name="record[14][type]">
         <option SELECTED value="CNAME" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>

<td class="u"><input class="wide" name="record[15][name]" value="exampledomain3.com"></td>
      <td class="u">
       <select name="record[15][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[15][content]" value='10.10.10.3'></td>
</tr>
</table>

What I want is to parse the values and print them as below:

exampledomain1.com A 10.10.10.1
exampledomain2.com CNAME exampledomain1.com
exampledomain3.com A 10.10.10.3

Here is what I tried:

#!/usr/bin/python
import lxml.html
from lxml import etree

doc = lxml.html.document_fromstring("""Here whole html data""")
txt1 = doc.xpath('//*[@class="wide"]/@value')
txt2 = doc.xpath('//@SELECTED/text()')
print txt1
print txt2

But it's not working as I wanted. Any help would be appreciated.

Thank you all.

3
  • Running 'xmllint --noout' on your HTML reports 7 errors. You should fix them before trying to parse it. Commented Jul 31, 2012 at 16:33
  • How is it 'not working as [you] wanted'? Commented Jul 31, 2012 at 17:11
  • Use BeautifulSoup. It's simple and easy. Commented Aug 1, 2012 at 14:55

2 Answers

3

I fixed the code to return the following, which is very close to what you asked for:

(py26_default)[mpenning@Bucksnort ~]$ python parse.py
exampledomain1.com 10.10.10.1
exampledomain2.com exampledomain1.com
exampledomain3.com 10.10.10.3
(py26_default)[mpenning@Bucksnort ~]$

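I could not retrieve record[13][type] with XPath under this XML parser, because the valueless SELECTED attribute is not well-formed XML (an HTML parser such as lxml.html accepts it). There are other ways to iterate through this, but I will leave that as an exercise for the OP. Note that I fixed the HTML in the OP's question to include <table> and <tr> tags.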

from lxml import etree
from lxml.etree import XMLParser

# recover=True lets the XML parser continue past malformed markup
parser = XMLParser(ns_clean=True, recover=True)
doc = etree.fromstring("""Here whole html data""", parser)
elem1 = doc.xpath('//input[@name="record[13][name]"]')
# NOTE: <option SELECTED> could not be retrieved with this XML parser;
#   SELECTED must have a value for the markup to be well-formed XML.
#elem2 = doc.xpath('//select[@name="record[13][type]"]/option[@SELECTED]')
elem3 = doc.xpath('//input[@name="record[13][content]"]')

# Pair each name input with the content input at the same position
for name, content in zip(elem1, elem3):
    print("%s %s" % (name.attrib['value'], content.attrib['value']))

<!-- The (fixed) html source I used -->
<table>
<tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain2.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="CNAME" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='exampledomain1.com'></td>

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain3.com"></td>
      <td class="u">
       <select name="record[13][type]">
         <option SELECTED value="A" >A</option>
         <option value="AAAA" >AAAA</option>
         <option value="CNAME" >CNAME</option>
         <option value="HINFO" >HINFO</option>
         <option value="MX" >MX</option>
         <option value="NAPTR" >NAPTR</option>
         <option value="NS" >NS</option>
         <option value="PTR" >PTR</option>
         <option value="SOA" >SOA</option>
         <option value="SPF" >SPF</option>
         <option value="SRV" >SRV</option>
         <option value="SSHFP" >SSHFP</option>
         <option value="TXT" >TXT</option>
         <option value="RP" >RP</option>
         <option value="URL" >URL</option>
         <option value="MBOXFW" >MBOXFW</option>
         <option value="CURL" >CURL</option>
       </select>
      </td>
      <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.3'></td>
</tr>
</table>
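
A minimal sketch of one way to get the type column as well, assuming the page is parsed with the lxml.html HTML parser rather than an XML parser. The HTML parser lowercases attribute names and keeps valueless attributes, so option[@selected] matches the SELECTED option. The helper name extract_records and the trimmed SAMPLE markup are illustrative only and are not part of the code above.

```python
import lxml.html

# Trimmed-down sample in the shape of the question's markup (assumption:
# every record has a [name] input, a [type] select and a [content] input).
SAMPLE = """
<table><tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u"><select name="record[13][type]">
  <option SELECTED value="A" >A</option>
  <option value="CNAME" >CNAME</option>
</select></td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td class="u"><select name="record[14][type]">
  <option value="A" >A</option>
  <option SELECTED value="CNAME" >CNAME</option>
</select></td>
<td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
</tr></table>
"""


def extract_records(html):
    """Return a list of (name, type, content) tuples, one per record."""
    doc = lxml.html.fromstring(html)
    rows = []
    # Find every "record[N][name]" input, whatever N is.
    for name_input in doc.xpath('//input[contains(@name, "][name]")]'):
        # Strip the trailing "[name]" to get the prefix, e.g. "record[13]".
        prefix = name_input.get("name")[:-len("[name]")]
        # $n is an XPath variable filled in from the keyword argument.
        rtype = doc.xpath('//select[@name=$n]/option[@selected]/@value',
                          n=prefix + "[type]")
        content = doc.xpath('//input[@name=$n]/@value',
                            n=prefix + "[content]")
        rows.append((name_input.get("value"),
                     rtype[0] if rtype else "",
                     content[0] if content else ""))
    return rows


if __name__ == "__main__":
    for name, rtype, content in extract_records(SAMPLE):
        print(name, rtype, content)
```

Looking up the type and content fields by the shared record[N] prefix, rather than by list position, keeps the three columns aligned even if one field is missing for some record.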

6 Comments

Hi Mike, the "record[13]" part of the name attribute changes for each of the other DNS records, which I have corrected in this HTML. So //input[@name="record[13][name]"] will not catch the records with different numbers. How can I define a wildcard or a range in it?
You could use an lxml regex to solve this problem.
Thank you Mike. I got that working with regex, but I am still stuck on getting the SELECTED value.
I am talking about this code: <select name="record[13][type]"> <option SELECTED value="A" >A</option>, where the value is "A". It should print that SELECTED value as "A", and the whole output should look like exampledomain1.com A 10.10.10.1
I have helped enough. This answer demonstrates how to solve your problem; however, I cannot solve all the problems, and this is part of your job. You need to rise to the challenge.
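
The "lxml regex" suggested in the comments above presumably refers to the EXSLT regular-expression functions that lxml exposes inside XPath. A rough sketch of that idea, assuming the field names follow the record[N][field] pattern from the question; the helper name values_for_field and the SAMPLE string are made up for illustration.

```python
import re

import lxml.html

# Namespace under which lxml provides the EXSLT regex functions in XPath.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Small sample in the shape of the question's markup (illustrative only).
SAMPLE = """<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><input class="wide" name="record[13][content]" value="10.10.10.1"></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><input class="wide" name="record[14][content]" value="exampledomain1.com"></td>
</tr></table>"""


def values_for_field(html, field):
    """Return the value of every <input> named record[<digits>][<field>]."""
    doc = lxml.html.fromstring(html)
    # Match any record number instead of hard-coding 13, 14, 15, ...
    pattern = r"^record\[\d+\]\[%s\]$" % re.escape(field)
    return doc.xpath("//input[re:test(@name, $pat)]/@value",
                     namespaces=EXSLT_RE, pat=pattern)


if __name__ == "__main__":
    print(values_for_field(SAMPLE, "name"))
    print(values_for_field(SAMPLE, "content"))
```

Passing the pattern as the XPath variable $pat avoids having to escape quotes and brackets inside the XPath string itself.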
0
import lxml.html

# `html` is a placeholder for the page source from the question.
tree = lxml.html.fromstring(html)

# The name and content fields are <input> elements, so read their value attribute;
# the type is the value of the <option> carrying the selected attribute.
record_13_name = tree.xpath("//input[@name='record[13][name]']/@value")
record_13_type = tree.xpath("//select[@name='record[13][type]']/option[@selected]/@value")
record_13_content = tree.xpath("//input[@name='record[13][content]']/@value")

record_14_name = tree.xpath("//input[@name='record[14][name]']/@value")
record_14_type = tree.xpath("//select[@name='record[14][type]']/option[@selected]/@value")
record_14_content = tree.xpath("//input[@name='record[14][content]']/@value")

record_15_name = tree.xpath("//input[@name='record[15][name]']/@value")
record_15_type = tree.xpath("//select[@name='record[15][type]']/option[@selected]/@value")
record_15_content = tree.xpath("//input[@name='record[15][content]']/@value")
