Parsing xpath with python

Question

I'm trying to parse a web page that contains this:

<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
 <td colspan="2"
     style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>

(It continues with more rows and ends with </table>.)

from lxml import html

tree = html.fromstring(page)  # page holds the downloaded HTML
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        if 'colspan' in html.tostring(elem):
            print '*', elem.text
        elif elem.text is not None:
            print elem.text,
        else:
            print

This somewhat works, but it does not get the text following each <br />, and it's far from elegant. How do I get the missing text? In addition, any suggestions for improving the code would be appreciated.

alecxe · Accepted Answer · 2015-02-21 02:52:24Z


How about using .text_content()?

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

table = tree.xpath('//table/tr')
for item in table:
    print ' '.join(item.text_content().split())

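The split() call with no arguments splits on any run of whitespace (spaces, tabs, newlines), and ' '.join() puts the pieces back together with single spaces between them.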

It prints:

February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
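As for why elem.text misses the text after each <br />: lxml follows the ElementTree model, where an element's .text is only the text before its first child, and any text that comes after a child element is stored on that child as .tail. A minimal sketch of this, written for Python 3 and using a trimmed copy of one cell from the question (the variable names are illustrative):

```python
from lxml import html

# A trimmed copy of one "conditions" cell from the question, wrapped in a
# table so the parser keeps the <td> in place.
table = html.fromstring(
    '<table><tr><td>Clear<br />\n  Precip:\n  0 %<br />\n'
    '    Wind:\n    from the WSW at 6 mph\n </td></tr></table>'
)
cell = table.xpath('.//td')[0]

# .text holds only what comes before the first child element.
print(repr(cell.text))  # 'Clear'

# Text that follows a child element lives in that child's .tail.
tails = [br.tail for br in cell.xpath('br')]
print(tails)

# text_content() concatenates the text and every tail beneath the element;
# split() plus join() then collapses each run of whitespace to one space.
normalized = ' '.join(cell.text_content().split())
print(normalized)  # Clear Precip: 0 % Wind: from the WSW at 6 mph
```

So the loop in the question prints "Clear" and then stops, because the rest of the cell is attached to the two br elements rather than to the td itself.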

Since you want to merge each time line with its precip line, you can iterate over the tr tags while skipping those that contain "Precip" in their text. For every time line, take the next tr sibling to get the matching precip line:

table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        text += ' ' + ' '.join(item.xpath('following-sibling::tr')[0].text_content().split())

    print text

Prints:

February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph

edited Feb 21, 2015 at 2:52

answered Feb 21, 2015 at 2:28

alecxe


6 Comments

foosion Over a year ago

Much nicer! Is there a good way to identify whether a line is the date line, a time line, or some other line (using xpath, not parsing the contents)? If nothing else, I'd like to merge each time line with its clear/precip line.

alecxe Over a year ago

@foosion For a date line, I would follow the EAFP principle: try to parse the contents with datetime.strptime() and handle ValueError. If no error is raised, it is a date line. For a time line, I think you can just search for the word PM or AM in the contents. It looks like the other lines start with "Clear Precip".
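A rough sketch of that text-based classification, written for Python 3. The helper names is_date_line and classify are made up for illustration, and the format string '%B %d, %Y' is an assumption based on the single date shown in the question; note that %B matches month names in the current locale, so this assumes an English (or default C) locale:

```python
from datetime import datetime


def is_date_line(text, fmt='%B %d, %Y'):
    """EAFP check: True if text parses as a date in the assumed format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


def classify(text):
    """Label a whitespace-normalized row as 'date', 'time' or 'other'."""
    if is_date_line(text):
        return 'date'
    # Look for AM/PM as a whole word rather than as a substring.
    words = text.split()
    if 'AM' in words or 'PM' in words:
        return 'time'
    return 'other'


for line in ['February 20, 2015',
             '9:00 PM 14\u00b0F',
             'Clear Precip: 0 % Wind: from the WSW at 6 mph']:
    print(classify(line), '|', line)
```

Checking for AM/PM as a separate word avoids false positives from words that merely contain those letters.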

alecxe Over a year ago

@foosion let me provide you with a sample, give me a minute.

foosion Over a year ago

@alecxe I know how to do that. I was hoping there was a way to do it with xpath rather than parsing the text to see whether it's a date, a time, or something else. For example, the dates are inside <td colspan="2">.

alecxe Over a year ago

@foosion as for a date line, you are right - we can check if there is a td child with colspan="2", like this: if item.xpath('.//td[@colspan="2"]'):
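Building on the colspan check, here is one possible way to tell the row types apart with XPath alone and group the hourly rows under their date. It is only a sketch for Python 3: the function name group_forecast is hypothetical, and treating a row as a time line when it has a td whose style contains "font-weight: bold" is an assumption drawn from the sample markup in the question, so it will break if the site changes its styling.

```python
from lxml import html

# Cut-down copy of the table from the question (one date, two hours).
page = """
<table>
<tr>
 <td colspan="2" style="background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14\u00b0F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
  Wind:
  from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="31.gif" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13\u00b0F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
  Wind:
  from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="31.gif" /></td>
</tr>
</table>
"""


def clean(element):
    """Return the element's text with whitespace runs collapsed."""
    return ' '.join(element.text_content().split())


def group_forecast(tree):
    """Map each date to a list of (time, temperature, conditions) tuples."""
    forecast = {}
    current_date = None
    for row in tree.xpath('//table/tr'):
        if row.xpath('td[@colspan="2"]'):
            # Date line: a single cell spanning both columns.
            current_date = clean(row)
            forecast[current_date] = []
        elif current_date is not None and row.xpath(
                'td[contains(@style, "font-weight: bold")]'):
            # Time line: bold cells holding the time and the temperature.
            time_cell, temp_cell = row.xpath('td')[:2]
            # The conditions sit in the first cell of the very next row.
            detail = row.xpath('following-sibling::tr[1]/td[1]')
            conditions = clean(detail[0]) if detail else ''
            forecast[current_date].append(
                (clean(time_cell), clean(temp_cell), conditions))
    return forecast


forecast = group_forecast(html.fromstring(page))
for date, hours in forecast.items():
    print(date)
    for entry in hours:
        print('  ', ' | '.join(entry))
```

Conditions rows are never visited as rows of their own here; each one is reached through following-sibling::tr[1] from its time line, so no text matching on "Precip" is needed.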
