
I'm trying to parse a web page that contains this:

<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
 <td colspan="2"
     style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>

(The table continues with more rows like these and ends with </table>.)

from lxml import html

# page holds the downloaded HTML source (Python 2 code)
tree = html.fromstring(page)
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        if 'colspan' in html.tostring(elem):
            print '*', elem.text
        elif elem.text is not None:
            print elem.text,
        else:
            print

This somewhat works, but it does not get the text following each <br /> and it's far from elegant. How do I get the missing text? In addition, any suggestions for improving the code would be appreciated.

1 Answer


How about using .text_content()?

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

table = tree.xpath('//table/tr')
for item in table:
    print ' '.join(item.text_content().split())

The join() + split() combination collapses every run of whitespace (including the newlines and indentation inside the cells) into a single space.

It prints:

February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
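As for why the original loop missed the text after each <br />: in lxml (as in ElementTree), elem.text holds only the text that comes before the element's first child, and any text that follows a child element is stored on that child's .tail. Here is a minimal Python 3 sketch of that behavior, using a trimmed copy of one row from the question; the squeeze() helper and the variable names are made up for illustration.

```python
from lxml import html

# A trimmed copy of one forecast row from the question.
table = html.fromstring(
    '<table><tr><td>Clear<br />\n'
    '  Precip:\n'
    '  0 %<br />\n'
    '      Wind:\n'
    '      from the WSW at 6 mph\n'
    ' </td></tr></table>'
)
cell = table.xpath('.//td')[0]

def squeeze(s):
    """Collapse every run of whitespace into a single space."""
    return ' '.join(s.split())

# .text stops at the first child element (the first <br>).
print(repr(cell.text))   # 'Clear'

# The text after each <br> is stored on that <br>'s .tail.
tails = [squeeze(br.tail) for br in cell.findall('br')]
print(tails)

# text_content() joins .text with the text and tail of every descendant.
full = squeeze(cell.text_content())
print(full)
```

So text_content() is the shortcut; walking .text and .tail by hand is only needed when the pieces have to be kept separate.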

Since you want to merge each time line with its precip line, you can iterate over the tr tags while skipping those that contain "Precip" in their text. Then, for every time line, take the first following tr sibling to get the matching precip line:

table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        text += ' ' + ' '.join(item.xpath('following-sibling::tr')[0].text_content().split())

    print text

Prints:

February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph

Comments

Much nicer! Is there a good way to identify whether a line is the date line, a time line or some other line (using xpath, not parsing the contents)? If nothing else, I'd like to merge each time line with its "Clear Precip" line.
@foosion for a date line, I would follow the EAFP principle: try to load the contents with datetime.strptime() and handle ValueError. If there is no error, it is a date line. For a time line, I think you can just search for PM or AM inside the contents. It looks like the other lines start with "Clear Precip".
@foosion let me provide you with a sample, give me a minute.
@alecxe I know how to do that. I was hoping there was a way to do it with xpath rather than parsing the text to see whether it's a date, a time or something else. For example, the dates are part of <td colspan="2">
@foosion as for a date line, you are right - we can check if there is a td child with colspan="2", like this: if item.xpath('.//td[@colspan="2"]'):
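The two ideas from these comments can be combined: the colspan test identifies date rows structurally, and an EAFP-style datetime.strptime() call identifies time rows. Below is a rough Python 3 sketch under stated assumptions: the PAGE sample is a shortened stand-in for the real page, classify() and squeeze() are hypothetical helpers, the '%I:%M %p' format string is a guess at the time format based on the sample rows, and every time row is assumed to be followed directly by its conditions row.

```python
from datetime import datetime
from lxml import html

# Shortened stand-in for the table in the question (illustrative only).
PAGE = """
<table>
<tr><td colspan="2">February 20, 2015</td></tr>
<tr><td>9:00 PM</td><td>14&deg;F</td></tr>
<tr><td>Clear<br />
  Precip: 0 %<br />
  Wind: from the WSW at 6 mph</td><td><img src="31.gif" /></td></tr>
<tr><td>10:00 PM</td><td>13&deg;F</td></tr>
<tr><td>Clear<br />
  Precip: 0 %<br />
  Wind: from the WSW at 6 mph</td><td><img src="31.gif" /></td></tr>
</table>
"""

def squeeze(element):
    """Text of an element with whitespace runs collapsed to one space."""
    return ' '.join(element.text_content().split())

def classify(row):
    """Return 'date', 'time' or 'other' for a <tr> element."""
    # Date rows are the only ones whose cell spans both columns.
    if row.xpath('./td[@colspan="2"]'):
        return 'date'
    try:
        # EAFP: try to parse the first cell as a 12-hour clock time.
        # %p is locale-dependent; this assumes an English/C locale.
        first_cell = row.xpath('./td[1]')[0]
        datetime.strptime(squeeze(first_cell), '%I:%M %p')
        return 'time'
    except (IndexError, ValueError):
        return 'other'

rows = html.fromstring(PAGE).xpath('//table/tr')

forecast = []        # tuples of (date, hour, temperature, conditions)
current_date = None
for row in rows:
    kind = classify(row)
    if kind == 'date':
        current_date = squeeze(row)
    elif kind == 'time':
        cells = row.xpath('./td')
        hour, temp = squeeze(cells[0]), squeeze(cells[1])
        # The conditions sit in the row immediately after the time row.
        details = squeeze(row.xpath('following-sibling::tr[1]')[0])
        forecast.append((current_date, hour, temp, details))

for entry in forecast:
    print(' | '.join(entry))
```

Keeping each entry as a tuple rather than one merged string means the date, time, temperature and conditions stay available as separate fields for whatever comes next.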