I have an HTML file with the following data structure:

<tr>
    <td valign="top"><img src="img.jpg"></td>
    <td><a href="file.zip">file.zip</a></td>
    <td align="right">24-Apr-2013 12:42 </td>
    <td align="right">200K</td>
</tr>
...

It's basically a simple table and when viewed in Firefox it looks like this:

file.zip   24-Apr-2013 12:42   200K

I want to extract these three values (file name, date, size). I could do it with split(), for example, but I am wondering whether it is possible to print the "HTML-interpreted form" of this in Python, something like:

import xyz
print xyz.htmlinterpreted("htmlfile.html")
# desired output: file.zip   24-Apr-2013 12:42   200K

That way I could easily split the data with split(" "). Is this possible in Python?

1 Answer

Use an HTML parser. BeautifulSoup makes this a breeze:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source)
print list(soup.stripped_strings)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''<tr><td valign="top"><img src="img.jpg"></td><td><a href="file.zip">file.zip</a></td><td align="right">24-Apr-2013 12:42 </td><td align="right">200K</td></tr>''')
>>> print list(soup.stripped_strings)
[u'file.zip', u'24-Apr-2013 12:42', u'200K']
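
If the file contains many rows, or BeautifulSoup is not installed, the same idea can be sketched with the standard library's html.parser module, grouping the text by row. This is only a rough Python 3 sketch, not part of the BeautifulSoup approach above: the names RowTextParser and extract_rows are made up for illustration, and it assumes every row holds exactly three non-empty text cells, as in the sample.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the non-empty text fragments of each <tr>, one list per row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of strings per completed <tr>
        self._current = None  # strings of the row being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Keep only text that sits inside a row and is not pure whitespace.
        text = data.strip()
        if text and self._current is not None:
            self._current.append(text)


def extract_rows(html_source):
    """Return a list of rows, each a list of stripped text fragments."""
    parser = RowTextParser()
    parser.feed(html_source)
    parser.close()
    return parser.rows


sample = '''<tr>
    <td valign="top"><img src="img.jpg"></td>
    <td><a href="file.zip">file.zip</a></td>
    <td align="right">24-Apr-2013 12:42 </td>
    <td align="right">200K</td>
</tr>'''

# Assumes three text cells per row (name, date, size), as in the sample.
for name, date, size in extract_rows(sample):
    print(name, date, size)
```

Working per cell also sidesteps a problem with split(" ") on the rendered text: the date itself contains a space, so it would be split into two pieces.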