How to get the content from a certain <table> using python?

Question

I have some <tr>s, like this:

<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>

I want to fetch the content without html tags, like:

yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45

Now I'm using the following code to deal with it:

import re
import urllib2

response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()

pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='

I'm new to regex and also new to Python, so could you suggest a better way to process it?

possible duplicate of RegEx match open tags except XHTML self-contained tags
– Chinmay Kanchi, Commented Oct 2, 2012 at 12:38
Don't parse HTML with RegEx. Tony the Pony will eat you alive. Please use a proper parser instead, such as lxml (a third-party package, not part of the standard library).
– Chinmay Kanchi, Commented Oct 2, 2012 at 12:40

jfs · Accepted Answer · 2012-10-02 12:51:48Z

1

You could use BeautifulSoup to parse the HTML. To write the content of the table in CSV format:

#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4

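# Parse the page; BeautifulSoup accepts the file-like object from urlopen
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))

writer = csv.writer(sys.stdout)
# find('table', 'a') matches <table class="a">; calling a tag is shorthand for find_all()
for tr in soup.find('table', 'a')('tr'):
    # one CSV row per <tr>, using the text of each <td> without markup
    writer.writerow([td.get_text() for td in tr('td')])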

Output

Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25

Bryan · Accepted Answer · 2012-10-02 15:45:44Z

1

Also take a look at PyQuery. It is very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.

import itertools
from pyquery import PyQuery as pq

# parse html
html = pq(url="http://poj.org/status")

# extract header values from table
header = [td.text for td in html(".a").find(".in").find("td")]

# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]

# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]
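
Note that itertools.izip exists only in Python 2. As a minimal sketch of the same merge step in Python 3, the built-in zip can take its place. The header and detail values below are hard-coded from the sample rows in the question and the column names in the CSV output shown earlier, on the assumption that PyQuery would return the same strings; nothing is fetched from the site.

```python
# Column names and rows are copied from the thread, not scraped.
header = ["Run ID", "User", "Problem", "Result", "Memory", "Time",
          "Language", "Code Length", "Submit Time"]
detail = [
    ["10876151", "yangfanhit", "3155", "Accepted", "344K", "219MS",
     "C++", "3940B", "2012-10-02 16:42:45"],
    ["10876150", "BandBandRock", "2503", "Accepted", "16348K", "2750MS",
     "G++", "840B", "2012-10-02 16:42:25"],
]

# Pair each column name with the value in the same position, one dict per row.
result = [dict(zip(header, values)) for values in detail]

for row in result:
    print(row["User"], row["Problem"], row["Result"])
```

Each dictionary can then be addressed by column name instead of by position, which is usually easier to read than indexing into a nested list.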

edited Oct 2, 2012 at 15:45

answered Oct 2, 2012 at 14:02

Community · Accepted Answer · 2017-05-23 12:18:49Z

0

You really don't need to work with regex directly to parse HTML; see the answer here.

Or see Dive Into Python, Chapter 8, on HTML processing.
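
If installing a third-party parser is not an option, the standard library can do this too. The following is a minimal sketch for Python 3 using html.parser.HTMLParser; the class name CellTextParser and its attributes are made up for illustration, and the input is one sample row from the question rather than the live page.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, grouped by enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text pieces of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <font> also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# One row copied from the question, split across lines for readability.
sample = (
    "<tr align=center><td>10876151</td>"
    "<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>"
    "<td><a href=problem?id=3155>3155</a></td>"
    "<td><font color=blue>Accepted</font></td>"
    "<td>344K</td><td>219MS</td><td>C++</td><td>3940B</td>"
    "<td>2012-10-02 16:42:45</td></tr>"
)

parser = CellTextParser()
parser.feed(sample)
parser.close()

for row in parser.rows:
    # Skip the run ID so the output matches what the question asks for.
    print("\n".join(row[1:]))
```

This is more code than the BeautifulSoup version, but it has no dependencies and it tolerates the unquoted attributes used on this page.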

answered Oct 2, 2012 at 12:37

oz123

Surya Kasturi · Accepted Answer · 2012-10-02 12:53:35Z

Why do all of that by hand when HTML/XML parsers already do the job for you?

Use BeautifulSoup. What you describe in the question can be done in 2-3 lines of code.

Example:

>>> from bs4 import BeautifulSoup as bs
>>> html = """
... <tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
... <tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
... """
>>> soup = bs(html)
>>> soup.td
<td>10876151</td>
