73

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date", and that table had 5 entries, I would like to parse through it to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

4 Answers

90

You should use an HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints

{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}

8 Comments

My table has a varying number of rows. How can I make it work if this is the case? Thanks for the response, btw.
@Andrew: The above code works for any number of rows and any number of columns, as long as every row has the same number of columns.
I'd suggest HTMLParser/html.parser, but this solution is much better in this case.
This was a useful pointer for additional research. I actually have some broken HTML to parse, so some other answers involving lxml.html also proved useful.
It fails if the HTML contains unquoted attributes like "<table align=center", raising lxml.etree.XMLSyntaxError: AttValue: " or ' expected.
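As a hedged follow-up to the comments above: lxml also ships an HTML-specific parser, lxml.html, which tolerates broken markup such as unquoted attributes. A minimal sketch of the same header/row loop on top of it (the sample markup here is made up):

from lxml import html

broken = """<table align=center>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>
"""

table = html.fromstring(broken)
# .//tr keeps working whether or not the parser inserts a <tbody>
rows = iter(table.findall(".//tr"))
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))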
79

Hands down, the easiest way to parse an HTML table is to use pandas.read_html(): it accepts both URLs and HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest

As of pandas version 1.5.0, read_html() can preserve hyperlinks with the extract_links argument; the affected table cells then become (text, link) tuples.
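A minimal sketch of that argument, assuming pandas >= 1.5.0 (the rest follows the snippet above):

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# extract_links="body" turns each body cell into a (text, link) tuple;
# the link part is None for cells without a hyperlink
tables = pd.read_html(url, extract_links="body")
sp500_table = tables[0]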

6 Comments

Not a good way for tables containing rowspan and colspan!
@JohnStrood Looking forward to reading your answer on how to handle rowspan and colspan 👍
@tommy.carstensen Ah! I used bs4 to build an element tree and traversed the elements to break row-spanned and column-spanned cells into constituent cells.
@tommy.carstensen There are already answers here: stackoverflow.com/a/39336433/5337834 and stackoverflow.com/a/9980393/5337834. If you're still unsatisfied, I'll write my own answer!
@zelusp I just learned that pandas is extremely slow if your HTML has 100+ tables and you just want a single table with a specific id. BeautifulSoup is much faster in this case.
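Regarding picking out a single table by id, read_html() can also filter candidate tables with its attrs (and match) arguments. A hedged sketch; the id value used here is an assumption about the page's markup, not something confirmed above:

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# attrs keeps only <table> elements whose attributes match;
# the id below is an assumption, adjust it to your own table
tables = pd.read_html(url, attrs={'id': 'constituents'})
sp500_table = tables[0]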
35

Sven Marnach's excellent solution is directly translatable into ElementTree, which is part of recent Python distributions:

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

Same output as in Sven Marnach's answer.

3 Comments

+1 because it allows using cElementTree instead of ElementTree, which is considerably faster than lxml when a large number of tables is involved.
I have a web page saved from Wikipedia. How can I tell ET which table to parse and fetch data from? Is it possible by table name or table id?
Also, <tbody> and <thead> don't work; see stackoverflow.com/q/49286753/8929814
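A hedged sketch addressing the last two comments: ElementTree's limited XPath support can select a table by its id attribute, and .//tr finds rows even when they sit inside <thead>/<tbody>. The id 'events' and the markup are made up, and the input still has to be well-formed XML:

from xml.etree import ElementTree as ET

doc = """<html><body>
  <table id="events">
    <thead><tr><th>Event</th><th>Start Date</th><th>End Date</th></tr></thead>
    <tbody><tr><td>a</td><td>b</td><td>c</td></tr></tbody>
  </table>
</body></html>"""

root = ET.XML(doc)
# select a specific table by its id attribute
table = root.find(".//table[@id='events']")
# .//tr descends into <thead>/<tbody> wrappers
rows = iter(table.findall(".//tr"))
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))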
23

If the HTML is not well-formed XML, you can't do it with etree. But even then, you don't have to use an external library to parse an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I have the code of the simple derived HTMLParser class here in a GitHub repo.
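The repo's class isn't reproduced here; as a rough, hypothetical sketch of the general idea (not the linked code), a hand-rolled HTMLParser subclass might look like this:

from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    """Hypothetical minimal example: collects every table as a list of rows,
    each row being a list of cell strings. Nested tables, colspan and rowspan
    are not handled."""
    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None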

You can use the full class from that repo (there named HTMLTableParser) in the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D lists representing tables. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]

3 Comments

Awesome parser!!
Neat indeed. It will break if some td have a colspan, though.
@mr.bjerre PR welcome ;-)
