73

I'd like to take an HTML table and parse through it to get a list of dictionaries. Each list element would be a dictionary corresponding to a row in the table.

If, for example, I had an HTML table with three columns (marked by header tags), "Event", "Start Date", and "End Date", and that table had 5 entries, I would like to parse through it to get back a list of length 5 where each element is a dictionary with keys "Event", "Start Date", and "End Date".

Thanks for the help!

4 Answers

90

You should use an HTML parsing library like lxml:

from lxml import etree
s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

prints

{'End Date': 'c', 'Start Date': 'b', 'Event': 'a'}
{'End Date': 'f', 'Start Date': 'e', 'Event': 'd'}
{'End Date': 'i', 'Start Date': 'h', 'Event': 'g'}

8 Comments

My table has a varying number of rows. How can I make it work if this is the case? Thanks for the response, btw.
@Andrew: The above code works for any number of rows and any number of columns, as long as every row has the same number of columns.
I'd suggest HTMLParser/html.parser, but this solution is much better in this case.
This was a useful pointer for additional research. I actually have some broken HTML to parse, so some other answers involving lxml.html also proved useful.
It fails if the HTML contains unquoted attributes like "<table align=center", raising lxml.etree.XMLSyntaxError: AttValue: " or ' expected.
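As a hedged follow-up to the comments above: lxml also ships an HTML-specific parser, lxml.html, which tolerates broken markup such as unquoted attributes. A minimal sketch of the same header/row loop on top of it (the sample markup here is made up):

from lxml import html

broken = """<table align=center>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>
"""

table = html.fromstring(broken)
# .//tr keeps working whether or not the parser inserts a <tbody>
rows = iter(table.findall(".//tr"))
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))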
79

Hands down, the easiest way to parse an HTML table is to use pandas.read_html(): it accepts both URLs and HTML.

import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest

As of pandas version 1.5.0, read_html() can preserve hyperlinks with the extract_links argument; the affected table cells then become (text, link) tuples.
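A minimal sketch of that argument, assuming pandas >= 1.5.0 (the rest follows the snippet above):

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# extract_links="body" turns each body cell into a (text, link) tuple;
# the link part is None for cells without a hyperlink
tables = pd.read_html(url, extract_links="body")
sp500_table = tables[0]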

6 Comments

Not a good way for tables containing rowspan and colspan!
@JohnStrood Looking forward to reading your answer on how to handle rowspan and colspan 👍
@tommy.carstensen Ah! I used bs4 to build an element tree and traversed the elements to break row-spanned and column-spanned cells into constituent cells.
@tommy.carstensen There are already answers here: stackoverflow.com/a/39336433/5337834 and stackoverflow.com/a/9980393/5337834. If you're still unsatisfied, I'll write my own answer!
@zelusp I just learned that pandas is extremely slow if your HTML has 100+ tables and you just want a single table with a specific id. BeautifulSoup is much faster in this case.
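Regarding picking out a single table by id, read_html() can also filter candidate tables with its attrs (and match) arguments. A hedged sketch; the id value used here is an assumption about the page's markup, not something confirmed above:

import pandas as pd

url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# attrs keeps only <table> elements whose attributes match;
# the id below is an assumption, adjust it to your own table
tables = pd.read_html(url, attrs={'id': 'constituents'})
sp500_table = tables[0]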
35

Sven Marnach's excellent solution is directly translatable into ElementTree, which is part of recent Python distributions:

from xml.etree import ElementTree as ET

s = """<table>
  <tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
  <tr><td>d</td><td>e</td><td>f</td></tr>
  <tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

table = ET.XML(s)
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))

Same output as in Sven Marnach's answer.

3 Comments

+1 because it allows using cElementTree instead of ElementTree, which is considerably faster than lxml when a large number of tables is involved.
I have a web page saved from Wikipedia. How can I tell ET which table to parse and fetch data from? Is it possible by table name or table id?
Also, <tbody> and <thead> don't work; see stackoverflow.com/q/49286753/8929814
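A hedged sketch addressing the last two comments: ElementTree's limited XPath support can select a table by its id attribute, and .//tr finds rows even when they sit inside <thead>/<tbody>. The id 'events' and the markup are made up, and the input still has to be well-formed XML:

from xml.etree import ElementTree as ET

doc = """<html><body>
  <table id="events">
    <thead><tr><th>Event</th><th>Start Date</th><th>End Date</th></tr></thead>
    <tbody><tr><td>a</td><td>b</td><td>c</td></tr></tbody>
  </table>
</body></html>"""

root = ET.XML(doc)
# select a specific table by its id attribute
table = root.find(".//table[@id='events']")
# .//tr descends into <thead>/<tbody> wrappers
rows = iter(table.findall(".//tr"))
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))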
23

If the HTML is not well-formed XML, you can't do it with etree. But even then, you don't have to use an external library to parse an HTML table. In Python 3 you can reach your goal with HTMLParser from html.parser. I have the code of the simple derived HTMLParser class here in a GitHub repo.
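The repo's class isn't reproduced here; as a rough, hypothetical sketch of the general idea (not the linked code), a hand-rolled HTMLParser subclass might look like this:

from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    """Hypothetical minimal example: collects every table as a list of rows,
    each row being a list of cell strings. Nested tables, colspan and rowspan
    are not handled."""
    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None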

You can use the full class from that repo (there named HTMLTableParser) in the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D lists representing tables. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]

3 Comments

Awesome parser!!
Neat indeed. It will break if some td have a colspan, though.
@mr.bjerre PR welcome ;-)
