
This piece of code is giving me an error:

Code:

import pandas as pd

fiddy_states = pd.read_html("https://simple.wikipedia.org/wiki/List_of_U.S._states")

Error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-9-87a39d7446f6> in <module>()
      1 import pandas as pd
----> 2 df_states = pd.read_html('http://www.50states.com/abbreviations.htm#.Vmz0ZkorLIU')

C:\Anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding)
    864     _validate_header_arg(header)
    865     return _parse(flavor, io, match, header, index_col, skiprows,
--> 866                   parse_dates, tupleize_cols, thousands, attrs, encoding)

C:\Anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, header, index_col, skiprows, parse_dates, tupleize_cols, thousands, attrs, encoding)
    716     retained = None
    717     for flav in flavor:
--> 718         parser = _parser_dispatch(flav)
    719         p = parser(io, compiled_match, attrs, encoding)
    720 

C:\Anaconda3\lib\site-packages\pandas\io\html.py in _parser_dispatch(flavor)
    661     if flavor in ('bs4', 'html5lib'):
    662         if not _HAS_HTML5LIB:
--> 663             raise ImportError("html5lib not found, please install it")
    664         if not _HAS_BS4:
    665             raise ImportError("BeautifulSoup4 (bs4) not found, please install it")

ImportError: html5lib not found, please install it

This happens even though I have the html5lib, lxml and BeautifulSoup4 libraries installed and up to date.
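A common cause of this error is that the packages were installed into a different Python environment than the one pandas is running in. A quick diagnostic sketch (standard library only; the package names match the ones the traceback checks for):

```python
import importlib.util
import sys

# Show which interpreter is actually running -- pip may have installed
# html5lib into a different environment than this one
print(sys.executable)

# Check whether each parser backend pandas.read_html can use is importable
for pkg in ("html5lib", "bs4", "lxml"):
    found = importlib.util.find_spec(pkg) is not None
    print("{}: {}".format(pkg, "found" if found else "MISSING"))
```

If any of these report MISSING, install them with the pip/conda that belongs to the interpreter printed on the first line.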

  • Please format your code properly. Commented Dec 3, 2015 at 6:38
  • Please provide full stacktrace Commented Dec 3, 2015 at 7:14
  • Try to import html5lib in a Python console. Does it work? Commented Dec 3, 2015 at 9:57
  • @AntonProtopopov I tried, but that also gives me an error even though I have it installed. Any ideas? Commented Dec 13, 2015 at 4:46
  • @DeepSpace I just did. Could you please help me out? Commented Dec 13, 2015 at 4:49

2 Answers


Consider parsing the HTML table with lxml using XPath expressions and then assembling the resulting lists into a data frame:

import urllib.request as rq
import lxml.etree as et
import pandas as pd

# DOWNLOAD WEB PAGE CONTENT
rqpage = rq.urlopen('https://simple.wikipedia.org/wiki/List_of_U.S._states')
txtpage = rqpage.read()
dom = et.HTML(txtpage)

# XPATH EXPRESSIONS TO LISTS (SKIPPING HEADER ROW)
abbreviation = dom.xpath("//table[@class='wikitable']/tr[position()>1]/td[1]/b/text()")
state = dom.xpath("//table[@class='wikitable']/tr[position()>1]//td[2]/a/text()")
capital = dom.xpath("//table[@class='wikitable']/tr[position()>1]//td[3]/a/text()")
incorporated = dom.xpath("//table[@class='wikitable']/tr[position()>1]//td[4]/text()")

# CONVERT LISTS TO DATA FRAME
df = pd.DataFrame({'Abbreviation':abbreviation,
                   'State':state,
                   'Capital':capital,
                   'Incorporated':incorporated})

print(df.head())

#   Abbreviation      Capital       Incorporated       State
#0            AL   Montgomery  December 14, 1819     Alabama
#1            AK       Juneau    January 3, 1959      Alaska
#2            AZ      Phoenix  February 14, 1912     Arizona
#3            AR  Little Rock      June 15, 1836    Arkansas
#4            CA   Sacramento  September 9, 1850  California
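Alternatively, if lxml itself imports fine, read_html can be told to use it directly via the flavor argument, sidestepping the html5lib/bs4 dependency entirely. A minimal sketch (parsing an inline HTML snippet here instead of the live URL; the same flavor='lxml' argument works when passing a URL):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Abbreviation</th><th>State</th></tr>
  <tr><td>AL</td><td>Alabama</td></tr>
  <tr><td>AK</td><td>Alaska</td></tr>
</table>
"""

# flavor='lxml' restricts read_html to the lxml parser, so html5lib and
# BeautifulSoup4 do not need to be installed
tables = pd.read_html(StringIO(html), flavor="lxml")
df = tables[0]
print(df)
```

read_html always returns a list of DataFrames, one per table found, so index into the result to get the table you want.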



Try using conda to install html5lib instead of pip. That worked for me.
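For instance, from the Anaconda prompt (assuming the Anaconda install shown in the traceback paths; restart the notebook kernel afterwards so the new package is picked up):

```shell
# Install html5lib into the active conda environment
conda install html5lib

# Verify it is importable by that environment's Python
python -c "import html5lib; print(html5lib.__version__)"
```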

