'ValueError: No tables found': Python pd.read_html not loading input files

Question

I am trying to import a series of HTML files containing news articles that I have saved in my working directory. I developed the code using a single HTML file and it worked perfectly. However, I have since amended the code to import multiple files.

As you can see from the code below, I am using pandas and pd.read_html(). It no longer imports any files and gives me the error 'ValueError: No tables found'.

I have tried different types of HTML files, so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OS X, Python 3.6, and pandas 0.20.3 in Anaconda Navigator.

It was working, now it's not. What am I doing wrong?

Any tips or clues would be greatly appreciated.

import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os

mypath = 'path_to_my_wd'

raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]

news = pd.DataFrame()

for htmlfile in raw_data:
    # reads file as html
    articles = pd.read_html(join(mypath, htmlfile), index_col=0)
    data = pd.concat([art for art in articles if 'HD' in art.index.values],
                     axis=1).T.set_index('AN')
    # selects columns to export
    data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
    news = news.append(data_export)

I think you need to use join(mypath, raw_data) in pd.read_html
– stellasia, Commented Jul 14, 2018 at 8:51
Thank you for the suggestion @stellasia! However, I still can't make it work. I noticed that I uploaded an amended version of the code. The original has join(mypath, htmlfile) in pd.read_html, but this doesn't make a difference. I have amended the code. Any other suggestions?
– Jakob Rasmussen, Commented Jul 17, 2018 at 10:40
Another suggestion would be to create the news dataframe with the columns argument, as you do with data_export, to tell pandas about the structure of the dataframe.
– stellasia, Commented Jul 17, 2018 at 12:51
Thanks again, really appreciate it @stellasia! Still not working though - very frustrating.
– Jakob Rasmussen, Commented Jul 17, 2018 at 17:10

Jakob Rasmussen · Accepted Answer · 2018-07-18 11:04:42Z

The HTML files were slightly different in formatting, and I needed to pass sort=False to pd.concat():

data = pd.concat([art for art in articles if 'HD' in art.index.values], sort=False, axis=1).T.set_index('AN')

The sort argument is new in pandas version 0.23.0. That solved the problem.
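For readers who want to see the concat step in isolation, here is a minimal sketch that runs without any HTML files. The two small frames (art_a, art_b) and their values are made up for illustration; they only imitate what pd.read_html(..., index_col=0) might return for two files whose fields appear in a different order, and are not taken from the real data.

```python
import pandas as pd

# Hypothetical stand-ins for tables parsed from two differently formatted files.
# Each frame has one column of values, indexed by field code.
art_a = pd.DataFrame({1: ["Headline A", "AN-001", "Author A"]},
                     index=["HD", "AN", "BY"])
art_b = pd.DataFrame({1: ["AN-002", "Headline B", "Source B"]},
                     index=["AN", "HD", "SN"])

articles = [art_a, art_b]

# Keep only the tables that carry an 'HD' row, place them side by side,
# then transpose so that every article becomes one row keyed by 'AN'.
# sort=False asks pandas not to sort the combined field labels.
data = pd.concat(
    [art for art in articles if "HD" in art.index.values],
    axis=1,
    sort=False,
).T.set_index("AN")

print(data)
```

Fields that are missing from one of the frames (here 'BY' for the second article and 'SN' for the first) come out as NaN rather than raising an error.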
