I am trying to import a series of HTML files containing news articles that I have saved in my working directory. I developed the code using a single HTML file and it worked perfectly. However, I have since amended the code to import multiple files.

As you can see from the code below, I am using pandas and pd.read_html(). It no longer imports any files and gives me the error 'ValueError: No tables found'.

I have tried with different types of HTML files so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OSX and Python 3.6 and Pandas 0.20.3 in Anaconda Navigator.

It was working, now it's not. What am I doing wrong?

Any tips or clues would be greatly appreciated.

import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os

mypath = 'path_to_my_wd'

raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]

news = pd.DataFrame()

for htmlfile in raw_data:
    articles = pd.read_html(join(mypath, htmlfile), index_col=0) #reads file as html
    data = pd.concat([art for art in articles if 'HD' in art.index.values], axis=1).T.set_index('AN')
    # selects columns to export
    data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
    news = news.append(data_export)
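
One way to narrow down a 'No tables found' error is to check, independently of pandas, whether each file actually contains a <table> element. The following is a minimal sketch using only the standard library; the names TableCounter and count_tables are hypothetical helpers, not part of the original code, and the sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    """Return the number of <table> elements found in html_text."""
    parser = TableCounter()
    parser.feed(html_text)
    return parser.tables


# Hypothetical sample inputs standing in for the contents of two files.
with_table = "<html><body><table><tr><td>HD</td></tr></table></body></html>"
without_table = "<html><body><p>No tabular data here</p></body></html>"

print(count_tables(with_table))
print(count_tables(without_table))
```

Running count_tables on the text of each file in raw_data would show which files, if any, have no tables at all.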
  • I think you need to use join(mypath, raw_data) in pd.read_html Commented Jul 14, 2018 at 8:51
  • Thank you for the suggestion @stellasia! However, I still can't make it work. I noticed that I uploaded an amended version of the code. The original has join(mypath, htmlfile) in pd.read_html, but this doesn't make a difference. I have amended the code. Any other suggestions? Commented Jul 17, 2018 at 10:40
  • Other suggestion would be to create the news dataframe with the columns argument, as you do with data_export to tell pandas about the structure of the dataframe. Commented Jul 17, 2018 at 12:51
  • Thanks again, really appreciate it @stellasia! Still not working though - very frustrating. Commented Jul 17, 2018 at 17:10
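
The suggestion above to fix the structure of the news DataFrame up front can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the per-file frames are made-up stand-ins, and the frames are collected in a list and concatenated once because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Column layout taken from the question's data_export step.
columns = ["AN", "BY", "SN", "LP", "TD"]

# Hypothetical per-file results; in the real loop each would come from
# processing one HTML file.
frames = []
for i in range(2):
    row = {"AN": f"AN{i}", "BY": "author", "SN": "source", "LP": "lead", "TD": "text"}
    frames.append(pd.DataFrame([row]))

if frames:
    # Concatenate once, then enforce the expected column order.
    news = pd.concat(frames, ignore_index=True).reindex(columns=columns)
else:
    # No files processed: still return a frame with the expected structure.
    news = pd.DataFrame(columns=columns)

print(news)
```

Any column missing from a given file shows up as NaN after the reindex, rather than changing the shape of the result.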

1 Answer

The HTML files were formatted slightly differently from one another, so I needed to pass sort=False to pd.concat():

data = pd.concat([art for art in articles if 'HD' in art.index.values], sort=False, axis=1).T.set_index('AN')

The sort argument is new in pandas version 0.23.0. That solved the problem.
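
To illustrate what this call does when the per-article tables do not share exactly the same fields, here is a small sketch with made-up data. The frames art_a and art_b are hypothetical stand-ins for entries of articles; only the field codes HD, AN, BY and SN come from the question.

```python
import pandas as pd

# Two hypothetical article tables whose indexes (field codes) differ slightly,
# mimicking HTML files with slightly different formatting.
art_a = pd.DataFrame({0: ["Headline A", "A1", "Author A"]}, index=["HD", "AN", "BY"])
art_b = pd.DataFrame({0: ["Headline B", "B1", "Source B"]}, index=["HD", "AN", "SN"])

# Align the tables side by side on their index; sort=False asks pandas not to
# sort the combined index.
combined = pd.concat([art_a, art_b], sort=False, axis=1)

# Transpose so each article becomes a row, then index the rows by 'AN'.
data = combined.T.set_index("AN")

print(data)
```

Fields that an article lacks (BY for the second article, SN for the first) end up as NaN in that article's row.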
