I am trying to import a series of HTML files with news articles that I have saved in my working directory. I developed the code using one single HTML files and it was working perfectly. However, I have since amended the code to import multiple files.
As you can see from the code below I am using pandas and pd.read_html(). It no longer imports any files and give me the error code 'ValueError: No tables found'.
I have tried with different types of HTML files so that doesn't seem to be the problem. I have also updated all of the packages that I am using. I am using OSX and Python 3.6 and Pandas 0.20.3 in Anaconda Navigator.
It was working, now it's not. What am I doing wrong?
Any tips or clues would be greatly appreciated.
import pandas as pd
from os import listdir
from os.path import isfile, join, splitext
import os
mypath = 'path_to_my_wd'
raw_data = [f for f in listdir(mypath) if (isfile(join(mypath, f)) and splitext(f)[1]=='.html')]
news = pd.DataFrame()
for htmlfile in raw_data:
articles = pd.read_html(join(mypath, htmlfile), index_col=0) #reads file as html
data = pd.concat([art for art in articles if 'HD' in art.index.values],
axis=1).T.set_index('AN')
data_export = pd.DataFrame(data, columns=['AN', 'BY', 'SN', 'LP', 'TD'])
#selects columns to export
news = news.append(data_export)
join(mypath, raw_data)inpd.read_htmljoin(mypath, htmlfile)inpd.read_html, but this doesn't make a difference. I have amended the code. Any other suggestions?newsdataframe with thecolumnsargument, as you do withdata_exportto tell pandas about the structure of the dataframe.