5

I would like to extract multiple tables of a series of excel spreadsheets where some sheets may contain more than one table, to store the tables separately as e.g. csv files. The table could be something like this:

enter image description here

if I read it using pandas read_excel

import pandas as pd
pd.read_excel('table_example.xlsx',header=None)

I would get something like this:

enter image description here

how could I extract the different tables? In my case tables have NaN values, which is potentially an additional complication.

[EDIT1] something similar to the excel tables can be generated using pandas:

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

I have tried to find the limits of the tables using built-in pandas functions:

df[df.isnull().all(axis=1)]

I could use the first and second row to set an horizontal division and maybe make a first split but I don't know how to select the cells above or below the identified rows. Or even if this is the most convenient approach.

disclaimer: in my case the tables always have a tag in the row above the header, this is because these tables are read by a non-python software which uses them to identify where the table starts. I decided to not consider these tags to ask a more generic problem, that other people may encounter.

3
  • The pandas parsing code is already quite complex in an effort to do the right thing in most cases. This seems unlikely to ever be one of the right cases. So I would suggest that it would seem that the markers are there for a good reason. Commented Apr 2, 2017 at 21:44
  • I'm lazy but if you provide a dump of your example I can load instead of images I may show you an possible approach Commented Apr 3, 2017 at 13:48
  • 1
    Assuming your tables are nicely structured, you can try to identify entire rows of NaN, and use those indexes to slice your df (This assumes that each table has at least blank row before and after the table). Then for each slice, just drop all the NaN columns. Commented Apr 3, 2017 at 14:36

2 Answers 2

5

This may help to dynamically locate and extract the table as long as the 2 tables are separated by either a row or column of NaNs.

I used the boundingbox solution from https://stackoverflow.com/a/54675526

from skimage.measure import label, regionprops
#this basically converts your table into 0s and 1s where 0 is NaN and 1 for non NaN 
binary_rep = np.array(df.notnull().astype('int'))

list_of_dataframes = []
l = label(binary_rep)
for s in regionprops(l):
    #the bbox contains the extremes of the bounding box. So the top left and bottom right cell locations of the table.
    list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])

Sign up to request clarification or add additional context in comments.

1 Comment

The code extracts single line and paragraph as tables. To avoid this, we filter the table/data frame which contains at least two columns. But tables with single column and multiple row will be discarded. In most cases, table contains at least two columns, so it is ok to apply the filter in loop to avoid single line and paragraph. Here is the updated code in for loop if df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]].shape[1]>=2: list_of_dataframes.append(df.iloc[s.bbox[0]:s.bbox[2],s.bbox[1]:s.bbox[3]])
4
import numpy as np
import pandas as pd

# I have assumed that the tables are "separated" by at least one row with only NaN values

df=pd.DataFrame(np.nan,index=range(0,10),columns=range(0,10))
df.iloc[1,2:5]=['t1h1','t1h2','t1h3']
df.iloc[2:5,2:5]=np.random.randn(3,3)
df.iloc[6,3:7]=['t2h1','t2h2','t2h3','t2h4']
df.iloc[7:9,3:7]=np.random.randn(2,4)

print(df)

# Extract by rows

nul_rows = list(df[df.isnull().all(axis=1)].index)

list_of_dataframes = []
for i in range(len(nul_rows) - 1):
    list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])


# Remove null columns

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

# cleaned_tables is a list of the dataframes

print(cleaned_tables[0])
print(cleaned_tables[1])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.