-# This answer is for data which is NOT contained in 'Excel Tables' #-
I'll add this as a second answer so any comments on it don't add to the long trail on the other answer.
Where the data is not contained in a table, it's necessary to find the range's "top left cell" (tlc) and "bottom right cell" (brc).
In this example, working with the same data, the code looks for the "header" name. I'm using 'table header 1' and 'table header 2' as the demarcation for the two sections (I changed the name in cell 'A1' to 'table header 1'). The headers are added to a list, section_headers, which contains all the header names used in the sheet.
- Given that the two data sets in the example both have their tlc in column A, I am only searching that column. If this is not the case in your actual sheet, you may need to search other columns: specific columns if only those can hold a tlc, or the whole used range if a tlc could occur anywhere.
- The code checks the value of each cell in column A, from A1 to the last used row. If it finds a value that matches one of the headers in the list 'section_headers', it then tries to find the extent of the section by checking each cell across the columns, starting one row below the header, until it hits an empty cell (i.e. one whose value is Python None). It then does the same down the rows.
- Once it gets the last column and row (i.e. the brc), it uses the same function as before to convert the range to a df.
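The scan described in the bullets above can be sketched without openpyxl, using a plain list of rows as a stand-in for the worksheet (None marks an empty cell). The names `find_sections` and `grid` are hypothetical and not part of the code in this answer. As an illustration of the "whole used range" case, this sketch looks for headers in every column, and it measures the last row down the section's first column.

```python
def find_sections(grid, section_headers):
    """Return {header: (top, left, bottom, right)} as 0-based indexes.

    grid is a list of rows; None marks an empty cell.
    The header cell is the top left cell; data starts one row below it.
    """
    def value_at(r, c):
        # Anything outside the grid counts as an empty cell.
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            return grid[r][c]
        return None

    sections = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value not in section_headers:
                continue
            # Walk right along the first data row until an empty cell.
            right = c
            while value_at(r + 1, right + 1) is not None:
                right += 1
            # Walk down the section's first column until an empty cell.
            bottom = r
            while value_at(bottom + 1, c) is not None:
                bottom += 1
            sections[value] = (r, c, bottom, right)
    return sections


# Hypothetical miniature of the sheet layout used in this answer.
grid = [
    ["table header 1", None, None],
    ["value1", "value11", None],
    ["value2", "value12", None],
    [None, None, None],
    ["table header 2", None, None],
    ["valueA", "valueAA", "valueBA"],
]
bounds = find_sections(grid, ["table header 1", "table header 2"])
print(bounds)
```

Each tuple gives the header row, first column, last data row and last column of a section, which is the same information the tlc and brc carry in the openpyxl version.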
This code determines the last column and last row starting from the first cell below the header (so for 'table header 1' this is cell 'A2'). It therefore assumes that the data is rectangular, i.e. every row and column in the section extends exactly as far as what is measured from that cell.
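If that assumption might not hold for your sheet, one option is to check the detected bounds before converting. The following is only a sketch on a plain list of rows (None for an empty cell); `check_section` and its 0-based arguments are hypothetical names for illustration, not part of openpyxl or of the code in this answer.

```python
def check_section(grid, top, left, bottom, right):
    """Return a list of problems found inside a detected section.

    top is the header row; data rows run from top + 1 to bottom,
    and data columns from left to right (all 0-based, inclusive).
    """
    problems = []
    for r in range(top + 1, bottom + 1):
        row = grid[r]
        for c in range(left, right + 1):
            # A gap inside the bounds means the data is not rectangular.
            if c >= len(row) or row[c] is None:
                problems.append(f"empty cell at row {r}, column {c}")
        # A value just past the last column means this row is wider
        # than the first data row the bounds were measured from.
        if right + 1 < len(row) and row[right + 1] is not None:
            problems.append(f"row {r} extends past column {right}")
    return problems


# Hypothetical section with one gap and one over-wide row.
ragged = [
    ["table header 1", None, None],
    ["value1", "value11", None],
    ["value2", None, None],
    ["value3", "value13", "extra"],
]
problems = check_section(ragged, top=0, left=0, bottom=3, right=1)
print(problems)
```

An empty list means the section matches what was measured from the first data cell; anything else is worth inspecting before trusting the resulting DataFrame.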
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
from openpyxl.utils.cell import get_column_letter as gcl
from openpyxl.utils.cell import coordinate_from_string as cfs
import pandas as pd


def convert_rng_to_df(tlc, l_col, l_row, sheet):
    first_col = cfs(tlc)[0]
    first_row = cfs(tlc)[1]
    # Data starts one row below the section header
    rng = f"{first_col}{first_row+1}:{l_col}{l_row}"

    data_rows = []
    for row in sheet[rng]:
        data_rows.append([cell.value for cell in row])

    return pd.DataFrame(data_rows, columns=get_column_interval(first_col, l_col))


filename = 'foo.xlsx'
wb = load_workbook(filename)
ws = wb['Sheet1']
### Add the names of each section header to this list
section_headers = ['table header 1', 'table header 2']
last_col = ''
last_row = ''
df_dict = {} # Dictionary to hold the dataframes
for cell in ws['A']:  # Looping Column A only
    if cell.value in section_headers:
        tblname = cell.value  # Header of the Data Set found
        tlc = cell.coordinate  # Top Left Cell of the range
        start_row = cfs(tlc)[1]  # Row number of the header cell

        for x in range(1, ws.max_column+1):  # Find the last used column for the data in this section
            if cell.offset(row=1, column=x).value is None:
                last_col = gcl(x)
                break

        # Find the last used row for the data in this section
        # (+1 so a section that ends on the last used row is still closed off)
        for y in range(1, ws.max_row+1):
            if cell.offset(row=y, column=1).value is None:
                last_row = (start_row + y) - 1
                break

        print(f"Range to convert for '{tblname}' is: '{tlc}:{last_col}{last_row}'")
        df_dict[tblname] = convert_rng_to_df(tlc, last_col, last_row, ws)  # Convert to dataframe
print("\n")
### Print the DataFrames
for table_name, df in df_dict.items():
    print(f"DataFrame from '{table_name}'")
    print(df)
    print("----------------------------------\n")
The output from this code:
Range to convert for 'table header 1' is: 'A1:B8'
Range to convert for 'table header 2' is: 'A10:C15'
DataFrame from 'table header 1'
        A        B
0  value1  value11
1  value2  value12
2  value3  value13
3  value4  value14
4  value5  value15
5  value6  value16
6  value7  value17
----------------------------------
DataFrame from 'table header 2'
        A        B        C
0  valueA  valueAA  valueBA
1  valueB  valueAB  valueBB
2  valueC  valueAC  valueBC
3  valueD  valueAD  valueBD
4  valueE  valueAF  valueBE
----------------------------------