I have an excel sheet that looks like the following.

[Image: screenshot of the Excel sheet, not reproduced here]

I would like to be able to extract each table into a pandas DataFrame within my Python script (e.g. df1 = table_header, df2 = table_header_2). This issue has been dealt with both here and here. The first answer is hidden behind a paywall. As for the second, I believe @Rotem provided quite an eloquent solution; however, upon applying it I ran into issues detecting the beginning of the first table, and with indexing. I may be able to solve these issues with a little help, but there was another idea I wanted to explore.

If I know the names of the table headers, can expect them to exist for every table, and know I can find their index using openpyxl, can I perform some sort of edge detection, similar to what @Rotem used in the second link I provided, to extract all cells attached to that table header? And is there a simpler way of doing it than iterating over rows/columns and detecting a change in the number of non-None values? It is important to note that, even though I know the table header names, I do not necessarily know the index of those headers, as the size of the tables may change. This solution appears to do something very much along these lines; however, I fail to understand how it extracts all the cells from the associated and attached table. I am finding myself a little out of my depth with this one.

Thanks in advance for your suggestions.

2 Answers

-# This answer is for data which is NOT contained in 'Excel Tables' #-

I'll add this as a second answer so any comments on it don't add to the long trail on the other answer.

Where the data is not contained in a table, it's necessary to find the range's "top left cell" (tlc) and "bottom right cell" (brc).
In this example, working with the same data, the code looks for the "header" name. I'm using 'table header 1' and 'table header 2' as the demarcation for the two sections (I changed the name in cell 'A1' to 'table header 1'). The headers are added to a list, section_headers, which contains all the header names used in the sheet.

  1. Given that the two data sets in the example both have their tlc in column A, I am only searching that column. If this is not the case in your actual sheet, you may need to search other columns as well: specific columns if only those can hold the tlc, or the whole used range if the headers could occur anywhere.
  2. The code checks the value of each cell in column A, from A1 to the last used row. If it finds a value that matches one of the headers in the list 'section_headers', it then tries to find the range of the section by checking each cell one row down, across the columns, until it hits an empty cell (i.e. one whose value is Python None). It then does the same down the rows.
  3. Once it gets the last column and row (i.e. the brc), it converts the range to a df with a function like the one used before.

This code determines the last column and last row from the first cell below the header (so for 'table header 1' this is cell 'A2'). It therefore assumes that the data is rectangular: every row and column in the section has the same extent as the one measured from that cell.

from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
from openpyxl.utils.cell import get_column_letter as gcl
from openpyxl.utils.cell import coordinate_from_string as cfs
import pandas as pd


def convert_rng_to_df(tlc, l_col, l_row, sheet):
    first_col = cfs(tlc)[0]
    first_row = cfs(tlc)[1]

    rng = f"{first_col}{first_row+1}:{l_col}{l_row}"

    data_rows = []
    for row in sheet[rng]:
        data_rows.append([cell.value for cell in row])

    return pd.DataFrame(data_rows, columns=get_column_interval(first_col, l_col))


filename = 'foo.xlsx'
wb = load_workbook(filename)
ws = wb['Sheet1']

### Add the names of each section header to this list
section_headers = ['table header 1', 'table header 2']

last_col = ''
last_row = ''
df_dict = {}  # Dictionary to hold the dataframes
for cell in ws['A']:  # Looping Column A only
    if cell.value in section_headers:
        tblname = cell.value  # Header of the Data Set found
        tlc = cell.coordinate  # Top Left Cell of the range
        start_row = cfs(tlc)[1]  # Row number of the header cell
        for x in range(1, ws.max_column+1):  # Find the last used column for the data in this section
            if cell.offset(row=1, column=x).value is None:
                last_col = gcl(x)
                break
        for y in range(1, ws.max_row+1):  # Find the last used row; the +1 lets the scan step past the sheet's final row
            if cell.offset(row=y, column=1).value is None:
                last_row = (start_row + y) - 1
                break

        print(f"Range to convert for '{tblname}' is: '{tlc}:{last_col}{last_row}'")
        df_dict[tblname] = convert_rng_to_df(tlc, last_col, last_row, ws)  # Convert to dataframe

print("\n")
### Print the DataFrames
for table_name, df in df_dict.items():
    print(f"DataFrame from '{table_name}'")
    print(df)
    print("----------------------------------\n")

The output from this code

Range to convert for 'table header 1' is: 'A1:B8'
Range to convert for 'table header 2' is: 'A10:C15'


DataFrame from 'table header 1'
        A        B
0  value1  value11
1  value2  value12
2  value3  value13
3  value4  value14
4  value5  value15
5  value6  value16
6  value7  value17
----------------------------------

DataFrame from 'table header 2'
        A        B        C
0  valueA  valueAA  valueBA
1  valueB  valueAB  valueBB
2  valueC  valueAC  valueBC
3  valueD  valueAD  valueBD
4  valueE  valueAF  valueBE
----------------------------------
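
The boundary scan above can also be tried without a workbook. The following is only a sketch of the same idea on a plain list-of-lists grid, extended to look for headers in any column rather than column A only. The names find_headers, cell_value and block_extent are made up for this illustration, and the grid is sample data, not the sheet from the question. It still assumes each block is rectangular and is measured from the row directly below its header.

```python
def find_headers(grid, names):
    """Yield (row, col, name) for every cell whose value is a known section header."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value in names:
                yield r, c, value


def cell_value(grid, r, c):
    """Return grid[r][c], or None when the position lies outside the grid."""
    if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        return grid[r][c]
    return None


def block_extent(grid, r, c):
    """Return (last_row, last_col) of the data under the header at (r, c).

    Assumption: the block is rectangular, so one scan to the right along the
    first data row and one scan down the header's column are enough.
    """
    last_col = c
    while cell_value(grid, r + 1, last_col + 1) is not None:
        last_col += 1
    last_row = r
    while cell_value(grid, last_row + 1, c) is not None:
        last_row += 1
    return last_row, last_col


# Sample data: the second header deliberately sits in the second column.
grid = [
    ["table header 1", None, None, None],
    ["value1", "value11", None, None],
    ["value2", "value12", None, None],
    [None, None, None, None],
    [None, "table header 2", None, None],
    [None, "valueA", "valueAA", "valueBA"],
    [None, "valueB", "valueAB", "valueBB"],
]

sections = {}
for r, c, name in find_headers(grid, {"table header 1", "table header 2"}):
    last_row, last_col = block_extent(grid, r, c)
    # Slice out the data rows below the header, restricted to the block's columns.
    sections[name] = [row[c:last_col + 1] for row in grid[r + 1:last_row + 1]]
```

With openpyxl, a grid of this shape could presumably be built with [list(row) for row in ws.iter_rows(values_only=True)], and each entry of sections passed to pd.DataFrame.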

1 Comment

Wow, this is incredible! Exactly what I was trying to accomplish. Thanks a million!

-# This answer is for data which is contained in 'Excel Tables' #-

You can obtain the table information (co-ordinates or range) using Openpyxl and use a common method to read that range into a DataFrame.

To make it clearer I have changed your example tables to have unique values and headers.

from openpyxl import load_workbook
from openpyxl.utils import get_column_interval
import pandas as pd
from openpyxl.utils.cell import coordinate_from_string as cfs


def convert_rng_to_df(tbl_coords, sheet):
    col_start = cfs(tbl_coords.split(':')[0])[0]
    col_end = cfs(tbl_coords.split(':')[1])[0]

    data_rows = []
    for row in sheet[tbl_coords]:
        data_rows.append([cell.value for cell in row])

    df = pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))

    df.columns = df.iloc[0] # Change headers to first row
    df = df[1:]  # remove first row from DataFrame to remove the duplicate

    return df


filename = 'foo.xlsx'
wb = load_workbook(filename)
ws = wb['Sheet1']

### Dictionary to hold the dfs for each table
df_dict = {}

### Get the table coordinates from the worksheet table dictionary
for tblname, tblcoord in ws.tables.items():
    print(f'Table Name: {tblname}, Coordinate: {tblcoord}')
    df_dict[tblname] = convert_rng_to_df(tblcoord, ws)  # Convert to dataframe

### Print the DataFrames
for table_name, df in df_dict.items():
    print(f"DataFrame from Table '{table_name}'")
    print(df)
    print("----------------------------------\n")

Updated output:
Note: the column headers now include the index number from the 1st row as the index header. If that is not desired, the code can be changed to omit it.

Table Name: Table1, Coordinate: A1:B8
Table Name: Table2, Coordinate: A10:C15

DataFrame from Table 'Table1'
0 table header  Column1
1       value1  value11
2       value2  value12
3       value3  value13
4       value4  value14
5       value5  value15
6       value6  value16
7       value7  value17
----------------------------------

DataFrame from Table 'Table2'
0 table header 2  Column1  Column2
1         valueA  valueAA  valueBA
2         valueB  valueAB  valueBB
3         valueC  valueAC  valueBC
4         valueD  valueAD  valueBD
5         valueE  valueAF  valueBE
----------------------------------
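
Regarding the note about the index header: one possible way to avoid the leading 0 is to split the header row off before building the DataFrame, rather than promoting row 0 afterwards. This is a sketch only; data_rows here is hard-coded sample data standing in for the list the function builds from the table range.

```python
import pandas as pd

# Stand-in for the cell values read from a table range (header row first).
data_rows = [
    ["table header", "Column1"],
    ["value1", "value11"],
    ["value2", "value12"],
]

# Use the first row as the column labels and the remaining rows as data.
# Because the labels come from a plain list, the columns index has no name,
# and the row index starts at 0 instead of 1.
df = pd.DataFrame(data_rows[1:], columns=data_rows[0])
```

If the existing function is kept as it is, setting df.columns.name = None after promoting the first row should have a similar effect on the header line.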

12 Comments

Wow, looking good. Just trying to make sense of it. So if I understand correctly, I pass a df 'tables' which contains each table name and the coordinates of its header?
Hmm, but I can only find out the coordinate of the table header (a single cell); I do not know what the range of the table is. I am not sure, but I don't think that is what convert_rng_to_df is doing, is it?
Yes, the co-ordinates are extracted from the table data; table 1's co-ordinates are A1 - B8, which is the whole range of the table, headers and data. So too for table 2: its range is A10 - C15. These ranges are used to create the df; that is all that is passed to the function (other than the sheet object).
Right, but are those values (e.g. A1 - B8) extracted by the script? My understanding is that I have to input those cell coordinates. I do not know what they might be if a new table is input. The table header names will remain the same but the size of their associated tables may change.
No, the co-ordinates are extracted by Openpyxl from its worksheet object. ws.tables, where ws is the worksheet object, is a dictionary. The key is the table name and the value is the table co-ordinates (or range). So we loop over the dictionary (ws.tables.items()), extracting both of these items, and then use the value (range) in the function. The table will be whatever size it is when Openpyxl reads it.
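
To illustrate the point that the range string alone carries the table's size, here is a small sketch that works out the number of rows and columns from ranges like the ones printed in the answer. The tables dict is a hard-coded stand-in for the (name, range) pairs the answer loops over, and range_shape is a hypothetical helper, not part of openpyxl.

```python
import re


def range_shape(ref):
    """Return (n_rows, n_cols) covered by an A1-style range such as 'A10:C15'."""

    def col_index(letters):
        # Convert column letters to a 1-based number: A -> 1, Z -> 26, AA -> 27.
        n = 0
        for ch in letters:
            n = n * 26 + (ord(ch) - ord("A") + 1)
        return n

    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"not an A1-style range: {ref!r}")
    col1, row1, col2, row2 = match.groups()
    return int(row2) - int(row1) + 1, col_index(col2) - col_index(col1) + 1


# Stand-in for the (table name, range) pairs from the answer's output.
tables = {"Table1": "A1:B8", "Table2": "A10:C15"}

# Row counts include the header row of each table.
shapes = {name: range_shape(ref) for name, ref in tables.items()}
```

So if a table grows, the range read from the workbook changes with it and nothing has to be entered by hand.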