Python: Combining Two Rows with Pandas read_excel

Question

I am reading an Excel file using Pandas and I feel like there has to be a better way to handle the way I create column names. This is something like the Excel file I'm reading:

                1       2      # '1' is merged across the two cells above 'a' and 'b';
    Date        a   b   c   d  # likewise for '2' (merged cells, as opposed to 'center across selection')
1   1-Jan-19    100 200 300 400
2   1-Feb-19    101 201 301 401
3   1-Mar-19    102 202 302 402

I want to merge the 'a', 'b', 'c', and 'd' column heads with the '1' and '2' above them, so I'm doing the following to get my headers the way I want:

import pandas as pd
import json

xls = pd.ExcelFile(r'C:\Path_to\Excel_Pandas_Connector_Test.xls')
df = pd.read_excel(xls, 'Sheet1', header=[1])  # uses the abcd row as column names

#  I only want the most recent day of data so I do the following
json_str = df[df.Date == df['Date'].max()].to_json(orient='records',date_format='iso')

dat_data = json.loads(json_str)[0]

def clean_json():
    global dat_data
    dat_data['1a']      = dat_data.pop('a')
    dat_data['1b']      = dat_data.pop('b')
    dat_data['2c']      = dat_data.pop('c')
    dat_data['2d']      = dat_data.pop('d')

clean_json()

print(json.dumps(dat_data,indent=4))

My desired output is:

{
"Date": "2019-03-01T00:00:00.000Z",
"1a": 102,
"1b": 202,
"2c": 302,
"2d": 402
}

This works as written, but is there a Pandas built-in that I could have used to do the same thing instead of the clean_json function?

Graipher · Accepted Answer · 2019-04-12 08:28:37Z

Yes, there is an easier way, using pandas.Index.get_level_values.

First, I could only reproduce your example dataframe by reading the file with df = pd.read_excel("/tmp/temp.xls", header=[0, 1]), which picks up both header rows.

Then you can just do this:

import pandas as pd
import json

# read df
df = pd.read_excel("/tmp/temp.xls", header=[0, 1])
df.index = pd.to_datetime(df.index)

# combine multilevel columns to one level
df.columns = (pd.Series(df.columns.get_level_values(0)).apply(str)
              + pd.Series(df.columns.get_level_values(1)).apply(str))

print(df)
#              1a   1b   2c   2d
# 2019-01-02  100  200  300  400
# 2019-01-02  101  201  301  401
# 2019-01-03  102  202  302  402

# get Date as a column
df = df.reset_index()
df.columns = ["Date"] + list(df.columns[1:])
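
For readers without the Excel file at hand, the flattening step can be tried on an in-memory frame. This is only a sketch: the MultiIndex, dates, and values below are hypothetical stand-ins built to mirror the sheet in the question, and the list comprehension is an alternative spelling of the same idea rather than the code used in this answer.

```python
import pandas as pd

# Hypothetical stand-in for the sheet: two header levels over four columns.
columns = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (2, "c"), (2, "d")])
df = pd.DataFrame(
    [[100, 200, 300, 400], [101, 201, 301, 401], [102, 202, 302, 402]],
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    columns=columns,
)

# Iterating over a MultiIndex yields one tuple per column, e.g. (1, "a").
# Join each (upper, lower) pair into a single string label.
df.columns = [f"{upper}{lower}" for upper, lower in df.columns]

print(list(df.columns))
```

The f-string converts both levels to text, so it does not matter whether the upper header was read as the integer 1 or the string "1".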

With Date available as a column, you can do something similar to what you are already doing, but get the index of the maximum directly instead of comparing every value to the maximum:

json_data = json.loads(df.loc[df.Date.idxmax()].to_json(date_format='iso'))
print(json.dumps(json_data, indent=4))

This produces output in the desired format:

{
    "Date": "2019-01-03T00:00:00.000Z",
    "1a": 102,
    "1b": 202,
    "2c": 302,
    "2d": 402
}
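
The row-selection step can also be checked in isolation. The sketch below assumes a small hand-built frame (hypothetical values matching the question's sheet) whose columns are already flattened; it picks the latest row with idxmax and round-trips it through JSON the same way as above.

```python
import json

import pandas as pd

# Hypothetical frame mirroring the flattened result; not read from Excel.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
        "1a": [100, 101, 102],
        "1b": [200, 201, 202],
        "2c": [300, 301, 302],
        "2d": [400, 401, 402],
    }
)

# idxmax returns the index label of the first row holding the latest date,
# so .loc yields a single row as a Series rather than a filtered DataFrame.
latest = df.loc[df["Date"].idxmax()]

# Serialise that one row; date_format="iso" writes the date as an ISO 8601 string.
record = json.loads(latest.to_json(date_format="iso"))
print(json.dumps(record, indent=4))
```

One difference from the boolean mask in the question: if several rows share the latest date, the mask keeps all of them, while idxmax returns only the first. The exact ISO suffix on the date (with or without a trailing Z) may vary between pandas versions.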

Thanks, that's really concise and works well. I can see that there is a lot to learn in Pandas.
– Virgilio, Commented Apr 12, 2019 at 12:24
