Handling empty value in Excel by pandas read_excel()

Question

I commented a closed issue in Pandas github repo:

Treating an empty value in Excel as nan has another side effect: integer will be converted to float. Subsequent operation on that columns will have other effects again.

Also read_excel() does not honor empty value handling offered by function in converters:

I have an Excel file temp.xlsx with the following data:

There are white spaces surrounding values in Key3 column.

Key1,Key2,Key3,Key4
0,11,  Apple  ,1.12
1,12,,1.02
2,13,  Orange,
3,  ,Banana  ,0.01

This is the code:

import numpy as np
import pandas as pd

def handle_string(value):
    return value.replace(' ', '')

def handle_integer(value):
    if value == '':
        return 0
    else:
        int(value)

def handle_float(value):
    if value == '':
        return 0.0
    else:
        float(value)

df = pd.read_excel(
        'temp.xlsx',
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

print('')

df = pd.read_excel(
        'temp.xlsx',
        converters={\
            'Key1' : handle_integer,
            'Key2' : handle_integer,
            'Key3' : handle_string,
            'Key4' : handle_float,
        }
)
print(df)
print(f"type(df.loc[3,'Key2']) = {type(df.loc[3,'Key2'])}")
print(f"type(df.loc[1,'Key3']) = {type(df.loc[1,'Key3'])}")
print(f"type(df.loc[2,'Key4']) = {type(df.loc[2,'Key4'])}")

The output:

   Key1  Key2        Key3  Key4
0     0  11.0     Apple    1.12
1     1  12.0         NaN  1.02
2     2  13.0    Orange     NaN
3     3   NaN    Banana    0.01
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

   Key1  Key2    Key3  Key4
0  None   NaN   Apple   NaN
1  None   NaN     NaN   NaN
2  None   NaN  Orange   0.0
3  None   0.0  Banana   NaN
type(df.loc[3,'Key2']) = <class 'numpy.float64'>
type(df.loc[1,'Key3']) = <class 'float'>
type(df.loc[2,'Key4']) = <class 'numpy.float64'>

dtype parameters have lower priority than converters.

Charles Landau · Accepted Answer · 2018-12-11 16:34:31Z

1

I could be wrong, but it looks to me like the problem is related to your return values for these functions. In two places, you return None without apparently intending to. See below:

def handle_string(value):
    return value.replace(' ', '')

def handle_integer(value):
    if value == '':
        return 0
    else:
        int(value) # Returns none!!!

def handle_float(value):
    if value == '':
        return 0.0
    else:
        float(value) # Returns none!!!

answered Dec 11, 2018 at 16:34

Charles Landau

4,2751 gold badge13 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Charles Landau Over a year ago

@yoonghm great! If that addresses your original question please mark the post as answered

Collectives™ on Stack Overflow

Handling empty value in Excel by pandas read_excel()

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related