
Suppose I have a data frame with three columns with dtypes (object, int, and float):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})

I need to replace the np.nan with None, which is an object (since None becomes NULL when the data is imported into PostgreSQL).

df.replace({np.nan: None}, inplace=True)

I think (correct me if I'm wrong) None cannot be stored in a NumPy/pandas array except one with dtype object. So 'col1' above becomes an object column after the replace. Now, if I want to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), since that returns all object-dtype columns, including 'col1'. I've been working around this with this hacky solution:

# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack, after this, 'col1' becomes float again since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))

I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
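For reference, one less hacky route is a sketch built on `infer_objects()`, which soft-converts object columns holding numeric data back to numeric dtypes, so only genuinely string columns keep dtype object (assuming the string columns don't hold purely numeric strings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
df.replace({np.nan: None}, inplace=True)  # 'col1' is now dtype object

# infer_objects() soft-converts 'col1' back to float64 (None -> NaN),
# leaving only the true string columns with dtype object
string_cols = df.infer_objects().select_dtypes(include=object)
print(string_cols.columns.tolist())
```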

Another solution

Based on Mayank Porwal's solution below:

# The list comprehension returns a boolean list
df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns.tolist()]]
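Put together as a runnable sketch on the sample frame (note the caveat that a string column containing only numeric strings like "1" would be coerced successfully and therefore misclassified as numeric):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
}).replace({np.nan: None})

# A column is "string-like" if coercing every value to numeric yields all NaN
mask = [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns]
string_df = df.loc[:, mask]
print(string_df.columns.tolist())
```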

2 Answers


Based on your sample df, you can do something like this:

After replacing np.nan with None, col1 becomes an object:

In [1413]: df.dtypes
Out[1413]: 
col1    object
col2     int64
col3    object
dtype: object

To pick the columns that contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the coerced column is all NaN using isna:

In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']

In [1424]: for i in cols:
      ...:     if pd.to_numeric(df[i], errors='coerce').isna().all():
      ...:         print(f'{i}: String col')
      ...:     else:
      ...:         print(f'{i}: number col')
      ...: 
col1: number col
col3: String col

1 Comment

I like this. Nice use of the coerce argument in pd.to_numeric(). Definitely more robust and idiomatic. And it fits well into the framework of my function for my use case.

Reverse your two operations:

  1. Extract the object columns and process them.
  2. Convert NaN to None just before exporting to PostgreSQL.
>>> df.dtypes
col1    float64
col2      int64
col3     object
dtype: object

# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))

# Step 2: replace nan by None
>>> df.replace({np.nan: None}, inplace=True)

>>> df
   col1  col2                   col3
0   1.0     3  THIS IS A TEXT COLUMN
1   2.0     4  THIS IS A TEXT COLUMN
2  None     5  THIS IS A TEXT COLUMN
3   5.0     4  THIS IS A TEXT COLUMN
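As a self-contained sketch of this reordered flow (the actual export step, e.g. `to_sql`, is omitted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})

# Step 1: dtypes are still clean, so selecting the string columns is trivial
df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))

# Step 2: only now swap NaN for None, right before the export
df = df.replace({np.nan: None})
```

This keeps 'col1' as float64 for the whole processing stage, so select_dtypes behaves as expected, and the lossy object conversion happens only at the boundary with the database.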

