Suppose I have a data frame with three columns of dtypes float, int, and object:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
I need to replace the np.nan with None, which is an object (since None becomes NULL when the data is imported into PostgreSQL).
df.replace({np.nan: None}, inplace=True)
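For context, a quick check of what the replace does to the dtypes (my own illustration of the behavior described below):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4,
})
print(df.dtypes['col1'])   # float64 before the replace

df = df.replace({np.nan: None})
print(df.dtypes['col1'])   # object after the replace: storing None forces the upcast
```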
I think (correct me if I'm wrong) that None cannot be stored in a NumPy/pandas array unless the array has dtype object, so 'col1' above becomes an object column after the replace. Now, if I want to subset only the string columns (which in this case should be just 'col3'), I can no longer use df.select_dtypes(include=object), because that returns all object-dtype columns, including 'col1'. I've been working around this with this hacky solution:
# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
# Hack, after this, 'col1' becomes float again since None becomes np.nan
.apply(lambda col: col.apply(lambda val: val))
# Now select only the object columns
.select_dtypes(include=object))
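Incidentally, the round trip this hack performs looks like what `DataFrame.infer_objects` does directly: it soft-converts object columns that actually hold numeric data back to a numeric dtype (None becomes NaN again). A sketch of that shortcut, my own suggestion rather than part of the original question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4,
}).replace({np.nan: None})

# infer_objects() turns 'col1' back into float64 (None -> NaN),
# so select_dtypes afterwards keeps only the genuine string column
str_df = df.infer_objects().select_dtypes(include=object)
print(str_df.columns.tolist())
```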
I'm wondering whether there are idiomatic (or at least less hacky) ways to accomplish this. The use case arose because I need to get the string columns from a data frame whose numeric (float or int) columns have missing values represented by None rather than np.nan.
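One other direction (a sketch of my own, not from the original question): ignore dtypes entirely and inspect the Python types of the values themselves, which sidesteps the None/NaN distinction. One caveat: a column containing only None would pass the `all()` check vacuously.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4,
}).replace({np.nan: None})

# Keep a column only if every non-null value is a Python str
is_str_col = df.apply(lambda col: col.dropna().map(type).eq(str).all())
str_df = df.loc[:, is_str_col]
print(str_df.columns.tolist())
```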
Another solution
Based on Mayank Porwal's solution below:
# The list comprehension returns a boolean list
df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all()
           for col in df.columns]]
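A quick sanity check of that approach on the example frame (my own illustration), with one caveat worth noting: a text column containing only numeric strings such as '1' would be coerced successfully by to_numeric and therefore dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4,
}).replace({np.nan: None})

# A column survives only if to_numeric fails (coerces to NaN) for every value
non_numeric = df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all()
                         for col in df.columns]]
print(non_numeric.columns.tolist())
```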