
Suppose I have a data frame with three columns with dtypes (object, int, and float):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})

I need to replace the np.nan with None, which is an object (since None becomes NULL when the data is imported into PostgreSQL).

df.replace({np.nan: None}, inplace=True)

I think (correct me if I'm wrong) None cannot be stored in a NumPy/pandas array except one with dtype object. So 'col1' above becomes an object column after the replace. Now, if I want to subset only the string columns (which in this case should only be 'col3'), I can no longer use df.select_dtypes(include=object), since that returns all object-dtype columns, including 'col1'. I've been working around this with this hacky solution:

# Select only object columns, which includes 'col1'
(df.select_dtypes(include=object)
   # Hack, after this, 'col1' becomes float again since None becomes np.nan
   .apply(lambda col: col.apply(lambda val: val))
   # Now select only the object columns
   .select_dtypes(include=object))

I'm wondering if there are idiomatic (or less hacky) ways to accomplish this. The use case really arose since I need to get the string columns from a data frame where there are numeric (float or int) columns with missing values represented by None rather than np.nan.
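For reference, one less hacky route is a sketch built on `infer_objects()`, which soft-converts object columns holding numeric data back to numeric dtypes, so only genuinely string columns keep dtype object (assuming the string columns don't hold purely numeric strings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})
df.replace({np.nan: None}, inplace=True)  # 'col1' is now dtype object

# infer_objects() soft-converts 'col1' back to float64 (None -> NaN),
# leaving only the true string columns with dtype object
string_cols = df.infer_objects().select_dtypes(include=object)
print(string_cols.columns.tolist())
```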

Another solution

Based on Mayank Porwal's solution below:

# The list comprehension returns a boolean list
df.loc[:, [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns.tolist()]]
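Put together as a runnable sketch on the sample frame (note the caveat that a string column containing only numeric strings like "1" would be coerced successfully and therefore misclassified as numeric):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
}).replace({np.nan: None})

# A column is "string-like" if coercing every value to numeric yields all NaN
mask = [pd.to_numeric(df[col], errors='coerce').isna().all() for col in df.columns]
string_df = df.loc[:, mask]
print(string_df.columns.tolist())
```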

2 Answers


Based on your sample df, you can do something like this:

After replacing np.nan with None, col1 becomes an object:

In [1413]: df.dtypes
Out[1413]: 
col1    object
col2     int64
col3    object
dtype: object

To pick the columns that contain only strings, you can use pd.to_numeric with errors='coerce' and check whether the coerced column is all NaN using isna:

In [1416]: cols = df.select_dtypes('object').columns.tolist()
In [1422]: cols
Out[1422]: ['col1', 'col3']

In [1424]: for i in cols:
      ...:     if pd.to_numeric(df[i], errors='coerce').isna().all():
      ...:         print(f'{i}: String col')
      ...:     else:
      ...:         print(f'{i}: number col')
      ...: 
col1: number col
col3: String col

1 Comment

I like this. Nice use of the coerce argument in pd.to_numeric(). Definitely more robust and idiomatic. And it fits well into the framework of my function for my use case.

Reverse your two operations:

  1. Extract the object columns and process them.
  2. Convert NaN to None just before exporting to PostgreSQL.
>>> df.dtypes
col1    float64
col2      int64
col3     object
dtype: object

# Step 1: process string columns
>>> df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))

# Step 2: replace nan by None
>>> df.replace({np.nan: None}, inplace=True)

>>> df
   col1  col2                   col3
0   1.0     3  THIS IS A TEXT COLUMN
1   2.0     4  THIS IS A TEXT COLUMN
2  None     5  THIS IS A TEXT COLUMN
3   5.0     4  THIS IS A TEXT COLUMN
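As a self-contained sketch of this reordered flow (the actual export step, e.g. `to_sql`, is omitted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 5],
    'col2': [3, 4, 5, 4],
    'col3': ['This is a text column'] * 4
})

# Step 1: dtypes are still clean, so selecting the string columns is trivial
df.update(df.select_dtypes('object').agg(lambda x: x.str.upper()))

# Step 2: only now swap NaN for None, right before the export
df = df.replace({np.nan: None})
```

This keeps 'col1' as float64 for the whole processing stage, so select_dtypes behaves as expected, and the lossy object conversion happens only at the boundary with the database.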

