Remove substring from multiple string columns in a pandas DataFrame

Question

I have a list of columns in a DataFrame that I want to loop over and apply an operation to. The columns hold datetimes or nothing.

For each column in the list, I would like to trim every value in the column that contains "20" in it to the first 10 characters, otherwise leave it as is.

I've tried this a few ways, but get a variety of errors or imperfect results.

The following version throws the error "'str' object has no attribute 'apply'", but if I don't use ".astype(str)", I get the error "argument of type 'datetime.datetime' is not iterable" instead.

df_combined[dateColumns] = df_combined[dateColumns].fillna(notFoundText).astype(str)
print(dateColumns)
for column in dateColumns:
    for row in range(len(column)):
        print(df_combined[column][row])
        if "20" in (df_combined[column][row]):
            df_combined[column][row].apply(lambda x: x[:10], axis=1)
        print(df_combined[column][row])

Halp. Thanks in advance.

cs95 · Accepted Answer · 2017-10-04 23:50:47Z

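Row-by-row loops are considered an abomination in pandas. I'd recommend looping over the columns only and vectorizing the work on each one with str.contains + np.where.

import numpy as np

for c in df.columns:
    # df[c] = df[c].astype(str)  # uncomment this if your columns aren't dtype=str
    # Trim to the first 10 characters only where the value contains "20".
    df[c] = np.where(df[c].str.contains("20", na=False), df[c].str[:10], df[c])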
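To tie this back to the question, here is a minimal sketch that applies the same str.contains + np.where idea to only the listed columns. The names df_combined, dateColumns and notFoundText come from the question; the sample values and column names are made up for illustration, and the columns are assumed to hold datetime.datetime objects or None.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical sample data: two object columns holding datetimes or nothing,
# plus one column that is not in the list and must stay untouched.
df_combined = pd.DataFrame(
    {
        "created": pd.Series([datetime(2017, 10, 4, 23, 50, 47), None], dtype=object),
        "closed": pd.Series([None, datetime(2017, 10, 5, 0, 44, 56)], dtype=object),
        "note": ["2017 budget review", "left alone"],
    }
)
dateColumns = ["created", "closed"]
notFoundText = "not found"

for c in dateColumns:
    # Fill the gaps first, then convert everything to strings.
    s = df_combined[c].fillna(notFoundText).astype(str)
    # Keep the first 10 characters where "20" appears, otherwise keep the value.
    df_combined[c] = np.where(s.str.contains("20", regex=False), s.str[:10], s)

print(df_combined)
```

Because the loop runs over the column names rather than over rows, each column is still processed in one vectorized step.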

piRSquared · Accepted Answer · 2017-10-05 00:44:56Z

IIUC:

You want to do this over the entire dataframe.
If so, here is a vectorized way using numpy over the entire dataframe at once.

Setup

df = pd.DataFrame([
    ['xxxxxxxx20yyyy', 'z' * 14, 'wwwwwwww20vvvv'],
    ['k' * 14, 'dddddddd20ffff', 'a' * 14]
], columns=list('ABC'))

df

                A               B               C
0  xxxxxxxx20yyyy  zzzzzzzzzzzzzz  wwwwwwww20vvvv
1  kkkkkkkkkkkkkk  dddddddd20ffff  aaaaaaaaaaaaaa

Solution
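Using np.char.find (the public alias of numpy.core.defchararray.find; importing from numpy.core is deprecated in NumPy 2.0) and np.where

import numpy as np

# Note: astype(str) also turns missing values into the string 'nan'.
v = df.values.astype(str)
# Row and column positions of every cell that contains '20'.
i, j = np.where(np.char.find(v, '20') > -1)

# Casting to a 10-character unicode dtype truncates those cells.
v[i, j] = v[i, j].astype('<U10')

df.loc[:] = v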

df

                A               B               C
0      xxxxxxxx20  zzzzzzzzzzzzzz      wwwwwwww20
1  kkkkkkkkkkkkkk      dddddddd20  aaaaaaaaaaaaaa

If you don't want to overwrite the old dataframe, you can create a new one:

pd.DataFrame(v, df.index, df.columns)

                A               B               C
0      xxxxxxxx20  zzzzzzzzzzzzzz      wwwwwwww20
1  kkkkkkkkkkkkkk      dddddddd20  aaaaaaaaaaaaaa
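For comparison, here is a pandas-only sketch of the same whole-frame operation. It is not part of the answer above: it swaps the NumPy character functions for DataFrame.mask, and the names has_20, trimmed and result are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["xxxxxxxx20yyyy", "z" * 14, "wwwwwwww20vvvv"],
        ["k" * 14, "dddddddd20ffff", "a" * 14],
    ],
    columns=list("ABC"),
)

# Boolean frame: True where the cell contains "20".
has_20 = df.apply(lambda col: col.str.contains("20", na=False))
# Frame with every cell cut to its first 10 characters.
trimmed = df.apply(lambda col: col.str[:10])
# Take the trimmed cell where the mask is True, the original cell elsewhere.
result = df.mask(has_20, trimmed)

print(result)
```

This version leaves df unchanged and does not convert missing values to the string 'nan', at the cost of one str call per column instead of a single array operation.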
