Conditional editing of strings in a Pandas DataFrame

Question

I am learning Pandas and have a DataFrame of strings which looks a little like this:

df = pd.DataFrame([['Apple', 'Med6g7867'], ['Orange', 'Med7g8976'], ['Banana', 'Signal'], ['Peach', 'Med8g8989'], ['Mango', 'Possible result %gggyy']], columns=['A', 'B'])
df
    A       B
0   Apple   Med6g7867
1   Orange  Med7g8976
2   Banana  Signal
3   Peach   Med8g8989
4   Mango   Possible result %gggyy

Note column B has two types of value, either a unique identifier of the form MedXgXXXX or a descriptive string. I would like to do two related things.

Substitute all the values of B with the unique identifier to NaN
Retain the descriptive string but truncate any that have a % sign so that I only retain the string prior to the % sign.

I would like a table like this:

    A       B
0   Apple   NaN
1   Orange  NaN
2   Banana  Signal
3   Peach   NaN
4   Mango   Possible result

Currently I can subset the table like so:

df[df['B'].str.contains("Med")]
df[df['B'].str.contains("%")]

but no implementation of replace() I try allows me to do this.

Any help appreciated.

unutbu · Accepted Answer · 2016-08-28 17:09:35Z

import pandas as pd
df = pd.DataFrame([['Apple', 'Med6g7867'],
                   ['Orange', 'Med7g8976'],
                   ['Banana', 'Signal'],
                   ['Peach', 'Med8g8989'],
                   ['Mango', 'Possible result %gggyy']],
                  columns=['A', 'B'])

df['B'] = df['B'].str.extract(r'(?:^Med.g.{4})|([^%]+)', expand=False)
print(df)

yields

        A                 B
0   Apple               NaN
1  Orange               NaN
2  Banana            Signal
3   Peach               NaN
4   Mango  Possible result

The regex pattern has the following meaning:

(?:            # start a non-capturing group
  ^            # match the start of the string
  Med          # match the literal string Med
  .            # followed by any character
  g            # a literal g
  .{4}         # followed by any 4 characters
)              # end the non-capturing group
|              # OR
(              # start a capturing group
  [^%]+        # 1-or-more of any characters except %
)              # end capturing group

If the value in the B column starts with a unique indentifier of the form MedXgXXXX then the non-capturing group will be matched. Since str.extract only returns the value from capturing groups, the Series returned by str.extract will have a NaN at this location.

If instead the capturing group is matched, then str.extract will return the matched value.

Nehal J Wani · Accepted Answer · 2016-08-28 09:43:17Z

2

You can apply replace twice like this:

In [460]: df
Out[460]: 
        A                       B
0   Apple               Med6g7867
1  Orange               Med7g8976
2  Banana                  Signal
3   Peach               Med8g8989
4   Mango  Possible result %gggyy

In [461]: df.replace(r'Med\dg\d{4}', np.nan, regex=True).replace(r'\s+%.*', '', regex=True)
Out[461]: 
        A                B
0   Apple              NaN
1  Orange              NaN
2  Banana           Signal
3   Peach              NaN
4   Mango  Possible result

answered Aug 28, 2016 at 9:43

Nehal J Wani

16.7k3 gold badges72 silver badges93 bronze badges

Collectives™ on Stack Overflow

Conditional editing of strings in a Pandas DataFrame

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related