2

I am learning Pandas and have a DataFrame of strings which looks a little like this:

df = pd.DataFrame([['Apple', 'Med6g7867'], ['Orange', 'Med7g8976'], ['Banana', 'Signal'], ['Peach', 'Med8g8989'], ['Mango', 'Possible result %gggyy']], columns=['A', 'B'])
df
    A       B
0   Apple   Med6g7867
1   Orange  Med7g8976
2   Banana  Signal
3   Peach   Med8g8989
4   Mango   Possible result %gggyy

Note column B has two types of value, either a unique identifier of the form MedXgXXXX or a descriptive string. I would like to do two related things.

  1. Substitute all the values of B with the unique identifier to NaN
  2. Retain the descriptive string but truncate any that have a % sign so that I only retain the string prior to the % sign.

I would like a table like this:

    A       B
0   Apple   NaN
1   Orange  NaN
2   Banana  Signal
3   Peach   NaN
4   Mango   Possible result

Currently I can subset the table like so:

df[df['B'].str.contains("Med")]
df[df['B'].str.contains("%")]

but no implementation of replace() I try allows me to do this.

Any help appreciated.

2 Answers 2

2
import pandas as pd
df = pd.DataFrame([['Apple', 'Med6g7867'],
                   ['Orange', 'Med7g8976'],
                   ['Banana', 'Signal'],
                   ['Peach', 'Med8g8989'],
                   ['Mango', 'Possible result %gggyy']],
                  columns=['A', 'B'])

df['B'] = df['B'].str.extract(r'(?:^Med.g.{4})|([^%]+)', expand=False)
print(df)

yields

        A                 B
0   Apple               NaN
1  Orange               NaN
2  Banana            Signal
3   Peach               NaN
4   Mango  Possible result 

The regex pattern has the following meaning:

(?:            # start a non-capturing group
  ^            # match the start of the string
  Med          # match the literal string Med
  .            # followed by any character
  g            # a literal g
  .{4}         # followed by any 4 characters
)              # end the non-capturing group
|              # OR
(              # start a capturing group
  [^%]+        # 1-or-more of any characters except %
)              # end capturing group

If the value in the B column starts with a unique indentifier of the form MedXgXXXX then the non-capturing group will be matched. Since str.extract only returns the value from capturing groups, the Series returned by str.extract will have a NaN at this location.

If instead the capturing group is matched, then str.extract will return the matched value.

Sign up to request clarification or add additional context in comments.

Comments

2

You can apply replace twice like this:

In [460]: df
Out[460]: 
        A                       B
0   Apple               Med6g7867
1  Orange               Med7g8976
2  Banana                  Signal
3   Peach               Med8g8989
4   Mango  Possible result %gggyy

In [461]: df.replace(r'Med\dg\d{4}', np.nan, regex=True).replace(r'\s+%.*', '', regex=True)
Out[461]: 
        A                B
0   Apple              NaN
1  Orange              NaN
2  Banana           Signal
3   Peach              NaN
4   Mango  Possible result

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.