How to replace partial strings within a df column

Question

I have the following table. Some entries in the StockCode column have letters at the end. How do I replace these with a number?

I could make a dictionary to map each letter to a number, but I feel there's a quicker way to do so.

   InvoiceNo StockCode
0        536     85123
1        536    71053Z
2        536    84406B
3        536    22623S

With what number exactly? Or should it be a random number?
– Ch3steR, Commented Nov 28, 2020 at 16:51
A dictionary to map the replacements is a good idea. Then write a little function to do the replacement and use apply() to map the function to each value in the column. pandas.pydata.org/pandas-docs/stable/reference/api/…
– Matt L., Commented Nov 28, 2020 at 16:57
I understood that, but what number do you want to use as the replacement? Say you want to replace Z: which number should replace it? 1? 2? 3? 4? Or do you want to replace it with any random number?
– Ch3steR, Commented Nov 28, 2020 at 17:18
@Ch3steR Ah, my bad. Random is fine as long as it's consistent, so Z should always be replaced with the same value.
– Zizi96, Commented Nov 28, 2020 at 17:24
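
The approach Matt L. suggests in the comments (a dictionary, a small function, and apply()) could look roughly like the sketch below. The mapping values and the function name replace_trailing_letter are hypothetical, chosen only for illustration, and the sketch assumes at most one trailing uppercase letter per code.

```python
import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053Z', '84406B', '22623S'],
})

# Hypothetical mapping: the digits are arbitrary, picked only for illustration.
letter_to_digit = {'Z': '1', 'B': '2', 'S': '3'}

def replace_trailing_letter(code):
    """Swap a mapped trailing letter for its digit; leave other codes unchanged."""
    last = code[-1:]  # slicing avoids an IndexError on an empty string
    if last in letter_to_digit:
        return code[:-1] + letter_to_digit[last]
    return code

df['StockCode'] = df['StockCode'].apply(replace_trailing_letter)
print(df)
```

Because the dictionary is fixed, the same letter is always replaced with the same digit, which is the consistency the question asks for.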

Ch3steR · Accepted Answer · 2020-11-28 18:34:32Z

2

You can use pd.Series.str.replace, which is similar to re.sub: you can pass a function as the repl parameter.

import pandas as pd
from random import randint
from string import ascii_uppercase

mapping = {char:str(randint(0, 9)) for char in ascii_uppercase} 
#{'A': '1', 'B': '3', 'C': '1', 'D': '1', 'E': '4', 'F': '9', 'G': '9', 'H': '5',
#'I': '8', 'J': '0', 'K': '0', 'L': '1', 'M': '5', 'N': '3', 'O': '8', 'P': '5', 
#'Q': '5', 'R': '8', 'S': '9', 'T': '8', 'U': '0', 'V': '8', 'W': '1', 'X': '6', 
#'Y': '7', 'Z': '5'}

def repl(match):
    return ''.join(mapping[char] for char in match.group(0))

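# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl, regex=True)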
df

   InvoiceNo StockCode
0        536     85123
1        536    710535  # 'Z' -> 5
2        536    844063  # 'B' -> 3
3        536    226239  # 'S' -> 9

When there is more than one letter at the end:

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053ZAZ', '84406BAR', '22623BIR']
})
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl, regex=True)

   InvoiceNo  StockCode
0        536      85123
1        536   71053515  # 'ZAZ' -> '515'
2        536   84406318  # 'BAR' -> '318'
3        536   22623388  # 'BIR' -> '388'

The pattern r"[A-Z]{,3}$" matches up to three uppercase letters at the end of the string; see regex101 for a breakdown.
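
To see what the pattern actually matches, here is a small sketch using Python's re module directly, outside pandas. The fixed three-entry mapping is taken from the example output in this answer; the code '1ABCD' is a made-up input to show the three-letter limit.

```python
import re

pattern = re.compile(r"[A-Z]{,3}$")  # zero to three uppercase letters, anchored at the end

# Show which part of each code the pattern matches.
for code in ['85123', '71053Z', '71053ZAZ', '1ABCD']:
    matched = pattern.search(code).group(0)
    print(f"{code!r}: matched {matched!r}")

# Fixed digits copied from the example mapping in the answer.
mapping = {'Z': '5', 'B': '3', 'S': '9'}

def repl(match):
    # Translate every matched letter; an empty match yields an empty string.
    return ''.join(mapping[char] for char in match.group(0))

result = pattern.sub(repl, '71053Z')
print(result)
```

Since {,3} also allows zero letters, a code such as '85123' produces an empty match at the end and passes through unchanged. For '1ABCD' only the last three letters are matched, so a longer suffix would be only partly replaced.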

edited Nov 28, 2020 at 18:34

answered Nov 28, 2020 at 18:18

Ch3steR


1 Comment

Zizi96 Over a year ago

This method can assign the same digit to different letters; check A and C in the mapping above, for example. I fixed that by combining Ben's edit with your repl. Thanks for your help: I got a robust solution and learnt something along the way.
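
The exact code the commenter ended up with is not shown. A minimal sketch of the combination described, assuming it means the enumerated dictionary from Ben's answer plus the repl function from this one, might be:

```python
import pandas as pd
from string import ascii_uppercase

# One distinct value per letter: 'A' -> '0', 'B' -> '1', ..., 'Z' -> '25'.
mapping = {char: str(i) for i, char in enumerate(ascii_uppercase)}

def repl(match):
    # Translate each matched letter and join the results.
    return ''.join(mapping[char] for char in match.group(0))

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053Z', '84406B', '22623BIR'],
})

# regex=True is needed for a callable replacement in recent pandas versions.
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl, regex=True)
print(df)
```

Every letter now has its own value, so there are no repeats in the dictionary. One caveat: because the values differ in length, different suffixes can still concatenate to the same digits ('BA' and 'K' both become '10'). Zero-padding each value to two digits would avoid that.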

Ben (edited by Zizi96) · Accepted Answer · 2020-11-28 20:34:48Z

1

This should do the trick.

import pandas as pd

# Make some data
df = pd.DataFrame({
    'InvoiceNo':536,
    'StockCode':['85123', '71053Z', '84406B', '22623S']
})

# Define replacements in a dictionary. (Note the values are strings, not ints)
replacements = {'Z':'1', 'B':'2', 'S':'3'}

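#-- Edit per OP's comment ---------------
# Generate the dictionary instead of typing it out: 'A' -> '0', 'B' -> '1', ..., 'Z' -> '25'.
# This overrides the three-entry dictionary above.
import string
keys = list(string.ascii_uppercase)
values = [str(i) for i in range(len(keys))]
replacements = dict(zip(keys, values))
#---------------------------------------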

# Rebuild StockCode
# 1) .str.extract(r'^(\d+)', expand=False) extracts the leading run of digits, up to the first non-digit character
# 2) .str.extract(r'([A-Z])$', expand=False) extracts the single uppercase letter at the end of each string (NaN if there is none)
# 3) .replace(replacements).fillna('') makes the replacements and then changes NaN to ''
# 4) adding two series of strings concatenates them
df['StockCode'] = (df.StockCode.str.extract(r'^(\d+)', expand=False) +
                   df.StockCode.str.extract(r'([A-Z])$', expand=False).replace(replacements).fillna(''))

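print(df)
# Output when using the three-entry dictionary {'Z':'1', 'B':'2', 'S':'3'}
# (the generated dictionary would instead give 7105325, 844061 and 2262318):
   InvoiceNo StockCode
0        536     85123
1        536    710531
2        536    844062
3        536    226233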

The key to this is understanding regular expressions.

edited Nov 28, 2020 at 20:34

Zizi96


answered Nov 28, 2020 at 17:00

Ben


3 Comments

Zizi96 Over a year ago

Thanks Ben, is there any way to generate that replacement dictionary on the fly or do I have to manually assign the entire alphabet to a number?

Ben Over a year ago

I thought you wanted a more dynamic replacement mechanism. If a fixed value is enough, just use df['StockCode'] = df.StockCode.str.replace('([A-Z])$', '999', regex=True) or whatever number you want to use.

Zizi96 Over a year ago

Sorry, I was unclear. I do want a dynamic replacement, but I would like to create the replacement dictionary dynamically as well. Is there a method that doesn't require writing out a dictionary containing all the letters, i.e. replacements = {'A':'1', 'B':'2', 'C':'3'...'Z':'26'}? This would be ideal for managing a large df with various combinations of substrings at the end.
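
One possible way to avoid writing out any alphabet at all is to build the dictionary from the suffixes that actually occur in the column. This is a sketch under assumptions, not something proposed in the thread: it treats any run of trailing uppercase letters as one suffix, and the name suffix_to_number and the extra row '90001Z' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053Z', '84406B', '22623S', '90001Z'],
})

# Collect the distinct trailing-letter suffixes, in order of first appearance.
suffixes = df['StockCode'].str.extract(r'([A-Z]+)$', expand=False).dropna().unique()

# Number the suffixes starting at 1 (an arbitrary but consistent choice).
suffix_to_number = {suffix: str(i) for i, suffix in enumerate(suffixes, start=1)}

# Replace each suffix with its number; codes without a suffix are left alone.
df['StockCode'] = df['StockCode'].str.replace(
    r'[A-Z]+$',
    lambda match: suffix_to_number[match.group(0)],
    regex=True,
)
print(df)
```

The numbering depends on the order in which suffixes appear, so the dictionary would need to be saved and reused if the same replacements must hold across different DataFrames.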
