I have the following table. Some entries in the StockCode column have letters at the end. How do I replace these with numbers?

I could make a dictionary to map each letter to a number, but I feel there is a quicker way to do so.

   InvoiceNo StockCode
0        536     85123
1        536    71053Z
2        536    84406B
3        536    22623S
  • With what number exactly? Or should it be a random number? Commented Nov 28, 2020 at 16:51
  • A dictionary to map the replacements is a good idea. Then write a little function to do the replacement and use apply() to map the function to each value in the column. pandas.pydata.org/pandas-docs/stable/reference/api/… Commented Nov 28, 2020 at 16:57
  • @Ch3steR One number corresponding to each letter Commented Nov 28, 2020 at 17:11
  • I understood that, but which number do you want to use? Say you want to replace z: with what number, 1? 2? 3? 4? Or do you want to replace it with any random number? Commented Nov 28, 2020 at 17:18
  • @Ch3steR Ah my bad, random is fine as long as it's consistent, so z should always be replaced with the same value. Commented Nov 28, 2020 at 17:24
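
The dictionary-plus-apply() approach suggested in the comments could look roughly like this. It is only a minimal sketch: the helper name replace_suffix and the numbering 'A' -> 1 through 'Z' -> 26 are assumptions made for illustration, not something the question fixes.

```python
import string

import pandas as pd

# Assumed mapping for illustration: 'A' -> '1', 'B' -> '2', ..., 'Z' -> '26'
mapping = {letter: str(i) for i, letter in enumerate(string.ascii_uppercase, start=1)}

def replace_suffix(code):
    """Replace the trailing uppercase letters of `code` using `mapping` (hypothetical helper)."""
    stem = code.rstrip(string.ascii_uppercase)   # part of the code without trailing letters
    suffix = code[len(stem):]                    # the trailing letters, possibly empty
    return stem + ''.join(mapping[ch] for ch in suffix)

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053Z', '84406B', '22623S'],
})
# apply() calls replace_suffix once per value in the column
df['StockCode'] = df['StockCode'].apply(replace_suffix)
print(df)
```

Because the mapping is a fixed dictionary, a given letter is always replaced by the same number, which is the consistency asked for above.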

2 Answers

You can use pd.Series.str.replace, which works like re.sub: you can pass a function to the repl parameter.

from random import randint
from string import ascii_uppercase

# Note: randint can give several letters the same digit (below, 'A', 'C' and 'D' all map to '1')
mapping = {char: str(randint(0, 9)) for char in ascii_uppercase}
#{'A': '1', 'B': '3', 'C': '1', 'D': '1', 'E': '4', 'F': '9', 'G': '9', 'H': '5',
#'I': '8', 'J': '0', 'K': '0', 'L': '1', 'M': '5', 'N': '3', 'O': '8', 'P': '5', 
#'Q': '5', 'R': '8', 'S': '9', 'T': '8', 'U': '0', 'V': '8', 'W': '1', 'X': '6', 
#'Y': '7', 'Z': '5'}

def repl(match):
    return ''.join(mapping[char] for char in match.group(0))

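# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal by default
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl, regex=True)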
df

  InvoiceNo StockCode
0        536     85123
1        536    710535  # 'Z' -> 5
2        536    844063  # 'B' -> 3
3        536    226239  # 'S' -> 9

When there is more than one letter at the end:

df = pd.DataFrame({
     'InvoiceNo':536,
     'StockCode':['85123', '71053ZAZ', '84406BAR', '22623BIR']
    })
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]{,3}$", repl, regex=True)

   InvoiceNo  StockCode
0        536      85123
1        536   71053515  # 'ZAZ' -> '515'
2        536   84406318  # 'BAR' -> '318'
3        536   22623388  # 'BIR' -> '388'

1 Comment

This method can assign the same digit to several letters in the mapping; check A and C, for example. I fixed that by using Ben's edit combined with your repl. Thanks for your help: I got a robust solution and learnt something along the way.
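
The combination described in this comment might look like the following sketch. It assumes a mapping built by enumerating the alphabet, so no two letters share a value, and it zero-pads each value to two digits; the padding is an added choice, not part of either answer, so that multi-letter suffixes stay unambiguous. regex=True is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
from string import ascii_uppercase

import pandas as pd

# Unique two-digit code per letter: 'A' -> '01', ..., 'Z' -> '26' (zero-padding is an assumption)
mapping = {char: f"{i:02d}" for i, char in enumerate(ascii_uppercase, start=1)}

def repl(match):
    # Translate every letter of the matched suffix
    return ''.join(mapping[char] for char in match.group(0))

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053ZAZ', '84406BAR', '22623BIR'],
})
# [A-Z]+$ matches one or more uppercase letters at the end of the string
df['StockCode'] = df['StockCode'].str.replace(r"[A-Z]+$", repl, regex=True)
print(df)
```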

This should do the trick.

import pandas as pd

# Make some data
df = pd.DataFrame({
    'InvoiceNo':536,
    'StockCode':['85123', '71053Z', '84406B', '22623S']
})

# Define replacements in a dictionary. (Note the values are strings, not ints)
replacements = {'Z':'1', 'B':'2', 'S':'3'}

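#-- Edit per OP's comment ---------------
# Builds 'A' -> '0', 'B' -> '1', ..., 'Z' -> '25'.
# The printed output below reflects the three-entry dictionary above;
# with this full mapping, 'Z' -> '25', 'B' -> '1', 'S' -> '18'.
import string
keys = list(string.ascii_uppercase)
values = [str(i) for i in range(len(keys))]
replacements = dict(zip(keys, values))
#---------------------------------------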

# Rebuild StockCode
# 1) df.StockCode.str.extract(r'^(\d+)', expand=False) extracts the leading run of digits, up to the first non-digit character
# 2) df.StockCode.str.extract(r'([A-Z])$', expand=False) extracts the single uppercase letter at the end of each string (NaN if there is none)
# 3) .replace(replacements).fillna('') makes the replacements and then changes NaN to ''
# 4) adding two Series of strings concatenates them element-wise
df['StockCode'] = (df.StockCode.str.extract(r'^(\d+)', expand=False) +
                   df.StockCode.str.extract(r'([A-Z])$', expand=False).replace(replacements).fillna(''))

print(df)
   InvoiceNo StockCode
0        536     85123
1        536    710531
2        536    844062
3        536    226233

The key to this is understanding regular expressions.
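
To see what the two patterns used in this answer capture, here is a small sketch with the standard re module. The helper name split_code is hypothetical and only serves the illustration.

```python
import re

codes = ['85123', '71053Z', '84406B', '22623S']

def split_code(code):
    """Return (leading digits, trailing letter or None) for one stock code."""
    digits = re.search(r'^(\d+)', code)      # ^ anchors at the start; \d+ is a run of digits
    letter = re.search(r'([A-Z])$', code)    # $ anchors at the end; [A-Z] is one uppercase letter
    return digits.group(1), (letter.group(1) if letter else None)

parts = [split_code(code) for code in codes]
for code, (digits, letter) in zip(codes, parts):
    print(code, digits, letter)
```

A code without a trailing letter yields None for the second pattern, which corresponds to the NaN that the answer turns into '' with fillna('').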

3 Comments

Thanks Ben, is there any way to generate that replacement dictionary on the fly or do I have to manually assign the entire alphabet to a number?
I thought you wanted a more dynamic replacement mechanism. In that case, just use df['StockCode'] = df.StockCode.str.replace(r'([A-Z])$', '999', regex=True) or whatever number you want to use.
Sorry I was unclear. I do want a dynamic replacement, but I would like to create the replacement dictionary dynamically. Is there a method that doesn't require writing out a dictionary containing all the letters, i.e. replacements = {'A':'1', 'B':'2', 'C':'3', ..., 'Z':'26'}? This would be ideal for managing a large df with various combinations of substrings at the end.
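
One possible way to avoid writing out the dictionary, offered as a sketch and not as a definitive answer: derive the number from each letter's position in the alphabet with ord() inside the replacement function. The function name letter_to_number and the numbering 'A' -> 1 through 'Z' -> 26 are assumptions.

```python
import pandas as pd

def letter_to_number(match):
    # ord('A') is 65, so 'A' -> '1', 'B' -> '2', ..., 'Z' -> '26'
    return ''.join(str(ord(char) - ord('A') + 1) for char in match.group(0))

df = pd.DataFrame({
    'InvoiceNo': 536,
    'StockCode': ['85123', '71053Z', '84406B', '22623S'],
})
# [A-Z]+$ matches any run of trailing uppercase letters; regex=True enables the callable repl
df['StockCode'] = df['StockCode'].str.replace(r'[A-Z]+$', letter_to_number, regex=True)
print(df)
```

Since the number is computed from the letter itself, the same letter always produces the same value and no lookup table has to be maintained.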
