I'm having trouble replacing a string in pandas correctly. I'm not sure whether this can be done with pandas alone.

This is how my dataframe looks:

 (ID: 10)              247333605                      0.0  
  (ID: 20)               36738870                      0.0  
  (ID: 40)             4668036427                      0.0  
  (ID: 50)             1918647972                      0.0  
  (ID: 60)             4323165902                  44125.0  
  (ID: 80)              145512255                      0.0  
 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0  

Each bare (ID: x) should be matched to the labelled entry with the same ID, such as Assigned (ID: x) or Cancelled (ID: x).

Using lines similar to this line:

input_data['last_status'] = input_data.last_status.str.replace('(ID: 10)', 'Planning (ID: 10)')

My output is:

(Assigned (ID: 40))                                0.0  
  (Cancelled (ID: 80))                               0.0  
  (Planned (ID: 20))                                 0.0  
  (Planning (ID: 10))                                0.0  
  (Writing (ID: 50))                                 0.0  
  (Written (ID: 60))                             44125.0  
 Assigned (Assigned (ID: 40))                        0.0  
 Assigned (ID: 30)                                   0.0  
 Assigning (ID: 30)                                  0.0  
 Cancelled (Assigned (ID: 40))                       0.0  
 Cancelled (Cancelled (ID: 80))                      0.0  
 Planned (Planned (ID: 20))                          0.0  
 Planning (Planning (ID: 10))                        0.0  
 Processed (ID: 70)                                  0.0  
 Revoked (Writing (ID: 50))                          0.0  
 Writing (Writing (ID: 50))                          0.0  
 Written (Written (ID: 60))                    1018230.0  

As you can see, every (ID: x) got replaced, including the ones that already had a label, and the result still doesn't match the correct term.

My ideal dataframe would look like this:

 Assigned (ID: 30)       42050340                      0.0  
 Assigned (ID: 40)   130880371376                      0.0  
 Assigning (ID: 30)    1095844753                      0.0  
 Cancelled (ID: 40)        937280                      0.0  
 Cancelled (ID: 80)   16857720813                      0.0  
 Planned (ID: 20)      9060392597                      0.0  
 Planning (ID: 10)   108484297031                      0.0  
 Processed (ID: 70)  133289880880                      0.0  
 Revoked (ID: 50)      2411903072                      0.0  
 Writing (ID: 50)    146408550024                      0.0  
 Written (ID: 60)    139458227923                1018230.0 

I'm bound to pandas because the dataset is huge; I have a different implementation, but it takes days to run. Is there a way to do this directly in pandas?

I've never asked a question on Stack Overflow before. I hope my question is clear.

1 Answer

If you want to generalise, you can use str.replace with start-of-line and end-of-line anchors (^ and $), so that only values consisting of nothing but (ID: x) are matched.

df['last_status'].str.replace(r'^(\(ID: \d+\))$', r'Planning: \1', regex=True)

0     Planning: (ID: 10)
1     Planning: (ID: 20)
2     Planning: (ID: 40)
3     Planning: (ID: 50)
4     Planning: (ID: 60)
5     Planning: (ID: 80)
6      Assigned (ID: 30)
7      Assigned (ID: 40)
8     Assigning (ID: 30)
9     Cancelled (ID: 40)
10    Cancelled (ID: 80)
11      Planned (ID: 20)
12     Planning (ID: 10)
13    Processed (ID: 70)
14      Revoked (ID: 50)
15      Writing (ID: 50)
16      Written (ID: 60)
Name: last_status, dtype: object

If you want to replace a specific ID only, change your regex to:

r'^(\(ID: 10\))$'

Or,

r'^(\(ID: {}\))$'.format(number)

Where number is a variable that holds the ID value to perform replacement on.
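
As a rough sketch of how the per-ID variant could be put together: the Series below is a small made-up sample, and the labels dictionary is a hypothetical mapping (the question does not say which label each bare ID should get). Note that the parentheses are escaped; in a regex, unescaped parentheses form a group, which is presumably why the original call rewrote the text inside every pair of brackets.

```python
import pandas as pd

# Small made-up sample; the real column is assumed to be called last_status.
s = pd.Series(
    ["(ID: 10)", "(ID: 20)", "Assigned (ID: 30)", "Planning (ID: 10)"],
    name="last_status",
)

# Hypothetical mapping from ID to label; adjust to whatever is correct for your data.
labels = {10: "Planning", 20: "Planned"}

out = s
for number, label in labels.items():
    # ^ and $ restrict the match to values that are exactly "(ID: <number>)".
    pattern = r'^(\(ID: {}\))$'.format(number)
    # \1 re-inserts the captured "(ID: <number>)" after the label.
    out = out.str.replace(pattern, r'{} \1'.format(label), regex=True)

print(out.tolist())
```

Values that already carry a label are left alone because the anchored pattern does not match them.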

3 Comments

dude Nice one :-)
I have tried all the options, but they do not seem to have any effect on my dataset. (ID: x) has a space in front of it, does that affect the regex?
input_data['last_status'].str.find(r'^((ID: \d+))$'), gives no results... nor do the other regex inputs
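
Regarding the leading space: yes, with the ^ anchor a leading space prevents a match. Also, str.find does a literal substring search rather than a regex search, so it is not a reliable way to test the pattern; str.contains is. A minimal sketch on made-up values, assuming the only difference from the answer's data is surrounding whitespace:

```python
import pandas as pd

# Made-up values with leading spaces, as described in the comment.
s = pd.Series([" (ID: 10)", "  (ID: 20)", " Assigned (ID: 30)"], name="last_status")

# str.find looks for the literal text of the pattern, so it returns -1 everywhere.
found_literal = s.str.find(r'^(\(ID: \d+\))$')

# The strict pattern fails because of the leading whitespace.
strict = s.str.contains(r'^\(ID: \d+\)$', regex=True)

# Allowing optional whitespace around the value makes the bare IDs match.
tolerant = s.str.contains(r'^\s*\(ID: \d+\)\s*$', regex=True)

# Alternatively, strip the whitespace first and keep the original pattern.
cleaned = s.str.strip().str.replace(r'^(\(ID: \d+\))$', r'Planning: \1', regex=True)

print(cleaned.tolist())
```

Stripping first also normalises the labelled values, which may or may not be what you want for the rest of the column.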
