I have a column of unique string IDs, some of which start with digits and symbols. I would like to find those rows and delete the first 9 characters from them only. So far I have:

if '.20_P' in df['ID']:
     df['ID']= df['ID']str.slice[: 9]

where I would like it to take this:

df['ID'] = 
2.2.2020_P18dhwys
2.1.2020_P18dh234
2.4.2020_P18dh229
P18dh209
P18dh219
2.5.2020_P18dh289

and turn it into this:

df['ID'] = 
P18dhwys
P18dh234
P18dh229
P18dh209
P18dh219
P18dh289
  • The Series.str.extract() approach will be faster than apply'ing a lambda. Commented Feb 10, 2020 at 20:30

3 Answers

Do a conditional row-wise apply to the same column:

df['ID'] = df.apply(lambda row: row['ID'][9:] if '.20_P' in row['ID'] else row['ID'], axis=1)
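
If the frame is large, the same conditional edit can be sketched with a boolean mask instead of a row-wise apply. This is only a sketch under one assumption that is not in the question: a prefixed ID is taken to be one that starts with a digit. The literal text '.20_P' does not occur in the sample rows as printed (they contain '2020_P'), so the test may need adjusting for the real data.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys',
    '2.1.2020_P18dh234',
    'P18dh209',
    '2.5.2020_P18dh289',
]})

# Assumption (not from the question): a prefixed ID starts with a digit.
# str.match anchors the pattern at the start of each string.
has_prefix = df['ID'].str.match(r'\d')

# Keep everything after the first 9 characters, only on the masked rows.
df.loc[has_prefix, 'ID'] = df.loc[has_prefix, 'ID'].str[9:]

result = df['ID'].tolist()
print(result)  # ['P18dhwys', 'P18dh234', 'P18dh209', 'P18dh289']
```

Rows where the mask is False are left untouched, which is the "only those rows" part of the question.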

2 Comments

This worked, thank you. Can you explain the lambda function and what it is doing?
The lambda expression is called once per row and returns the new value of the field 'ID' based on that same row. The slice [9:] keeps everything after the first 9 characters. This is just the logic you put in the question, reformatted to be one line: row['ID'][9:] if '.20_P' in row['ID'] else row['ID']
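
To make the lambda easier to read, here is roughly the same logic written as a named function. The names strip_prefix and MARKER are made up for illustration, and the marker is shortened to '20_P' as an assumption so that it also occurs in the sample rows shown in the question (which contain '2020_P').

```python
import pandas as pd

# Hypothetical marker; adjust it to whatever identifies a prefixed ID in the real data.
MARKER = '20_P'

def strip_prefix(row):
    """Return the row's ID without its first 9 characters if it contains MARKER."""
    value = row['ID']
    if MARKER in value:
        # Slicing from index 9 drops the first 9 characters.
        return value[9:]
    # IDs without the marker are returned unchanged.
    return value

df = pd.DataFrame({'ID': ['2.2.2020_P18dhwys', 'P18dh209', '2.5.2020_P18dh289']})

# axis=1 passes each row to the function; the returned values form the new column.
df['ID'] = df.apply(strip_prefix, axis=1)

print(df['ID'].tolist())  # ['P18dhwys', 'P18dh209', 'P18dh289']
```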
You could also use a regular expression to find your substring.

The regular expression here works as follows: the capture group () matches one or more (+) word characters (\w, meaning letters, digits and underscore; \d is already covered by \w, so [\d\w] is equivalent to \w). The group may be preceded by any number (*) of digits, dots or plus signs ([\d+\.]; inside a character class the + is a literal plus sign), followed by an optional (?) underscore _. The string methods are usually faster than a row-wise .apply(), so if you have a lot of data, or do this often, this is something you might want to consider.

import pandas as pd

df = pd.DataFrame({'A': [
    '2.2.2020_P18dhwys',
    '2.1.2020_P18dh234',
    '2.4.2020_P18dh229',
    'P18dh209',
    'P18dh219',
    '2.5.2020_P18dh289',
]})

print(df['A'].str.extract(r'[\d+\.]*_?([\d\w]+)'))

Output:

          0
0  P18dhwys
1  P18dh234
2  P18dh229
3  P18dh209
4  P18dh219
5  P18dh289
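
To write the result back into the original column rather than print it, something like the following should work. It uses the 'ID' column name from the question instead of 'A', and passes expand=False so that a single capture group comes back as a Series rather than a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys',
    '2.1.2020_P18dh234',
    '2.4.2020_P18dh229',
    'P18dh209',
    'P18dh219',
    '2.5.2020_P18dh289',
]})

# expand=False returns a Series for a single capture group,
# so it can be assigned straight back to the column.
df['ID'] = df['ID'].str.extract(r'[\d+\.]*_?([\d\w]+)', expand=False)

print(df['ID'].tolist())
```

Note that rows where the pattern finds no match come back as NaN, so it is worth checking df['ID'].isna().any() on real data.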


If you know that the string to remove is a prefix added with underscore, you could do

df['ID'] = df['ID'].apply(lambda x: x.split('_')[-1])

1 Comment

Thank you for this, but the real IDs contain several '_' characters.
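
For IDs that contain several underscores, one possible workaround is to remove only a prefix anchored at the start of the string instead of splitting on '_'. The sketch below assumes the prefix always looks like digits.digits.digits followed by an underscore, as in the question's sample. The IDs with extra underscores are invented for illustration.

```python
import pandas as pd

# Hypothetical IDs containing several underscores (not from the question).
df = pd.DataFrame({'ID': [
    '2.2.2020_P18_dh_wys',
    'P18_dh_209',
    '2.5.2020_P18dh289',
]})

# ^ anchors the pattern at the start, so only a leading
# "digits.digits.digits_" prefix is removed; later underscores are kept.
df['ID'] = df['ID'].str.replace(r'^\d+\.\d+\.\d+_', '', regex=True)

print(df['ID'].tolist())  # ['P18_dh_wys', 'P18_dh_209', 'P18dh289']
```

Because the pattern matches the shape of the prefix rather than a fixed length, it would also cope with prefixes such as 12.10.2020_ that are longer than 9 characters.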
