Finding and deleting sub-strings in dataframe column Python

Question

I would like to find all the rows in a column whose unique ID string starts with digits and symbols. Once they have been identified, I would like to delete the first 9 characters for those rows only. So far I have:

if '.20_P' in df['ID']:
     df['ID']= df['ID']str.slice[: 9]

where I would like it to take this:

df['ID'] = 
2.2.2020_P18dhwys
2.1.2020_P18dh234
2.4.2020_P18dh229
P18dh209
P18dh219
2.5.2020_P18dh289

and turn it into this:

df['ID'] = 
P18dhwys
P18dh234
P18dh229
P18dh209
P18dh219
P18dh289

The Series.str.extract() approach will be faster than apply'ing a lambda. — smci
– smci, Commented Feb 10, 2020 at 20:30

Dave · Accepted Answer · 2020-02-10 20:20:04Z

1

Do a conditional row-wise apply to the same column:

df['ID'] = df.apply(lambda row: row['ID'][9:] if '.20_P' in row['ID'] else row['ID'], axis=1)

edited Feb 10, 2020 at 20:20

answered Feb 10, 2020 at 19:40

Dave


2 Comments

Ramy Saad Over a year ago

this worked, thank you. can you explain the lambda function & what this is doing ?

Dave Over a year ago

The lambda expression sets the value of the field 'ID' based on the values in the same row. This is just the logic you put in the question, reformatted to be one line: row['ID'][9:] if '.20_P' in row['ID'] else row['ID']
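
Spelled out as a named function, the lambda's logic might look like the sketch below. The helper name strip_prefix and the two constants are hypothetical. Note that the sample rows as posted contain '2020_P' rather than '.20_P', so this sketch assumes the marker '20_P', and it assumes the prefix (e.g. '2.2.2020_') is always exactly 9 characters long.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys',
    'P18dh209',
    '2.5.2020_P18dh289',
]})

MARKER = '20_P'   # assumption: marks a prefixed row in the sample data
PREFIX_LEN = 9    # assumption: the prefix is always 9 characters long

def strip_prefix(row):
    # Hypothetical helper: the one-line conditional written out in full.
    value = row['ID']
    if MARKER in value:
        # Drop the first PREFIX_LEN characters and keep the rest.
        return value[PREFIX_LEN:]
    # Rows without the marker are returned unchanged.
    return value

# axis=1 calls the function once per row.
df['ID'] = df.apply(strip_prefix, axis=1)
print(df['ID'].tolist())
```

If the prefix length can vary, a fixed slice like this would cut in the wrong place, so a pattern-based approach may be safer.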

BStadlbauer · Accepted Answer · 2020-02-10 19:56:53Z

1

You could also use a regular expression to find your substring.

The regular expression here works as follows: the capture group ( ) matches one or more (+) characters from the class [\d\w], i.e. digits (\d) or word characters (\w: letters, digits, and underscore). It may be preceded by an optional run (*) of digits, dots, or plus signs ([\d+\.]; inside a character class the + is a literal character) and an optional (?) trailing underscore _. The built-in string methods are also typically faster than a row-wise .apply(), so if you have a lot of data, or do this often, this is something you might want to consider.

import pandas as pd

df = pd.DataFrame({'A': [
    '2.2.2020_P18dhwys',
    '2.1.2020_P18dh234',
    '2.4.2020_P18dh229',
    'P18dh209',
    'P18dh219',
    '2.5.2020_P18dh289',
]})

print(df['A'].str.extract(r'[\d+\.]*_?([\d\w]+)'))

Output:

          0
0  P18dhwys
1  P18dh234
2  P18dh229
3  P18dh209
4  P18dh219
5  P18dh289
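
To write the cleaned values back to the column rather than print them, one option is an anchored str.replace. This is a sketch using a different pattern from the one above: it assumes the prefix is always a run of digits and dots followed by one underscore, and it uses the column name 'ID' from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18dhwys',
    'P18dh209',
    '2.5.2020_P18dh289',
]})

# ^ anchors the match at the start of the string,
# [\d.]+ matches one or more digits or dots, and _ is the separator.
# Rows without such a prefix do not match and are left unchanged.
df['ID'] = df['ID'].str.replace(r'^[\d.]+_', '', regex=True)
print(df['ID'].tolist())
```

Alternatively, passing expand=False to str.extract returns a Series instead of a one-column DataFrame, which can be assigned back in the same way.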

edited Feb 10, 2020 at 19:56

answered Feb 10, 2020 at 19:44

BStadlbauer


1 Comment

smci Over a year ago

The Series.str.extract() approach will be faster than apply'ing a lambda.

onodi · Accepted Answer · 2020-02-10 19:41:13Z

0

If you know that the string to remove is a prefix added with underscore, you could do

df['ID'] = df['ID'].apply(lambda x: x.split('_')[-1])

answered Feb 10, 2020 at 19:41

onodi


1 Comment

Ramy Saad Over a year ago

thank you for this, but the real IDs have several '_', but thank you
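
When the IDs themselves contain underscores, keeping only the last piece of a split on '_' drops part of the ID. A minimal sketch of one workaround follows, assuming that only prefixed IDs start with a digit (as in the question's sample): cut at the first underscore only. The sample IDs and the helper name drop_date_prefix are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [
    '2.2.2020_P18_dh_wys',   # hypothetical prefixed ID with extra underscores
    'P18_dh_209',            # hypothetical unprefixed ID
]})

def drop_date_prefix(value):
    # Assumption: a leading digit means a prefix ending in '_' is present.
    if value[:1].isdigit():
        # partition splits at the first underscore only.
        _, sep, rest = value.partition('_')
        # If there is no underscore at all, keep the value as it is.
        return rest if sep else value
    return value

df['ID'] = df['ID'].apply(drop_date_prefix)
print(df['ID'].tolist())
```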
