
Suppose I have the following rows in a Pandas DataFrame:

970 P-A1-1019-03-C15,15 23987896    1   8
971 P-A1-1019-06-B15,15 23251711    4   8
972 P-A1-1019-08-C15,15 12160034    2   8
973 P-A1-1020-01-D15,15 8760012     1   8

I'd like to alter the second column to remove the ",15" from the string. The desired end state would look like this:

970 P-A1-1019-03-C15    23987896    1   8
971 P-A1-1019-06-B15    23251711    4   8
972 P-A1-1019-08-C15    12160034    2   8
973 P-A1-1020-01-D15    8760012     1   8

The thing to remove won't always be ",15", as it could be ",10", ",03", ",4", etc. Additionally, some rows in the input are differently formatted, and may look like this:

4   RR00-0,2020338  24380076    4   12
5   RR00-0,2020738  10562767    2   12
6   ,D              24260808    1   12
7   ,D              23521158    1   12

Initially, I'm only interested in the cases where the string DOES fit the form of "P-A1-1019-03-C15", so it would be nice to be able to drop rows which don't match that specific format.

Is there a built in way to do this kind of processing, or will I need to iterate over every row manually?
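For reference, here is a minimal sketch that rebuilds the sample rows as a DataFrame; the column names are invented for illustration only, since the original headers aren't shown:

import pandas as pd

# Column names are assumptions; values are taken from the rows shown above.
df = pd.DataFrame(
    {
        'sample': ['P-A1-1019-03-C15,15', 'P-A1-1019-06-B15,15',
                   'P-A1-1019-08-C15,15', 'P-A1-1020-01-D15,15',
                   'RR00-0,2020338', ',D'],
        'count': [23987896, 23251711, 12160034, 8760012, 24380076, 24260808],
        'col3': [1, 4, 2, 1, 4, 1],
        'col4': [8, 8, 8, 8, 12, 12],
    },
    index=[970, 971, 972, 973, 4, 6],
)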

  • Your first task is easy: just do df['col'] = df['col'].str.replace(',15', ''). For the second, you can filter using a regular expression, something like df[df['col'].str.contains(regex)]. Commented Mar 13, 2015 at 16:31
  • Will that str.replace(',15', '') work for the case where the thing to remove is ',11'? Commented Mar 13, 2015 at 16:34
  • No, it looks for exact matches. It depends on how varied your data is; you could just slice the strings: df['col'] = df['col'].str[:-3], which strips the last 3 characters off, or df['col'] = df['col'].str[:16] if you want the first 16 characters. Commented Mar 13, 2015 at 16:36
  • Then I could filter with df[df['col'].str.contains(regex)] first, and once all the strings are uniformly formatted, strip the last three... Commented Mar 13, 2015 at 16:38
  • Something like that, but it may be easier still to say df[df['col'].str.len() >= 16] if all the duff values are shorter than that. A regex pattern is better, so long as it is precise enough to match the data you expect (see the sketch after these comments). Commented Mar 13, 2015 at 16:40
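A sketch of the approach discussed in these comments, assuming a DataFrame df whose string column is called 'sample' (as in the sketch above) and generalising the regex from the single example string shown:

# Keep only rows shaped like 'P-A1-1019-03-C15,15'; the pattern is an assumption.
pattern = r'^P-A1-\d{4}-\d{2}-[A-Z]\d{2},\d+$'
clean = df[df['sample'].str.match(pattern, na=False)].copy()

# Strip the trailing ',15' / ',03' / ',4' and so on.
clean['sample'] = clean['sample'].str.replace(r',\d+$', '', regex=True)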

1 Answer


This should remove all ',15' values:

dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' else value)

This should remove all ',15' values if they are in the format you provided:

dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' and 'P-A1-' in value else value)
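As a quick check of that snippet on the sample strings from the question (the Series here is just a throwaway demo):

import pandas as pd

s = pd.Series(['P-A1-1019-03-C15,15', 'RR00-0,2020338', ',D'])
cleaned = s.apply(lambda value: value.split(',')[0]
                  if value.split(',')[-1] == '15' and 'P-A1-' in value
                  else value)
print(cleaned.tolist())
# ['P-A1-1019-03-C15', 'RR00-0,2020338', ',D']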