
Suppose I have the following rows in a Pandas DataFrame:

970 P-A1-1019-03-C15,15 23987896    1   8
971 P-A1-1019-06-B15,15 23251711    4   8
972 P-A1-1019-08-C15,15 12160034    2   8
973 P-A1-1020-01-D15,15 8760012     1   8

I'd like to alter the second column to remove the ",15" from the string. The desired end state would look like this:

970 P-A1-1019-03-C15    23987896    1   8
971 P-A1-1019-06-B15    23251711    4   8
972 P-A1-1019-08-C15    12160034    2   8
973 P-A1-1020-01-D15    8760012     1   8

The thing to remove won't always be ",15", as it could be ",10", ",03", ",4", etc. Additionally, some rows in the input are differently formatted, and may look like this:

4   RR00-0,2020338  24380076    4   12
5   RR00-0,2020738  10562767    2   12
6   ,D              24260808    1   12
7   ,D              23521158    1   12

Initially, I'm only interested in the cases where the string DOES fit the form of "P-A1-1019-03-C15", so it would be nice to be able to drop rows which don't match that specific format.

Is there a built in way to do this kind of processing, or will I need to iterate over every row manually?
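For reference, here is a minimal sketch that rebuilds the sample rows as a DataFrame; the column names are invented for illustration only, since the original headers aren't shown:

import pandas as pd

# Column names are assumptions; values are taken from the rows shown above.
df = pd.DataFrame(
    {
        'sample': ['P-A1-1019-03-C15,15', 'P-A1-1019-06-B15,15',
                   'P-A1-1019-08-C15,15', 'P-A1-1020-01-D15,15',
                   'RR00-0,2020338', ',D'],
        'count': [23987896, 23251711, 12160034, 8760012, 24380076, 24260808],
        'col3': [1, 4, 2, 1, 4, 1],
        'col4': [8, 8, 8, 8, 12, 12],
    },
    index=[970, 971, 972, 973, 4, 6],
)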

  • Your first task is easy: just do df['col'] = df['col'].str.replace(',15', ''). For the second, you can filter using a regular expression, something like df[df['col'].str.contains(regex)]. Commented Mar 13, 2015 at 16:31
  • Will that str.replace(',15', '') work for the case where the thing to remove is ',11'? Commented Mar 13, 2015 at 16:34
  • No, it looks for exact matches. It depends on how varied your data is; you could just slice the strings: df['col'] = df['col'].str[:-3], which strips the last 3 characters off, or df['col'] = df['col'].str[:16] if you want the first 16 characters. Commented Mar 13, 2015 at 16:36
  • Then I could filter with df[df['col'].str.contains(regex)] first, and once all the strings are uniformly formatted, strip the last three... Commented Mar 13, 2015 at 16:38
  • Something like that, but it may be easier still to say df[df['col'].str.len() >= 16] if all the duff values are shorter than that. A regex pattern is better, so long as it is precise enough to match the data you expect (see the sketch after these comments). Commented Mar 13, 2015 at 16:40
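A sketch of the approach discussed in these comments, assuming a DataFrame df whose string column is called 'sample' (as in the sketch above) and generalising the regex from the single example string shown:

# Keep only rows shaped like 'P-A1-1019-03-C15,15'; the pattern is an assumption.
pattern = r'^P-A1-\d{4}-\d{2}-[A-Z]\d{2},\d+$'
clean = df[df['sample'].str.match(pattern, na=False)].copy()

# Strip the trailing ',15' / ',03' / ',4' and so on.
clean['sample'] = clean['sample'].str.replace(r',\d+$', '', regex=True)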

1 Answer


This should remove all ',15' values:

dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' else value)

This should remove all ',15' values if they are in the format you provided:

dataframe['string column'] = dataframe['string column'].apply(lambda value: value.split(',')[0] if value.split(',')[-1] == '15' and 'P-A1-' in value else value)
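As a quick check of that snippet on the sample strings from the question (the Series here is just a throwaway demo):

import pandas as pd

s = pd.Series(['P-A1-1019-03-C15,15', 'RR00-0,2020338', ',D'])
cleaned = s.apply(lambda value: value.split(',')[0]
                  if value.split(',')[-1] == '15' and 'P-A1-' in value
                  else value)
print(cleaned.tolist())
# ['P-A1-1019-03-C15', 'RR00-0,2020338', ',D']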