Suppose I have the following rows in a Pandas DataFrame:
970 P-A1-1019-03-C15,15 23987896 1 8
971 P-A1-1019-06-B15,15 23251711 4 8
972 P-A1-1019-08-C15,15 12160034 2 8
973 P-A1-1020-01-D15,15 8760012 1 8
I'd like to alter the second column to remove the ",15" from the string. The desired end state would look like this:
970 P-A1-1019-03-C15 23987896 1 8
971 P-A1-1019-06-B15 23251711 4 8
972 P-A1-1019-08-C15 12160034 2 8
973 P-A1-1020-01-D15 8760012 1 8
The thing to remove won't always be ",15", as it could be ",10", ",03", ",4", etc. Additionally, some rows in the input are differently formatted, and may look like this:
4 RR00-0,2020338 24380076 4 12
5 RR00-0,2020738 10562767 2 12
6 ,D 24260808 1 12
7 ,D 23521158 1 12
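For reference, the sample rows above can be reproduced roughly like this (the column names are placeholders I've made up, and the second column is just called 'col' here):

    import pandas as pd

    # Column names are invented placeholders; only the values come from the rows above.
    df = pd.DataFrame(
        {
            "col": ["P-A1-1019-03-C15,15", "P-A1-1019-06-B15,15",
                    "P-A1-1019-08-C15,15", "P-A1-1020-01-D15,15",
                    "RR00-0,2020338", ",D"],
            "reads": [23987896, 23251711, 12160034, 8760012, 24380076, 24260808],
            "n": [1, 4, 2, 1, 4, 1],
            "group": [8, 8, 8, 8, 12, 12],
        },
        index=[970, 971, 972, 973, 4, 6],
    )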
Initially, I'm only interested in the cases where the string DOES fit the form of "P-A1-1019-03-C15", so it would be nice to be able to drop rows which don't match that specific format.
Is there a built-in way to do this kind of processing, or will I need to iterate over every row manually?
For the first part you can use str.replace:

    df['col'] = df['col'].str.replace(',15', '')

For the second you can filter using a regex expression, something like:

    df[df['col'].str.contains(regex)]

Alternatively, you can slice the strings:

    df['col'] = df['col'].str[:-3]

which will strip the last 3 characters off, or

    df['col'] = df['col'].str[:16]

if you want the first 16 characters.

A sensible order is to filter with df[df['col'].str.contains(regex)] first, then, once all the remaining strings are uniformly formatted, strip the last three. You could also filter with df[df['col'].str.len() >= 16] if all the duff values are shorter than that, but a regex pattern is better so long as it is precise enough to match the data you expect.
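Since the tail after the comma isn't always three characters (the question mentions ",4"), a regex-based replace is safer than fixed-length slicing. Here is a sketch of the filter-then-strip approach put together; the column name 'col' is a placeholder and the pattern is only a guess built from the sample value "P-A1-1019-03-C15,15", so adjust both to your actual data.

    # Guessed shape: "P-A1-", four digits, "-", two digits, "-", a letter
    # and two digits, then the ",NN" tail to be removed.
    pattern = r'^P-A1-\d{4}-\d{2}-[A-Z]\d{2},\d+$'

    # Keep only the rows whose second column matches that shape.
    mask = df['col'].str.match(pattern)
    clean = df.loc[mask].copy()

    # Drop everything from the comma onwards (handles ",15", ",03", ",4", ...).
    clean['col'] = clean['col'].str.replace(r',\d+$', '', regex=True)

Note that str.match only anchors at the start of the string, so the trailing $ in the pattern is what guarantees the whole value conforms; on newer pandas you could use str.fullmatch instead.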