Remove unknown part of string in python

Question

I have an Excel file with two columns. I want to remove some parts of the strings in the second column (C2). The problem is that the file is huge, so I don't know the exact substrings I want to remove; I only know which parts I want to keep. The other issue is that the parts I want to keep are single letters, and those letters also appear inside the parts I want to remove. The following is an example:

The original data looks like this (C1 and C2 are the column names):

C1              C2
T1              L_1>K>J>P000RTK>P
T2              K>L>L>PY0BDJS
T3              P>P>P000FTKL>L

I need results like the following. I only want to keep the parts that contain a single letter and remove the rest.

C1              C2
T1              L_1>K>J>P
T2              K>L>L
T3              P>P>L

Thanks

Can you show the code you've attempted this with and what problems you ran into? And why is this tagged 'pandas'?
– pvg, Commented May 8, 2017 at 3:55
@pvg was kind enough to point out that my previous answer was inaccurate. I've updated the answer so that it is now accurate. My apologies for missing a nuance of the problem.
– piRSquared, Commented May 8, 2017 at 16:11

umutto · Accepted Answer · 2017-05-09 02:03:39Z

3

If you are using a pandas DataFrame to read the file, you can use DataFrame.replace() with regex=True to strip the unwanted parts of the cell values.

>>> df.replace(r">\w{2,}", "", regex=True)

   C1         C2
0  T1  L_1>K>J>P
1  T2      K>L>L
2  T3      P>P>L

Disclaimer: there are cases where this regex fails, such as P000RTK>L_1>K>J>P (thanks @piRSquared for pointing it out). It is only an example built from the values in the question; you will need to write a regex that fits your own data when using replace with regex=True.
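
To make the limitation concrete, here is a minimal sketch that rebuilds the sample data from the question and applies the same pattern to the problematic string. The DataFrame construction is an assumption about what the Excel file looks like once loaded, and the variable names are illustrative only. Keep in mind that `\w` matches digits and underscores as well as letters.

```python
import pandas as pd

# Sample values copied from the question; building the frame by hand is an assumption.
df = pd.DataFrame({
    "C1": ["T1", "T2", "T3"],
    "C2": ["L_1>K>J>P000RTK>P", "K>L>L>PY0BDJS", "P>P>P000FTKL>L"],
})

pattern = r">\w{2,}"  # a '>' followed by two or more word characters

cleaned = df.replace(pattern, "", regex=True)
print(cleaned["C2"].tolist())

# Failure case from the disclaimer: the leading piece has no '>' in front of it,
# so it survives, while '>L_1' is removed because \w also matches '_' and digits.
edge = pd.Series(["P000RTK>L_1>K>J>P"]).str.replace(pattern, "", regex=True)
print(edge.tolist())
```

The sample rows come out as expected only because every unwanted piece happens to follow a '>' and the `L_1` piece happens to sit at the start of its string.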

edited May 9, 2017 at 2:03

answered May 8, 2017 at 4:01

umutto


2 Comments

piRSquared Over a year ago

Notice that you would not be able to remove the first instance of the condition. 'P000RTK>L_1>K>J>P' would not get filtered to 'L_1>K>J>P'

umutto Over a year ago

@piRSquared you are right, that specifically can be fixed with an or operator, but there still are many cases it could fail. Couldn't really understand what is expected, thus worked with the example so that OP or others can implement their own regex. I'll add a disclaimer, thanks.

piRSquared · Accepted Answer · 2017-05-08 16:06:52Z

2

According to your condition, you want to keep only those parts that contain one letter. That implies you want to remove things like

'P_K': non-contiguous multiple letters
'PK_': contiguous multiple letters

My strategy is to split each string on '>' and filter out the elements that contain more than one letter.

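# True for pieces that contain fewer than two ASCII letters
f = lambda x: x.str.count('[A-Za-z]') < 2
# one row per '>'-separated piece, indexed by (original row, position)
s = df.C2.str.split('>', expand=True).stack()
# Series.compress was removed in pandas 1.0; .loc accepts the same callable
df.assign(C2=s.loc[f].groupby(level=0).apply('>'.join))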

   C1         C2
0  T1  L_1>K>J>P
1  T2      K>L>L
2  T3      P>P>L
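
The same split-and-filter idea can also be sketched per string in plain Python, which may be easier to test in isolation. The helper name `keep_single_letter_parts` and the hand-built DataFrame are hypothetical and not part of the answer above; this is one possible way to write it, not the only one.

```python
import re

import pandas as pd

LETTER = re.compile(r"[A-Za-z]")


def keep_single_letter_parts(value, sep=">"):
    """Hypothetical helper: keep the pieces with fewer than two ASCII letters."""
    parts = value.split(sep)
    return sep.join(p for p in parts if len(LETTER.findall(p)) < 2)


# Sample values copied from the question; building the frame by hand is an assumption.
df = pd.DataFrame({
    "C1": ["T1", "T2", "T3"],
    "C2": ["L_1>K>J>P000RTK>P", "K>L>L>PY0BDJS", "P>P>P000FTKL>L"],
})

result = df.assign(C2=df["C2"].map(keep_single_letter_parts))
print(result)
```

Like the `< 2` condition above, this keeps pieces with zero letters as well as pieces with exactly one. Because it counts letters rather than matching a run of characters, it also handles a leading multi-letter piece such as the one in 'P000RTK>L_1>K>J>P'.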

edited May 8, 2017 at 16:06

answered May 8, 2017 at 4:00

piRSquared


4 Comments

pvg Over a year ago

This happens to work for the sample output but doesn't actually do what the question asks which is remove the sub-fields that only have one letter.

piRSquared Over a year ago

@pvg thanks for pointing that out. If you are my down vote, please reconsider after my latest edit.

pvg Over a year ago

I upvoted it for the fix to correctness and the glorious rubegoldbergry although I have to wonder whether it's entirely necessary, however educational. For one thing, a regex can match '2 or more of' natively without the need of a count.

piRSquared Over a year ago

@pvg the reason I went with count is to account for non-contiguous letters like 'P__023__K'. The code I provided for counting the 'P' and 'K' seems intuitive and straightforward. The regex to do the same is not. umutto's solution doesn't account for this: '\w{2,}' finds 'PK' but does not find 'P_K'
