1

I have an Excel file with 2 columns. I want to remove some parts of the strings in the second column (C2). The problem is that the file is huge, so I don't know the exact substrings I want to remove; I only know which parts I want to keep. A further complication is that the parts I want to keep are single letters, and those letters also appear inside the parts I want to remove. The following is an example:

The original data looks like this (C1 and C2 are the column names):

C1              C2
T1              L_1>K>J>P000RTK>P
T2              K>L>L>PY0BDJS
T3              P>P>P000FTKL>L

I need results like the following. I only want to keep the parts with one letter and remove the rest:

C1              C2
T1              L_1>K>J>P
T2              K>L>L
T3              P>P>L

Thanks

2
  • Can you show the code you've attempted this with and what problems you ran into? And why is this tagged 'pandas'? Commented May 8, 2017 at 3:55
  • @pvg was kind enough to point out that my previous answer was inaccurate. I've updated the answer such that it is now accurate. My apologies for missing a nuance of the problem. Commented May 8, 2017 at 16:11

2 Answers

3

If you are reading the file into a pandas DataFrame, you can use DataFrame.replace() with regex=True to remove the unwanted parts of the cell values.

>>> df.replace(r">\w{2,}", "", regex=True)

C1  C2
0   T1  L_1>K>J>P
1   T2  K>L>L
2   T3  P>P>L

Disclaimer: there are cases where the regex I've used fails, such as P000RTK>L_1>K>J>P (thanks @piRSquared for pointing it out). It is given only as an example built from the values in the question; you need to write your own regex, one that fits your data, when using replace with regex=True.
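
As a rough illustration of what a regex that "fits your needs" might look like, here is a minimal sketch using only Python's re module. It assumes that "more than one letter" means a '>'-separated field containing at least two ASCII letters, contiguous or not, and that no input starts with '>'. The names MULTI_LETTER_FIELD and drop_multi_letter_fields are made up for this example.

```python
import re

# Assumption: a field is unwanted when it contains at least two ASCII letters.
# The pattern matches such a field together with the '>' in front of it, or
# the field alone when it is the very first one in the string.
MULTI_LETTER_FIELD = re.compile(r"(?:^|>)[^>]*[A-Za-z][^>]*[A-Za-z][^>]*")


def drop_multi_letter_fields(value: str) -> str:
    """Remove every '>'-separated field that has two or more letters."""
    stripped = MULTI_LETTER_FIELD.sub("", value)
    # If the first field was removed, its trailing '>' is now at the front.
    # Assumption: the original string never starts with '>'.
    return stripped.lstrip(">")


samples = [
    "L_1>K>J>P000RTK>P",
    "K>L>L>PY0BDJS",
    "P>P>P000FTKL>L",
    "P000RTK>L_1>K>J>P",  # the failing case from the disclaimer
]
for text in samples:
    print(text, "->", drop_multi_letter_fields(text))
```

With a DataFrame like the one in the question, the same function could be applied with df["C2"].map(drop_multi_letter_fields).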

2 Comments

Notice that you would not be able to remove the first instance of the condition: 'P000RTK>L_1>K>J>P' would not get filtered to 'L_1>K>J>P'.
@piRSquared You are right. That specific case can be fixed with an or operator, but there are still many cases where it could fail. I couldn't really tell what is expected, so I worked from the example so that the OP or others can implement their own regex. I'll add a disclaimer, thanks.
2

According to your condition, you want to keep only those parts that contain one letter. That implies you want to remove things like

  • 'P_K': non-contiguous multiple letters
  • 'PK_': contiguous multiple letters

My strategy is to split each string on '>' and filter out the elements that contain more than one letter:

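# True for the fields that contain fewer than two ASCII letters
f = lambda x: x.str.count('[A-Za-z]') < 2
# one row per '>'-separated field, indexed by (original row, field position)
s = df.C2.str.split('>', expand=True).stack()
# Series.compress was removed in pandas 1.0; .loc with a callable is equivalent
df.assign(C2=s.loc[f].groupby(level=0).apply('>'.join))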

   C1         C2
0  T1  L_1>K>J>P
1  T2      K>L>L
2  T3      P>P>L
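
For readers who want to see the same split-and-filter idea without the stack/groupby machinery, here is a minimal plain-Python sketch. The helper name keep_single_letter_fields and the rows dictionary are made up for illustration; the letter test mirrors the '[A-Za-z]' count used above, so fields with no letters at all are also kept.

```python
import re


def keep_single_letter_fields(value: str, sep: str = ">") -> str:
    """Split on `sep` and keep the fields with fewer than two ASCII letters."""
    fields = value.split(sep)
    kept = [f for f in fields if len(re.findall(r"[A-Za-z]", f)) < 2]
    return sep.join(kept)


# Sample values copied from the question, keyed by their C1 label.
rows = {
    "T1": "L_1>K>J>P000RTK>P",
    "T2": "K>L>L>PY0BDJS",
    "T3": "P>P>P000FTKL>L",
}
cleaned = {label: keep_single_letter_fields(text) for label, text in rows.items()}
for label, text in cleaned.items():
    print(label, text)
```

On a DataFrame this could be applied row by row with df["C2"].map(keep_single_letter_fields), which is likely slower than the vectorized version on a huge file but easier to step through.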

4 Comments

This happens to work for the sample output but doesn't actually do what the question asks which is remove the sub-fields that only have one letter.
@pvg thanks for pointing that out. If you are my down vote, please reconsider after my latest edit.
I upvoted it for the fix to correctness and the glorious rubegoldbergry, although I have to wonder whether it's entirely necessary, however educational. For one thing, a regex can match '2 or more of' natively without the need for a count.
@pvg The reason I went with count is to account for non-contiguous letters like 'P__023__K'. Counting the 'P' and 'K' the way my code does seems intuitive and straightforward; the regex to do the same is not. umutto's solution doesn't account for this: '\w{2,}' finds 'PK' but does not find 'P_K'.
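
A quick way to compare the two criteria from this thread on a few made-up fields is sketched below. One nuance worth checking: in Python's re module, \w also matches digits and the underscore, so '\w{2,}' does match 'P_K'. Where the two approaches diverge is on fields such as 'L_1' (one letter, but three word characters) and 'P-K' (two letters separated by a non-word character). The helper names are hypothetical.

```python
import re


def is_word_run(field: str) -> bool:
    """True when the whole field is a run of two or more \\w characters."""
    return re.fullmatch(r"\w{2,}", field) is not None


def letter_count(field: str) -> int:
    """Number of ASCII letters in the field, contiguous or not."""
    return len(re.findall(r"[A-Za-z]", field))


for field in ["PK", "P_K", "P__023__K", "L_1", "P-K"]:
    print(f"{field!r}: word run={is_word_run(field)}, letters={letter_count(field)}")
```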
