
I have the following data-frame.

enter image description here

and I have an input list of values

enter image description here

I want to match each item from the input list against the Symbol and Synonyms columns of the data-frame and extract only those rows where the input value appears in either the Symbol column or the Synonyms column (please note that the values there are separated by the '|' symbol).

In the output data-frame I need an additional column, Input_symbol, which holds the matching value. In this case the desired output should look like the image below.

How can I do this?

enter image description here

  • Why is A2MP1 excluded in the output? Commented Feb 19, 2018 at 11:57

3 Answers


If I understand correctly, use:

In [346]: df[df.Synonyms.str.contains('|'.join(mylist))]
Out[346]:
     Symbol                   Synonyms
0      A1BG       A1B|ABG|GAB|HYST2477
1       A2M  A2MD|CPAMD5|FWP007|S863-7
2     A2MP1                       A2MP
6  SERPINA3       AACT|ACT|GIG24|GIG25

1 Comment

The OP wrote: "I want to extract only those rows from the data-frame where the strings in mylist appear in the Symbol column or the Synonyms column (here the values are separated by the '|' symbol)", so it is necessary to check both columns :(

Check both columns with str.contains, chain the conditions with | (or), and finally filter by boolean indexing:

mylist = ['GAB', 'A2M', 'GIG24']
m1 = df.Synonyms.str.contains('|'.join(mylist))
m2 = df.Symbol.str.contains('|'.join(mylist))

df = df[m1 | m2]

Another solution is to build the masks in a list comprehension and combine them with np.logical_or.reduce:

import numpy as np

masks = [df[x].str.contains('|'.join(mylist)) for x in ['Symbol','Synonyms']]
m = np.logical_or.reduce(masks)
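As a minimal, self-contained sketch of the mask-combining step: the small data-frame below is a stand-in built from rows shown in this thread, and the names `pattern` and `filtered` are only illustrative. It shows that the reduced mask can be used directly for boolean indexing.

```python
import numpy as np
import pandas as pd

# Stand-in data: a few rows copied from the sample shown in this thread.
df = pd.DataFrame({
    'Symbol': ['A1BG', 'A2M', 'A2MP1', 'NAT1', 'SERPINA3'],
    'Synonyms': [
        'A1B|ABG|GAB|HYST2477',
        'A2MD|CPAMD5|FWP007|S863-7',
        'A2MP',
        'AAC1|MNAT|NAT-1|NATI',
        'AACT|ACT|GIG24|GIG25',
    ],
})
mylist = ['GAB', 'A2M', 'GIG24']
pattern = '|'.join(mylist)  # regex alternation: GAB|A2M|GIG24

# One boolean mask per column, then an element-wise OR across all masks.
masks = [df[col].str.contains(pattern) for col in ['Symbol', 'Synonyms']]
m = np.logical_or.reduce(masks)  # boolean mask, one entry per row

filtered = df[m]
print(filtered)
```

Because `str.contains` matches substrings, the A2MP1 row is kept here as well.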

Or use apply, then DataFrame.any to check for at least one True per row:

m = df[['Symbol','Synonyms']].apply(lambda x: x.str.contains('|'.join(mylist))).any(axis=1)

df = df[m]

print (df)
     Symbol                   Synonyms
0      A1BG       A1B|ABG|GAB|HYST2477
1       A2M  A2MD|CPAMD5|FWP007|S863-7
2     A2MP1                       A2MP
6  SERPINA3       AACT|ACT|GIG24|GIG25
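Note that `str.contains` matches substrings, which is why the A2MP1 row appears in this output: 'A2M' is contained in both 'A2MP1' and 'A2MP'. If only whole values between the '|' separators should count, one possible sketch is to escape each input item with `re.escape` and require a separator or a string boundary on both sides. The pattern layout and the names `alternatives`, `pattern` and `exact` are assumptions for illustration, not part of the answer above.

```python
import io
import re

import pandas as pd

csv_text = """Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25"""

df = pd.read_csv(io.StringIO(csv_text))
mylist = ['GAB', 'A2M', 'GIG24']

# Escape every item so regex metacharacters in the input are taken literally.
alternatives = '|'.join(re.escape(item) for item in mylist)

# A match must start at the string start or after '|', and end at '|' or the string end.
# Non-capturing groups avoid the pandas warning about match groups.
pattern = r'(?:^|\|)(?:' + alternatives + r')(?:\||$)'

m1 = df['Symbol'].str.contains(pattern)
m2 = df['Synonyms'].str.contains(pattern)

exact = df[m1 | m2]
print(exact)
```

With the sample data this keeps rows 0, 1 and 6 and drops A2MP1.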


The question has changed. What you want to do now is look through the two columns (Symbol and Synonyms) and, if you find a value that is in mylist, return it. If there is no match, you can return 'No match!' (for instance).

import pandas as pd
import io

s = '''\
Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25'''

mylist = ['GAB', 'A2M', 'GIG24']
df = pd.read_csv(io.StringIO(s))

# Store the lookup series: Symbol and Synonyms joined with '|', then split into a list of values
lookup_serie = df['Symbol'].str.cat(df['Synonyms'],'|').str.split('|')

# Return the first value that is in mylist, or 'No match!' if there is none
f = lambda x: next((i for i in x if i in mylist), 'No match!')

df.insert(0,'Input_Symbol',lookup_serie.apply(f))
print(df)

Returns

  Input_Symbol    Symbol                   Synonyms
0          GAB      A1BG       A1B|ABG|GAB|HYST2477
1          A2M       A2M  A2MD|CPAMD5|FWP007|S863-7
2    No match!     A2MP1                       A2MP
3    No match!      NAT1       AAC1|MNAT|NAT-1|NATI
4    No match!      NAT2            AAC2|NAT-2|PNAT
5    No match!      NATP                 AACP|NATP1
6        GIG24  SERPINA3       AACT|ACT|GIG24|GIG25

Old solution:

f = lambda x: [i for i in x.split('|') if i in mylist] != []

m1 = df['Symbol'].apply(f)
m2 = df['Synonyms'].apply(f)

df[m1 | m2]
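For larger inputs, one option that might help is to avoid the per-row Python lambda: split the values once, use `Series.explode` to get one value per row, and test membership against a set. This is only a sketch under a few assumptions: neither column contains missing values, pandas is recent enough to provide `explode` (0.25 or later), and only the first matching value per row is wanted. The names `wanted`, `tokens`, `hits` and `result` are illustrative. It also keeps only the matching rows and adds the Input_symbol column.

```python
import io

import pandas as pd

csv_text = """Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25"""

df = pd.read_csv(io.StringIO(csv_text))
mylist = ['GAB', 'A2M', 'GIG24']
wanted = set(mylist)  # set membership stays cheap as the input list grows

# One entry per (original row, value); values come from both columns.
tokens = (
    df['Symbol'].str.cat(df['Synonyms'], sep='|')
    .str.split('|')
    .explode()
)

# Keep exact matches only, then the first hit for each original row.
hits = tokens[tokens.isin(wanted)]
hits = hits[~hits.index.duplicated(keep='first')]

# Select the matching rows and prepend the matched value (aligned on the index).
result = df.loc[hits.index].copy()
result.insert(0, 'Input_symbol', hits)
print(result)
```

Whether this is actually faster on a 60,000-row frame would need to be measured on the real data.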

4 Comments

@jezrael What do you think about this? I am not sure how I can make the lookup_serie more readable.
Thanks a lot. This almost solves my problem. One slight change: since my actual data-frame has around 60,100 rows and the input list has around 9000 items, is there a way to make it a bit more time efficient? Also, I don't need the non-matching rows. Would that make it more time efficient?
@SAJILCK Yeah, well, you can change it to None or maybe '', but that shouldn't give you a big boost. I can't think of anything that would speed it up, unfortunately.
@AntonvBR - Not sure if it is possible to boost it. In my opinion the regex solution should fail, because there are too many values.
