Select Pandas rows with regex match

Question

I have the following data frame.

And I have an input list of values.

I want to match each item from the input list against the Symbol and Synonyms columns of the data frame and extract only those rows where the input value appears in either the Symbol column or the Synonyms column (note that the values in the Synonyms column are separated by the '|' symbol).

In the output data frame I need an additional column, Input_symbol, which holds the matching value. In this case the desired output should look like the image below.

How can I do this?

Why is A2MP1 excluded in output?

– Zero, Commented Feb 19, 2018 at 11:57

Zero · Accepted Answer · 2018-02-19 11:59:16Z

3

If I understand correctly, use:

In [346]: df[df.Synonyms.str.contains('|'.join(mylist))]
Out[346]:
     Symbol                   Synonyms
0      A1BG       A1B|ABG|GAB|HYST2477
1       A2M  A2MD|CPAMD5|FWP007|S863-7
2     A2MP1                       A2MP
6  SERPINA3       AACT|ACT|GIG24|GIG25


1 Comment

jezrael Over a year ago

The OP needs:

I want to extract only those rows from the data-frame where the strings in mylist appears in Symbol column or Synonym column(here it is separated by '|' symbol).

- so it is necessary to check both columns :(

jezrael · Accepted Answer · 2018-02-19 12:16:27Z

Check both columns with str.contains, chain the conditions with | (or), and finally filter by boolean indexing:

mylist = ['GAB', 'A2M', 'GIG24']
m1 = df.Synonyms.str.contains('|'.join(mylist))
m2 = df.Symbol.str.contains('|'.join(mylist))

df = df[m1 | m2]

Another solution is to combine all the masks created by a list comprehension with np.logical_or.reduce:

import numpy as np

# One boolean mask per column, then an element-wise OR across all of them
masks = [df[x].str.contains('|'.join(mylist)) for x in ['Symbol','Synonyms']]
m = np.logical_or.reduce(masks)

Or use apply, then DataFrame.any to check for at least one True per row:

m = df[['Symbol','Synonyms']].apply(lambda x: x.str.contains('|'.join(mylist))).any(axis=1)

df = df[m]

print (df)
     Symbol                   Synonyms
0      A1BG       A1B|ABG|GAB|HYST2477
1       A2M  A2MD|CPAMD5|FWP007|S863-7
2     A2MP1                       A2MP
6  SERPINA3       AACT|ACT|GIG24|GIG25
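The str.contains masks above match substrings, which is why A2MP1 shows up in the output: its names contain 'A2M'. If only whole '|'-separated values should count, one option is to anchor the pattern on the separators. The following is a minimal sketch rather than part of the original answer; the sample frame is rebuilt from the data shown in this thread, and the names pattern, alternatives and exact are made up for illustration. re.escape is added on the assumption that input values may contain regex metacharacters.

```python
import re
import pandas as pd

# Sample data rebuilt from the frame shown in this thread
df = pd.DataFrame({
    'Symbol': ['A1BG', 'A2M', 'A2MP1', 'NAT1', 'NAT2', 'NATP', 'SERPINA3'],
    'Synonyms': ['A1B|ABG|GAB|HYST2477', 'A2MD|CPAMD5|FWP007|S863-7', 'A2MP',
                 'AAC1|MNAT|NAT-1|NATI', 'AAC2|NAT-2|PNAT', 'AACP|NATP1',
                 'AACT|ACT|GIG24|GIG25'],
})
mylist = ['GAB', 'A2M', 'GIG24']

# Escape each value so regex metacharacters are taken literally, then require
# the match to be bounded by the start/end of the string or a literal '|'.
# Non-capturing groups keep pandas from warning about match groups.
alternatives = '|'.join(re.escape(v) for v in mylist)
pattern = rf'(?:^|\|)(?:{alternatives})(?:\||$)'

m1 = df['Symbol'].str.contains(pattern, regex=True)
m2 = df['Synonyms'].str.contains(pattern, regex=True)

exact = df[m1 | m2]
print(exact)
```

With this pattern, A2MP1 is no longer selected, because 'A2M' is only a prefix of 'A2MP1' and 'A2MP', not a complete value.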

Anton vBR · Accepted Answer · 2018-02-23 17:46:22Z

2

The question has changed. What you want to do now is look through the two columns (Symbol and Synonyms) and, if you find a value that is in mylist, return it. If there is no match you can return 'No match!' (for instance).

import pandas as pd
import io

s = '''\
Symbol,Synonyms
A1BG,A1B|ABG|GAB|HYST2477
A2M,A2MD|CPAMD5|FWP007|S863-7
A2MP1,A2MP
NAT1,AAC1|MNAT|NAT-1|NATI
NAT2,AAC2|NAT-2|PNAT
NATP,AACP|NATP1
SERPINA3,AACT|ACT|GIG24|GIG25'''

mylist = ['GAB', 'A2M', 'GIG24']
df = pd.read_csv(io.StringIO(s))

# Store the lookup series: Symbol and Synonyms joined with '|', then split into a list per row
lookup_serie = df['Symbol'].str.cat(df['Synonyms'],'|').str.split('|')

# Lambda that returns the first value found in mylist, or 'No match!' if there is none
f = lambda x: next((i for i in x if i in mylist), 'No match!')

df.insert(0,'Input_Symbol',lookup_serie.apply(f))
print(df)

Returns

  Input_Symbol    Symbol                   Synonyms
0          GAB      A1BG       A1B|ABG|GAB|HYST2477
1          A2M       A2M  A2MD|CPAMD5|FWP007|S863-7
2    No match!     A2MP1                       A2MP
3    No match!      NAT1       AAC1|MNAT|NAT-1|NATI
4    No match!      NAT2            AAC2|NAT-2|PNAT
5    No match!      NATP                 AACP|NATP1
6        GIG24  SERPINA3       AACT|ACT|GIG24|GIG25

Old solution:

f = lambda x: [i for i in x.split('|') if i in mylist] != []

m1 = df['Symbol'].apply(f)
m2 = df['Synonyms'].apply(f)

df[m1 | m2]

edited Feb 23, 2018 at 17:46

answered Feb 19, 2018 at 12:41


4 Comments

Anton vBR Over a year ago

@jezrael What do you think about this? I am not sure how I can make the lookup_serie more readable.

Sajil C K Over a year ago

Thanks a lot. This almost solves my problem. One thing: since my actual data frame has around 60,100 rows and the input list has around 9000 items, is there a way to make it a bit more time efficient? Also, I don't need the non-matching rows. Would dropping them make it more time efficient?

Anton vBR Over a year ago

@SAJILCK Yeah, well, you can change it to None or maybe '', but that shouldn't give you a big boost. Unfortunately I can't think of anything that would speed it up.

jezrael Over a year ago

@AntonvBR - Not sure if it is possible to speed it up. In my opinion the regex solution should fail, because there are too many values.
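For the larger inputs mentioned in these comments, one alternative that avoids both a very long regex and a Python-level lambda per row is to split the names into one row each and test membership with isin. This is only a sketch, not something proposed or timed in the thread: it assumes pandas 0.25 or later (for Series.explode) and no missing values in either column, and the names wanted, tokens, hits and result are made up for illustration. It also drops the non-matching rows, as asked.

```python
import pandas as pd

# Sample data rebuilt from the frame shown in this thread
df = pd.DataFrame({
    'Symbol': ['A1BG', 'A2M', 'A2MP1', 'NAT1', 'NAT2', 'NATP', 'SERPINA3'],
    'Synonyms': ['A1B|ABG|GAB|HYST2477', 'A2MD|CPAMD5|FWP007|S863-7', 'A2MP',
                 'AAC1|MNAT|NAT-1|NATI', 'AAC2|NAT-2|PNAT', 'AACP|NATP1',
                 'AACT|ACT|GIG24|GIG25'],
})
mylist = ['GAB', 'A2M', 'GIG24']
wanted = set(mylist)  # set membership is cheap even for thousands of items

# One row per individual name: join Symbol and Synonyms, split on '|', explode.
# The exploded Series keeps the original row label as its index.
tokens = (df['Symbol'].str.cat(df['Synonyms'], sep='|')
            .str.split('|')
            .explode())

# Keep only the names that are in the input list
hits = tokens[tokens.isin(wanted)]

# If a row matches several input values, keep only the first one
hits = hits[~hits.index.duplicated(keep='first')]

# Select the matching rows and add the matched value as the first column
result = df.loc[hits.index].copy()
result.insert(0, 'Input_symbol', hits)
print(result)
```

Because the comparison is on whole values rather than substrings, A2MP1 is not matched by 'A2M' here. Whether this is actually faster on 60,000 rows and 9,000 inputs would need to be measured on the real data.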
