Python - search for pattern within a DataFrame followed by multiple possible strings

Question

I have a dataframe in which one of the columns holds a long list of semicolon-separated strings:

gene_id ENSGACG00000019161; gene_version 1; transcript_id ENSGACT00000025386; transcript_version 1; exon_number 9; gene_name slc7a8a; gene_source ensembl; gene_biotype protein_coding; transcript_name slc7a8a-203; transcript_source ensembl; transcript_biotype protein_coding; exon_id ENSGACE00000225405; exon_version 1;

I want to go row by row and pull out just the string that follows gene_name and precedes the semicolon, so in this case slc7a8a. I'm sorry if this is either a simple question or a repetitive one. I've tried to look through multiple resources, but I don't even know the most concise way to describe what I want to do, so I had difficulty finding anything helpful.

Thank you

Perfect, thank you! Worked like a charm. I'm used to working with wildcard characters in a very superficial way so thought it would be along those lines.
– MStep, Commented Mar 11, 2019 at 23:51
No problem. :) I'm going to post this as an answer just in case someone else needs help with a similar question.
– panktijk, Commented Mar 12, 2019 at 1:26

panktijk · Accepted Answer · 2019-03-12 01:28:39Z

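You can use pandas str.extract, which takes a regex pattern and returns whatever the capture group (the part in parentheses) matched in each row. The \s+ skips the space after gene_name so the result has no leading whitespace:

df['col_name'].str.extract(r'gene_name\s+(.*?);')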
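For anyone who wants to try this end to end, here is a minimal sketch. The column name `attributes` and the second row (which has no gene_name field) are made up for illustration, and the first row is a shortened version of the string from the question. It uses `[^;]+` rather than `.*?`, which should behave the same on data like this, and `expand=False` so the result comes back as a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical frame: "attributes" is an assumed column name.
df = pd.DataFrame({
    "attributes": [
        # Shortened version of the row shown in the question.
        "gene_id ENSGACG00000019161; gene_version 1; gene_name slc7a8a; gene_source ensembl;",
        # Made-up row with no gene_name field, to show the no-match case.
        "gene_id ENSGACG00000000001; gene_version 1;",
    ]
})

# Capture everything between "gene_name" plus whitespace and the next semicolon.
# expand=False returns a Series because the pattern has a single capture group.
df["gene_name"] = df["attributes"].str.extract(r"gene_name\s+([^;]+);", expand=False)

print(df["gene_name"].tolist())
```

Rows where the pattern does not match come back as NaN, so you can filter them afterwards with `df["gene_name"].notna()` if needed.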

