I have a dataframe in which one of the columns holds a long list of semicolon-separated strings:

gene_id ENSGACG00000019161; gene_version 1; transcript_id ENSGACT00000025386; transcript_version 1; exon_number 9; gene_name slc7a8a; gene_source ensembl; gene_biotype protein_coding; transcript_name slc7a8a-203; transcript_source ensembl; transcript_biotype protein_coding; exon_id ENSGACE00000225405; exon_version 1;

I want to go row by row and pull out just the string that follows gene_name and precedes the semicolon, so in this case slc7a8a. I'm sorry if this is either a simple question or a repetitive one. I've looked through multiple resources, but I don't even know the most concise way to describe what I want to do, so I had difficulty finding anything helpful.

Thank you

  • Try this: df['col_name'].str.extract('gene_name(.*?);')? Commented Mar 11, 2019 at 23:40
  • Perfect, thank you! Worked like a charm. I'm used to working with wildcard characters in a very superficial way so thought it would be along those lines. Commented Mar 11, 2019 at 23:51
  • No problem. :) I'm going to post this as an answer just in case someone else needs help with a similar question. Commented Mar 12, 2019 at 1:26

1 Answer

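You can use pandas' Series.str.extract, which takes a regex pattern containing a capture group and returns the captured text for each row. The space after gene_name in the pattern keeps the leading space out of the captured value:

df['col_name'].str.extract('gene_name (.*?);')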
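For anyone who wants to try this end to end, here is a minimal sketch on a shortened version of the row from the question. The column name `col_name`, the new `gene_name` column, and the second row (which deliberately has no gene_name field) are made up for illustration. The pattern is a slightly stricter variant of the one above, not the only way to write it.

```python
import pandas as pd

# Hypothetical two-row frame; 'col_name' stands in for the real column name.
# The first row is a shortened copy of the example row from the question;
# the second row is invented and has no gene_name field at all.
df = pd.DataFrame({
    "col_name": [
        "gene_id ENSGACG00000019161; gene_version 1; exon_number 9; "
        "gene_name slc7a8a; gene_source ensembl;",
        "gene_id ENSGACG00000000001; gene_version 1;",
    ]
})

# \s+ skips the whitespace after the key, and [^;]+ captures everything
# up to (but not including) the next semicolon.
# expand=False returns a Series instead of a one-column DataFrame.
df["gene_name"] = df["col_name"].str.extract(r"gene_name\s+([^;]+);", expand=False)

print(df["gene_name"].tolist())
```

Rows where the pattern does not match come back as NaN, so it may be worth checking `df['gene_name'].isna().sum()` before relying on the new column.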