Pandas extracting values from rows based on set of strings

Question

I'm trying to extract specific values(in form of key:value pairs) from a pandas column which has multiple semicolon separated pairs.

The input dataframe looks like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";"  .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .

and I'm working on 10th column, which looks like this:

"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"

With pairs in format: identifier ""value""

While I can extract the values by converting that column into another dataframe and selecting the relevant rows, the problems is that the data in that column itself is not sorted properly.

I'm just interested in gene_id, gene_name and gene_biotype in this case, but in future might alter the specifications on the required terms. I could have worked on a dictionary based solution, but the values are not unique for each row, and in some rows they don't exist at all (rows with . for column 10).

Ultimately, I want the dataframe to look like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 ENSMUSG00000111734  Gm29825 lincRNA .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .

What would be the most efficient way to do this in pandas ?

You might get more answers if you can edit down your sample so it's easier for people to see what you're doing, removing everything but what's needed to reproduce the issue. — ASGM
– ASGM, Commented Mar 29, 2018 at 14:51
@ASGM, thanks for the suggestion! Downsampled the dataframe. — Siddharth
– Siddharth, Commented Mar 29, 2018 at 14:58

michaelg · Accepted Answer · 2018-03-29 14:56:44Z

1

Regex on pandas column

You can use a regex expression after the .str parameter on a column

df['gene_id'] = df.iloc[:,9].str.extract('gene_id \"(\w+)\";')
df['gene_name'] = df.iloc[:,9].str.extract('gene_name \"(\w+)\";')
df['gene_biotype'] =df.iloc[:,9].str.extract('gene_biotype \"(\w+)\";')

answered Mar 29, 2018 at 14:56

michaelg

9546 silver badges12 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Pandas extracting values from rows based on set of strings

1 Answer 1

Regex on pandas column

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Regex on pandas column

Comments

Your Answer

Sign up or log in

Post as a guest

Related