I'm trying to extract specific values (in the form of key:value pairs) from a pandas column that holds multiple semicolon-separated pairs.

The input dataframe looks like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";"  .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .

I'm working on the 10th column, which looks like this:

"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"

Each pair has the format: identifier ""value""

While I can extract the values by converting that column into another dataframe and selecting the relevant rows, the problem is that the pairs in that column do not appear in a consistent order.

For now I'm only interested in gene_id, gene_name and gene_biotype, but in the future I might change which keys are required. I could have used a dictionary-based solution, but the values are not unique for each row, and in some rows they don't exist at all (rows with . in column 10).

Ultimately, I want the dataframe to look like this:

9   114188457   114192289   cast_3_930|cast_1_1069|cast_2_985   0.9510007336163186  -   114188457   114188457   211,111,111 ENSMUSG00000111734  Gm29825 lincRNA .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .
9   114227850   114241851   cast_3_932|cast_1_1071|cast_2_988   1.2516483862692769  +   114227850   114227850   211,111,111 ENSMUSG00000064299  4921528I07Rik   processed_transcript    .

What would be the most efficient way to do this in pandas?

  • You might get more answers if you can edit down your sample so it's easier for people to see what you're doing, removing everything but what's needed to reproduce the issue. Commented Mar 29, 2018 at 14:51
  • @ASGM, thanks for the suggestion! Downsampled the dataframe. Commented Mar 29, 2018 at 14:58

1 Answer

Regex on pandas column

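You can apply a regular expression through the .str accessor on a column. The patterns below assume the doubled quotes ("") shown in the raw text were collapsed to single quotes when the file was read into pandas: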

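# [^"]+ also captures values that contain characters such as "-" or "."
df['gene_id'] = df.iloc[:, 9].str.extract(r'gene_id "([^"]+)"', expand=False)
df['gene_name'] = df.iloc[:, 9].str.extract(r'gene_name "([^"]+)"', expand=False)
df['gene_biotype'] = df.iloc[:, 9].str.extract(r'gene_biotype "([^"]+)"', expand=False)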
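
Since the question mentions that the list of required keys may change later, the same idea can be wrapped in a small helper that loops over a list of keys. This is only a sketch: the function name extract_attributes, the column name "attributes" and the two-row sample frame are made up for illustration, and it assumes the quotes in the column are single quotes after loading. Rows that contain only "." simply yield NaN for every key.

```python
import re
import pandas as pd

def extract_attributes(df, column, keys):
    """Return a copy of df with one new column per requested key.

    `column` is the positional index of the attribute column.
    (Hypothetical helper, not part of pandas.)
    """
    out = df.copy()
    attrs = out.iloc[:, column].astype(str)
    for key in keys:
        # Anchor the key at the start of the string or right after a ';'
        # so that e.g. "gene_version" does not match "havana_gene_version".
        pattern = rf'(?:^|;\s*){re.escape(key)} "([^"]*)"'
        out[key] = attrs.str.extract(pattern, expand=False)
    return out

# Tiny illustrative frame; the second row mimics a row with "." in column 10.
df = pd.DataFrame({
    "chrom": ["9", "9"],
    "attributes": [
        'gene_id "ENSMUSG00000111734"; gene_version "1"; '
        'gene_name "Gm29825"; gene_biotype "lincRNA";',
        ".",
    ],
})

wanted = ["gene_id", "gene_name", "gene_biotype"]
result = extract_attributes(df, 1, wanted)
print(result[wanted])
```

Changing which fields are extracted then only means editing the wanted list. The attribute column can be dropped afterwards with DataFrame.drop if it is no longer needed.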