How can I replace a column given a specific condition in python?

Question

I have a dataframe that contains multiple columns as below...

Chr1    Cufflinks   exon    28354206    28354551    .   .   .   gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
Chr1    Cufflinks   exon    28785549    28786194    .   .   .   gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";
Chr1    Cufflinks   exon    29328712    29329210    .   .   .   gene_id "XLOC_008371"; transcript_id "TCONS_00014349"; exon_number "1"; oId "CUFF.2495.1"; class_code "u"; tss_id "TSS10075";
Chr1    Cufflinks   exon    29427951    29428406    .   .   .   gene_id "XLOC_008372"; transcript_id "TCONS_00014350"; exon_number "1"; oId "CUFF.2506.1"; class_code "u"; tss_id "TSS10076";
Chr1    Cufflinks   exon    29460116    29460585    .   .   .   gene_id "XLOC_008373"; transcript_id "TCONS_00014351"; exon_number "1"; oId "CUFF.2509.1"; class_code "u"; tss_id "TSS10077";

What I am trying to do is this: if any item in my list is present in one of the columns of the dataframe, I replace the value in the 2nd column, changing Cufflinks to lincRNA.

One problem is that the column I use as the dictionary key has the same value in multiple rows of the dataframe. Because dictionary keys are unique, only one row per key is kept, so the output has fewer rows than the input.

Here is my code so far...

#!/usr/bin/env python

file_in = open("lincRNA_final_transcripts.fa")
file_in2 = open("AthalianaslutteandluiN30merged.gtf")
file_out = open("updated.gtf", 'w')

sites = []
result = {}

for line in file_in:
    line = line.strip()
    if line.startswith(">"):
        line = line[1:]
        gene = str.split(line, ".")
        gene = gene[0]
        sites.append(gene)


for line2 in file_in2:
    line2 = line2.strip().split()
    line3 = str.split(line2[11], ";")
    line3 = line3[0]
    line3 = line3[1:-1]
    result[line3] = line2


for id in sites:
    id2 = str(id)
    if id2 in result.keys():
        result[id][1] = "lincRNA"

for val in result.values():
    file_out.write("\t".join(val))
    file_out.write("\n")

Unless an acronym is widely used (i.e. it should at least be listed on the Wikipedia disambiguation page en.wikipedia.org/wiki/DF), it will be essentially meaningless to those who may be able to answer your question.
– timbo, Commented Jan 7, 2016 at 23:39
Sorry, it is a dataframe, or a text file with multiple columns.
– upendra, Commented Jan 7, 2016 at 23:39
Why would you reinvent the wheel? There are perfectly sound libraries for data/dataframe manipulation in Python, like pandas.
– Nelewout, Commented Jan 7, 2016 at 23:53

Community · Accepted Answer · 2017-05-23 12:16:01Z

2

I'll walk through how you would do this in pandas, a Python library for working with dataframes. Learning it makes dataframe manipulations like this one straightforward.

Install pandas:
```
pip install pandas
```

Load your data into a pandas dataframe object. GTF is a tab-delimited format, so pass \t as the separator. If the file has no header line, pass header=None; if the first line is a header, pass header=0 instead. The pandas read_csv documentation describes the remaining parameters.

import pandas as pd
df = pd.read_csv('AthalianaslutteandluiN30merged.gtf', sep='\t', header=None, engine='python')

      0          1     2         3         4  5  6  7  8
0  Chr1  Cufflinks  exon  28354206  28354551  .  .  .  gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
1  Chr1  Cufflinks  exon  28785549  28786194  .  .  .  gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";

Next, check whether the string in column 8 contains any of the IDs in your sites list. Joining the IDs with | builds a regular expression that matches any one of them.
```
sites = ["XLOC_008369", "XLOC_008370"]
pattern = '|'.join(sites)
mask = df[8].str.contains(pattern)
```
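The `sites` list above is hard-coded, whereas in the question it comes from the FASTA header lines. A minimal sketch of building it, assuming each header looks like `>ID.suffix` (the helper name `read_site_ids` and the sample headers are made up for illustration):

```python
def read_site_ids(fasta_lines):
    """Collect the ID (text before the first '.') from each FASTA header line."""
    ids = set()
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumed header shape: ">XLOC_008369.1" -> "XLOC_008369"
            ids.add(line[1:].split(".")[0])
    # Sorted so the result is deterministic; duplicates are already removed.
    return sorted(ids)


# Hypothetical lines standing in for lincRNA_final_transcripts.fa
example = [">XLOC_008369.1", "ACGT", ">XLOC_008370.1", "TTGA", ">XLOC_008369.2", "GGCC"]
sites = read_site_ids(example)
print(sites)  # ['XLOC_008369', 'XLOC_008370']
```

With a real file you would pass the open file object instead of the list, for example `with open("lincRNA_final_transcripts.fa") as fh: sites = read_site_ids(fh)`.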
Use boolean indexing to change Cufflinks to lincRNA in every row where column 8 matches an element of the sites list. The pandas indexing documentation covers .loc in more detail.
```
df.loc[mask, 1] = 'lincRNA'
```
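One caveat: `str.contains` does a substring (regex) match, so a shorter ID such as `XLOC_00836` would also match `XLOC_008369`. If that matters for your data, a possible alternative is to extract the `gene_id` value and test it for exact membership. This is only a sketch; it assumes the attribute column always carries a `gene_id "..."` entry, and the two sample rows are shortened from the question:

```python
import csv
import io

import pandas as pd

# Two sample rows from the question, tab-separated as in a GTF file.
gtf_text = (
    'Chr1\tCufflinks\texon\t28354206\t28354551\t.\t.\t.\t'
    'gene_id "XLOC_008369"; transcript_id "TCONS_00014347";\n'
    'Chr1\tCufflinks\texon\t28785549\t28786194\t.\t.\t.\t'
    'gene_id "XLOC_008370"; transcript_id "TCONS_00014348";\n'
)
# QUOTE_NONE keeps the double quotes in column 8 as literal characters.
df = pd.read_csv(io.StringIO(gtf_text), sep="\t", header=None, quoting=csv.QUOTE_NONE)

sites = ["XLOC_008369"]

# Pull the quoted gene_id value out of column 8, then compare whole IDs.
gene_ids = df[8].str.extract(r'gene_id "([^"]+)"', expand=False)
mask = gene_ids.isin(sites)

df.loc[mask, 1] = "lincRNA"
print(df[1].tolist())  # ['lincRNA', 'Cufflinks']
```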

EDIT: Use str.contains to check if a pandas column contains an element in the list.
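To get the `updated.gtf` file from the question, the dataframe still has to be written back out. A rough sketch of that last step, using an in-memory buffer in place of the real output file so it can be run as is; the sample rows are shortened from the question, and passing `quoting=csv.QUOTE_NONE` is my assumption about how to keep the attribute quotes untouched:

```python
import csv
import io

import pandas as pd

gtf_text = (
    'Chr1\tCufflinks\texon\t28354206\t28354551\t.\t.\t.\t'
    'gene_id "XLOC_008369"; transcript_id "TCONS_00014347";\n'
    'Chr1\tCufflinks\texon\t28785549\t28786194\t.\t.\t.\t'
    'gene_id "XLOC_008370"; transcript_id "TCONS_00014348";\n'
)
df = pd.read_csv(io.StringIO(gtf_text), sep="\t", header=None, quoting=csv.QUOTE_NONE)

# Same replacement as in the walkthrough, for a single ID.
df.loc[df[8].str.contains("XLOC_008369"), 1] = "lincRNA"

# io.StringIO stands in for open("updated.gtf", "w").
out = io.StringIO()
# No header or index column, tab-separated, and no extra quoting around column 8.
df.to_csv(out, sep="\t", header=False, index=False, quoting=csv.QUOTE_NONE)
print(out.getvalue())
```

Because every row of the dataframe is written, the output has the same number of rows as the input, which avoids the row loss caused by keying a dictionary on the ID.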

edited May 23, 2017 at 12:16

CommunityBot

answered Jan 8, 2016 at 2:51

ilyas patanam

5 Comments

upendra Over a year ago

This is awesome. When I read the GTF file, I get this error: "ValueError: Expected 18 fields in line 48, saw 19". What do you think I'm doing wrong?

ilyas patanam Over a year ago

@upendra Pandas expects every row to have the same number of columns; in this case it expects 18. However, line 48 appears to have 19. It may be best to open your file and check for an extra tab or semicolon.

upendra Over a year ago

You are right, there are several rows that have extra columns. Is there a way to deal with these extra columns?

ilyas patanam Over a year ago

@upendra I have edited my answer to use another function for the matching. You no longer have to use semicolons as a separator, which prevents the additional columns from appearing.

upendra Over a year ago

It finally worked. Thanks a lot for all the help. Much appreciated.
