
I have a dataframe that contains multiple columns as below...

Chr1    Cufflinks   exon    28354206    28354551    .   .   .   gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
Chr1    Cufflinks   exon    28785549    28786194    .   .   .   gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";
Chr1    Cufflinks   exon    29328712    29329210    .   .   .   gene_id "XLOC_008371"; transcript_id "TCONS_00014349"; exon_number "1"; oId "CUFF.2495.1"; class_code "u"; tss_id "TSS10075";
Chr1    Cufflinks   exon    29427951    29428406    .   .   .   gene_id "XLOC_008372"; transcript_id "TCONS_00014350"; exon_number "1"; oId "CUFF.2506.1"; class_code "u"; tss_id "TSS10076";
Chr1    Cufflinks   exon    29460116    29460585    .   .   .   gene_id "XLOC_008373"; transcript_id "TCONS_00014351"; exon_number "1"; oId "CUFF.2509.1"; class_code "u"; tss_id "TSS10077";

What I am trying to do is this: if any item in my list is present in one of the columns of the dataframe, I replace the value in the 2nd column, changing Cufflinks to lincRNA.

One problem is that the column I use as the dictionary key has the same value in multiple rows of the dataframe. Because dictionary keys are unique, only one row per key is kept, so the output has fewer rows than the input.
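
A minimal sketch of the symptom, using made-up rows (the IDs and coordinates below are hypothetical and only stand in for the real file):

```python
# Hypothetical rows: the first two share the same ID in the last field.
rows = [
    ["Chr1", "Cufflinks", "exon", "100", "200", "TCONS_00000001"],
    ["Chr1", "Cufflinks", "exon", "300", "400", "TCONS_00000001"],
    ["Chr1", "Cufflinks", "exon", "500", "600", "TCONS_00000002"],
]

by_id = {}
for row in rows:
    # A later row with the same ID silently replaces the earlier one.
    by_id[row[5]] = row

print(len(rows))   # number of input rows
print(len(by_id))  # number of rows that survive in the dictionary
```

Only the last row seen for each ID remains in the dictionary, which is why the output is shorter than the input.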

Here is my code so far...

#!/usr/bin/env python

file_in = open("lincRNA_final_transcripts.fa")
file_in2 = open("AthalianaslutteandluiN30merged.gtf")
file_out = open("updated.gtf", 'w')

sites = []
result = {}

for line in file_in:
    line = line.strip()
    if line.startswith(">"):
        line = line[1:]
        gene = str.split(line, ".")
        gene = gene[0]
        sites.append(gene)


for line2 in file_in2:
    line2 = line2.strip().split()
    line3 = str.split(line2[11], ";")
    line3 = line3[0]
    line3 = line3[1:-1]
    result[line3] = line2


for id in sites:
    id2 = str(id)
    if id2 in result.keys():
        result[id][1] = "lincRNA"

for val in result.values():
    file_out.write("\t".join(val))
    file_out.write("\n")
  • Could you explain what is a df? Commented Jan 7, 2016 at 23:34
  • Unless an acronym is widely used (i.e. it should be at least listed on the Wikipedia disambiguation page en.wikipedia.org/wiki/DF), then it will be essentially meaningless to those who may be able to answer your question. Commented Jan 7, 2016 at 23:39
  • Sorry, it is a dataframe, or a text file with multiple columns. Commented Jan 7, 2016 at 23:39
  • I have now edited my question. Commented Jan 7, 2016 at 23:41
  • Why would you re-invent the wheel? There are perfectly sound libraries for data/dataframe manipulation in Python, like pandas. Commented Jan 7, 2016 at 23:53

1 Answer


I'll try to give a walkthrough of how you would do this in pandas. Pandas is a Python library for handling dataframes, and learning it makes dataframe manipulation much easier.

  1. Install pandas

    sudo pip install pandas
    
  2. Load your data into a pandas dataframe object. GTF appears to be a tab-delimited format, so pass \t as the separator. If there is no header line, pass header = None; if the first line is a header, pass header = 0 instead. For more information on the parameters, see the pandas read_csv documentation.

    import pandas as pd
    df = pd.read_csv('AthalianaslutteandluiN30merged.gtf', sep = '\t', header = None, engine = 'python')
    
       0     1          2     3         4         5  6  7  8
    0  Chr1  Cufflinks  exon  28354206  28354551  .  .  .  gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
    1  Chr1  Cufflinks  exon  28785549  28786194  .  .  .  gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";
    
  3. Check whether the strings in column 8 contain a substring that also appears in your sites list. Joining the list with | builds a regular expression that str.contains tests against every row.

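    # Example IDs; in practice these come from the FASTA headers
    sites = ["XLOC_008369", "XLOC_008370"]
    # Regex alternation: a row matches if any ID occurs in column 8
    pattern = '|'.join(sites)
    mask = df[8].str.contains(pattern)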
    
  4. Use boolean indexing to change Cufflinks to lincRNA wherever column 8 contains a substring that matches an element of the sites list. See the pandas indexing documentation for more on boolean indexing.

    df.loc[mask,1] = 'lincRNA'
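
Steps 2 to 4 can be tried without touching the real file. The following is only a sketch under a few assumptions: the two rows are shortened versions of the rows in the question, the `sites` list is an example value, and `quoting=csv.QUOTE_NONE` and `re.escape` are additions that are not part of the walkthrough (the first keeps the double quotes in column 8 as literal text, the second guards against regex metacharacters in an ID).

```python
import csv
import io
import re

import pandas as pd

# Two rows shortened from the question, tab-separated as in a GTF file.
gtf_text = (
    'Chr1\tCufflinks\texon\t28354206\t28354551\t.\t.\t.\t'
    'gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1";\n'
    'Chr1\tCufflinks\texon\t28785549\t28786194\t.\t.\t.\t'
    'gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1";\n'
)

# Step 2: load the text; column 8 holds the whole attribute string.
df = pd.read_csv(io.StringIO(gtf_text), sep="\t", header=None,
                 quoting=csv.QUOTE_NONE)

# Step 3: build a mask of rows whose attributes mention any ID in sites.
sites = ["XLOC_008369"]  # assumed example list
pattern = "|".join(re.escape(s) for s in sites)
mask = df[8].str.contains(pattern)

# Step 4: relabel only the matching rows; every row stays in the frame.
df.loc[mask, 1] = "lincRNA"
print(df[[1, 8]])
```

Because the dataframe keeps one row per input line, the number of rows written out stays equal to the number read in.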
    

EDIT: Use str.contains to check if a pandas column contains an element in the list.
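
For completeness, here is one way the whole task could be wired together, from FASTA headers to the updated GTF. It is a sketch, not a tested pipeline: `read_sites` and `relabel` are hypothetical helper names, the FASTA header is made up, and the header parsing simply mirrors the question's code (everything before the first `.`). Passing `quoting=csv.QUOTE_NONE` to `to_csv` is an assumption about the desired output, meant to keep the double quotes in column 8 from being re-quoted.

```python
import csv
import io
import re

import pandas as pd


def read_sites(fasta_lines):
    """Collect the part of each FASTA header before the first '.'."""
    return {
        line[1:].strip().split(".")[0]
        for line in fasta_lines
        if line.startswith(">")
    }


def relabel(gtf_in, gtf_out, sites, label="lincRNA"):
    """Set column 1 to label on rows whose attributes mention any site ID."""
    df = pd.read_csv(gtf_in, sep="\t", header=None, quoting=csv.QUOTE_NONE)
    if sites:  # an empty pattern would match every row
        pattern = "|".join(re.escape(s) for s in sorted(sites))
        df.loc[df[8].str.contains(pattern), 1] = label
    df.to_csv(gtf_out, sep="\t", header=False, index=False,
              quoting=csv.QUOTE_NONE)
    return df


# In-memory stand-ins for the real files (hypothetical FASTA header).
fasta = io.StringIO(">TCONS_00014347.1\nACGT\n")
gtf = io.StringIO(
    'Chr1\tCufflinks\texon\t28354206\t28354551\t.\t.\t.\t'
    'gene_id "XLOC_008369"; transcript_id "TCONS_00014347";\n'
    'Chr1\tCufflinks\texon\t28785549\t28786194\t.\t.\t.\t'
    'gene_id "XLOC_008370"; transcript_id "TCONS_00014348";\n'
)
out = io.StringIO()
relabel(gtf, out, read_sites(fasta))
print(out.getvalue())
```

With real data, the `io.StringIO` objects would be replaced by open file handles or file paths.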


5 Comments

This is awesome. When I read the gtf file, I get this error: "ValueError: Expected 18 fields in line 48, saw 19". What do you think I'm doing wrong?
@upendra Pandas expects every row to have the same number of columns; in this case it expects 18. However, it appears that line 48 has 19 columns. It may be best to open your file and see if there is an extra tab or semicolon.
You are right, there are several rows that have extra columns. Is there a way to deal with these extra columns?
@upendra I have edited my response to use another function for the matches. You no longer have to use semicolons as a separator, which will prevent additional columns from appearing.
It finally worked. Thanks a lot for all the help. Much appreciated.
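
When an "Expected N fields, saw M" error like the one above appears, it can help to find the offending lines before loading the file. The following is a rough sketch using only the standard library; `field_counts` is a hypothetical helper, the sample lines are made up, and it assumes the file is meant to be split on tabs only.

```python
def field_counts(lines, sep="\t"):
    """Map each field count to the 1-based line numbers that have it."""
    counts = {}
    for number, line in enumerate(lines, start=1):
        n = len(line.rstrip("\n").split(sep))
        counts.setdefault(n, []).append(number)
    return counts


# Made-up sample: the second line carries one extra tab-separated field.
sample = [
    'Chr1\tCufflinks\texon\t1\t2\t.\t.\t.\tgene_id "A";\n',
    'Chr1\tCufflinks\texon\t3\t4\t.\t.\t.\tgene_id "B";\textra\n',
]
print(field_counts(sample))  # {9: [1], 10: [2]}
```

For a real file, the list can be replaced by an open file handle, for example `field_counts(open("AthalianaslutteandluiN30merged.gtf"))`.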
