I have a dataframe that contains multiple columns as below...
Chr1 Cufflinks exon 28354206 28354551 . . . gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
Chr1 Cufflinks exon 28785549 28786194 . . . gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";
Chr1 Cufflinks exon 29328712 29329210 . . . gene_id "XLOC_008371"; transcript_id "TCONS_00014349"; exon_number "1"; oId "CUFF.2495.1"; class_code "u"; tss_id "TSS10075";
Chr1 Cufflinks exon 29427951 29428406 . . . gene_id "XLOC_008372"; transcript_id "TCONS_00014350"; exon_number "1"; oId "CUFF.2506.1"; class_code "u"; tss_id "TSS10076";
Chr1 Cufflinks exon 29460116 29460585 . . . gene_id "XLOC_008373"; transcript_id "TCONS_00014351"; exon_number "1"; oId "CUFF.2509.1"; class_code "u"; tss_id "TSS10077";
What I am trying to do is this: if any item in my list appears in one of the columns of the dataframe, I replace the value in the 2nd column, changing it from Cufflinks to lincRNA.
The problem is that the column I use as the dictionary key (the transcript ID) can have the same value in multiple rows of the dataframe. A dictionary keeps only one entry per unique key, so the output has fewer rows than the input.
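A minimal sketch of the problem, using made-up rows (the `rows` list and the `TCONS_A`/`TCONS_B` IDs are hypothetical, not taken from my file): keying a dictionary by transcript ID keeps only the last row seen for each ID.

```python
# Hypothetical rows: two exons that share one transcript ID, plus one other.
rows = [
    ["Chr1", "Cufflinks", "exon", "100", "200", "TCONS_A"],
    ["Chr1", "Cufflinks", "exon", "300", "400", "TCONS_A"],
    ["Chr1", "Cufflinks", "exon", "500", "600", "TCONS_B"],
]

result = {}
for row in rows:
    # A later row with the same key overwrites the earlier one.
    result[row[5]] = row

print(len(rows), "rows in,", len(result), "rows out")
```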
Here is my code so far...
#!/usr/bin/env python
file_in = open("lincRNA_final_transcripts.fa")
file_in2 = open("AthalianaslutteandluiN30merged.gtf")
file_out = open("updated.gtf", 'w')
sites = []
result = {}
for line in file_in:
    line = line.strip()
    if line.startswith(">"):
        line = line[1:]
        gene = str.split(line, ".")
        gene = gene[0]
        sites.append(gene)
for line2 in file_in2:
    line2 = line2.strip().split()
    line3 = str.split(line2[11], ";")
    line3 = line3[0]
    line3 = line3[1:-1]
    result[line3] = line2
for id in sites:
    id2 = str(id)
    if id2 in result.keys():
        result[id][1] = "lincRNA"
for val in result.values():
    file_out.write("\t".join(val))
    file_out.write("\n")
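One possible way to keep every row, sketched under assumptions rather than as a definitive fix: store the IDs in a set and rewrite the GTF line by line, so nothing is keyed by transcript ID. The helper names `read_ids` and `relabel` are hypothetical. The sketch assumes the whitespace-separated layout of the sample above (the quoted transcript ID is the 12th field) and that FASTA headers look like `>TCONS_00014347.1`. The sample rows are shortened and the second GTF row is invented to show a repeated transcript ID.

```python
def read_ids(fasta_lines):
    """Collect the part before the first '.' of every FASTA header line."""
    ids = set()
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            ids.add(line[1:].split(".")[0])
    return ids


def relabel(gtf_lines, ids, new_source="lincRNA"):
    """Yield one output line per input line, so no rows are lost."""
    for line in gtf_lines:
        fields = line.split()
        if len(fields) < 12:
            # Not a row in the expected layout: pass it through unchanged.
            yield line.rstrip("\n")
            continue
        # fields[11] looks like '"TCONS_00014347";' in the sample data.
        transcript_id = fields[11].rstrip(";").strip('"')
        if transcript_id in ids:
            fields[1] = new_source
        # Assumption: first 8 columns tab-separated, attributes space-separated.
        yield "\t".join(fields[:8]) + "\t" + " ".join(fields[8:])


# Made-up input for illustration only.
fasta = [">TCONS_00014347.1", "ACGT", ">TCONS_99999999.1", "TTGA"]
gtf = [
    'Chr1 Cufflinks exon 28354206 28354551 . . . gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1";',
    'Chr1 Cufflinks exon 28354600 28354700 . . . gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "2";',
    'Chr1 Cufflinks exon 28785549 28786194 . . . gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1";',
]
out = list(relabel(gtf, read_ids(fasta)))
for row in out:
    print(row)
```

With real files, the same functions could be fed the open file objects instead of the lists, writing each yielded line plus a newline to the output file.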