
I want to parse a large text file whose records are separated by a line containing only '//'. My input file looks like this:

ID   HRPA_ECOLI              Reviewed;        130 AA.
AC   P43329; P76861; P76863; P77479;
DE   RecName: Full=ATP-dependent RNA helicase HrpA;
DE            EC=3.6.4.13;
GN   Name=hrpA; OrderedLocusNames=b1413, JW5905;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
DR   RefSeq; NP_415931.4; NC_000913.3.
DR   RefSeq; WP_000139543.1; NZ_LN832404.1.
DR   ProteinModelPortal; P43329; -.
DR   KEGG; ecj:JW5905; -.
DR   KEGG; eco:b1413; -.
DR   PATRIC; 32118112; VBIEscCol129921_1476.
DR   KO; K03578; -.
DR   GO; GO:0005737; C:cytoplasm; IBA:GO_Central.
DR   GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR   Gene3D; 3.40.50.300; -; 2.
DR   InterPro; IPR003593; AAA+_ATPase.
DR   InterPro; IPR011545; DEAD/DEAH_box_helicase_dom.
DR   InterPro; IPR011709; DUF1605.
DR   Pfam; PF00270; DEAD; 1.
DR   Pfam; PF11898; DUF3418; 1.
DR   SMART; SM00382; AAA; 1.
DR   SMART; SM00487; DEXDc; 1.
DR   SMART; SM00847; HA2; 1.
DR   SMART; SM00490; HELICc; 1.
DR   SUPFAM; SSF52540; SSF52540; 1.
DR   TIGRFAMs; TIGR01967; DEAH_box_HrpA; 1.
DR   PROSITE; PS51192; HELICASE_ATP_BIND_1; 1.
DR   PROSITE; PS51194; HELICASE_CTER; 1.
PE   3: Inferred from homology;
KW   ATP-binding; Complete proteome; Helicase; Hydrolase;
KW   Nucleotide-binding; Reference proteome.
FT   CHAIN         1   1300       ATP-dependent RNA helicase HrpA.
FT                                /FTId=PRO_0000055178.
FT   DOMAIN       87    250       Helicase ATP-binding.
FT                                {ECO:0000255|PROSITE-ProRule:PRU00541}.
FT   DOMAIN      274    444       Helicase C-terminal.
SQ   SEQUENCE   1300 AA;  149028 MW;  A26601266D771638 CRC64;
     MTEQQKLTFT ALQQRLDSLM LRDRLRFSRR LHGVKKVKNP DAQQAIFQEM AKEIDQAAGK
     VLLREAARPE ITYPDNLPVS QKKQDILEAI RDHQVVIVAG ETGSGKTTQL PKICMELGRG
     IKGLIGHTQP 
//
ID   T1RK_ECOLI              Reviewed;        1170 AA.
AC   P08956; Q2M5W6;
DT   01-NOV-1988, integrated into UniProtKB/Swiss-Prot.
DT   24-NOV-2009, sequence version 3.

I also have an id.txt file in which each line holds a unique ID, for example:

NP_415931.4
...

I want to match each ID against the input file and, where it matches, extract certain information with regular expressions from that particular record of the input file and save it in an output CSV file. For example, for the pattern "GO:[0-9]+", I came up with:

#!/usr/bin/env python
import re
import pdb

def peon(DATA, LIST, OUTPUT, sentinel = '\n//', pattern = re.compile('GO:[0-9]+')):

    data = DATA.read()  
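    for item in LIST:
        find_me = item.strip()
        if not find_me:
            continue  # skip blank lines in the ID list (an empty string would loop forever)
        results = []  # matches collected from every record that mentions this ID
        j = 0
        while True:
            i = data.find(find_me, j)
            if i < 0:
                break
            j = data.find(sentinel, i)
            if j < 0:
                j = len(data)
            # search only from the ID up to the end of its record
            results.extend(pattern.findall(data[i:j]))
        OUTPUT.write('{}\t{}\n'.format(find_me, ', '.join(results)))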

def main(dataname, listname, outputname):
    with open(dataname, 'rt') as DATA:
        with open(listname, 'rt') as LIST:
            with open(outputname, 'wt') as OUTPUT:
                peon(DATA, LIST, OUTPUT)

if __name__ == '__main__':
    main('./input_file.txt', './id.txt', './output.csv')

and it gives me output like:

NP_415931.4 GO:0005737, GO:0005524

Now, the fields I want to extract are (Number, Header, Description):

1   RefSeq_ID   As given in id.txt file
2   AA_Length   In the line that starts with "ID" & ends with "AA."
3   Protein_Name    After "RecName: Full="
4   EC_Number   After "EC="
5   Organism    In the line that starts with "OS"
6   NCBI_Taxid_ID   After "NCBI_TaxID="
7   KEGG_ID After "KEGG;"
8   KO_ID   After "KO;"
9   GO_ID   As "GO:[NUMBER]"
10  InterPro_ID After "InterPro;"
11  InterPro_Description    After InterPro_ID , i.e, after 10
12  Pfam_ID After "Pfam;"
13  Pfam_Description    After Pfam_ID, i.e, after 12
14  PROSITE_ID  After "PROSITE;"
15  PROSITE_Description After PROSITE_ID, i.e, after 14

I have also attached a pic for better clarification:

input file

I want to extract all of those fields in one pass and save them in an output CSV file with specific headers. For example, I am extracting "AA_Length" by changing the regex to:

pattern = re.compile('[0-9]+ AA.') 

and it gives:

NP_415931.4 130 AA;

But it's not exactly the pattern that I need. Also, I'm not sure what the regex should be for matching text before or after a marker, or how to combine all of these in a single script.

How can I search all those patterns in a single script and save the output (with header) in a csv file?

Thank you

PS: I want the final output CSV to look like:

output file

My excel sheet is here: https://sites.google.com/site/iicbbioinformatics/share

  • Is the id from the list in the same position for each data set? Commented Nov 25, 2016 at 10:28
  • no.. ids can be random Commented Nov 25, 2016 at 12:12

1 Answer

Your data does not seem to have a fixed structure.

How about

  • reading the DATA,
  • splitting it at "\n//",
  • checking each data set for an ID from the list,
  • using one RegExp for each of the fields you need,
  • writing the fields in the appropriate order to the CSV file

?
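
As a minimal sketch of the "one RegExp per field" step: parentheses in a pattern form a capture group, so re.findall returns only the text inside them, while the surrounding literal text (such as "NCBI_TaxID=" and ";") merely anchors the match. The two sample lines are copied from the question; the variable names are only illustrative.

```python
import re

# Two sample lines in the same layout as the question's input file.
record = (
    "ID   HRPA_ECOLI              Reviewed;        130 AA.\n"
    "OX   NCBI_TaxID=83333;\n"
)

# Keep only the digits before " AA." on the line that starts with "ID".
aa_length = re.findall(r"^ID\s.*\s(\d+) AA\.$", record, re.MULTILINE)

# Keep only the digits between "NCBI_TaxID=" and ";".
tax_id = re.findall(r"NCBI_TaxID=(\d+);", record)

print(aa_length)  # ['130']
print(tax_id)     # ['83333']
```

Because findall returns a list, an empty list simply means the field is absent from that record, which is why the script joins the results instead of indexing them.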

Basically:

import re, csv

with open('./input_file.txt') as dfile:
    DATA = dfile.read()

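with open('./id.txt') as lfile:
    # Strip whitespace and skip blank lines so an empty string is never treated as an ID.
    IDS = [line.strip() for line in lfile if line.strip()]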

headers = ['RefSeq_ID',
           'AA_Length',
           'Protein_Name',
           'EC_Number',
           'Organism',
           'NCBI_Taxid_ID',
           'KEGG_ID',
           'KO_ID',
           'GO_ID',
           'InterPro_ID',
           'InterPro_Description',
           'Pfam_ID',
           'Pfam_Description',
           'PROSITE_ID',
           'PROSITE_Description'
           ]

# newline='' is what the csv module recommends for output files on Python 3.
ofile = open('./output.csv', 'w', newline='')
csvfile = csv.DictWriter(ofile, headers)

csvfile.writeheader()

for DATASET in DATA.split('\n//'):
    found_ids = {'RefSeq_ID': ""}

    for RefSeq_ID in IDS:
        if RefSeq_ID in DATASET:
            found_ids['RefSeq_ID'] = RefSeq_ID
            break
    if not found_ids['RefSeq_ID']:
        continue

    # Raw strings keep backslashes such as \d and \s intact; findall returns only the captured group.
    found_ids['AA_Length'] = ", ".join(re.findall(r'^ID.+\s+(\d+) AA\.$', DATASET, re.MULTILINE))
    found_ids['Protein_Name'] = ", ".join(re.findall(r'RecName: Full=(.+);', DATASET))
    found_ids['EC_Number'] = ", ".join(re.findall(r'EC=([\d.]+);', DATASET))
    found_ids['Organism'] = ", ".join(re.findall(r'^OS\s+(.*)\.$', DATASET, re.MULTILINE))
    found_ids['NCBI_Taxid_ID'] = ", ".join(re.findall(r'NCBI_TaxID=(\d+);', DATASET))
    found_ids['KEGG_ID'] = ", ".join(re.findall(r'KEGG; (\w+:\w+\d+);', DATASET))
    found_ids['KO_ID'] = ", ".join(re.findall(r'KO; (K\d+);', DATASET))
    found_ids['GO_ID'] = ", ".join(re.findall(r'GO; (GO:\d+);', DATASET))
    found_ids['InterPro_ID'] = ", ".join(re.findall(r'InterPro; (IPR\d+);', DATASET))
    # Match the ID explicitly so the captured description has no leading space.
    found_ids['InterPro_Description'] = ", ".join(re.findall(r'InterPro; IPR\d+; (.*)\.', DATASET))
    found_ids['Pfam_ID'] = ", ".join(re.findall(r'Pfam; (PF\d+);', DATASET))
    found_ids['Pfam_Description'] = ", ".join(re.findall(r'Pfam; PF\d+; (.*?);', DATASET))
    found_ids['PROSITE_ID'] = ", ".join(re.findall(r'PROSITE; (PS\d+);', DATASET))
    found_ids['PROSITE_Description'] = ", ".join(re.findall(r'PROSITE; PS\d+; (.*?);', DATASET))

    csvfile.writerow(found_ids)

ofile.close()
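
One possible refinement, sketched under assumptions rather than taken from the script above: the substring test "RefSeq_ID in DATASET" would also accept an ID that is merely a prefix of a longer accession, and separate findall calls for IDs and descriptions lose the pairing between them. The helper names refseq_ids and paired below are hypothetical, and the sketch assumes the "DR   <database>; ..." line layout shown in the question.

```python
import re

def refseq_ids(record):
    """Return the accessions listed on the record's 'DR   RefSeq;' lines."""
    ids = set()
    for line in record.splitlines():
        if line.startswith("DR   RefSeq;"):
            # Fields are separated by '; '; drop the trailing period first.
            fields = line[5:].rstrip().rstrip(".").split("; ")
            ids.update(fields[1:])
    return ids

def paired(db, record):
    """Return (ID, description) tuples for one cross-reference database."""
    # The description ends at the next ';' or at a '.' that closes the line.
    pattern = r"^DR   %s; ([^;\n]+); (.+?)(?:;|\.$)" % re.escape(db)
    return re.findall(pattern, record, re.MULTILINE)

# Sample lines copied from the question's input file.
record = (
    "DR   RefSeq; NP_415931.4; NC_000913.3.\n"
    "DR   InterPro; IPR003593; AAA+_ATPase.\n"
    "DR   InterPro; IPR011709; DUF1605.\n"
    "DR   Pfam; PF00270; DEAD; 1.\n"
)

wanted = {"NP_415931.4"}               # IDs read from id.txt
matched = wanted & refseq_ids(record)  # exact comparison, not a substring test

print(matched)                     # {'NP_415931.4'}
print(paired("InterPro", record))  # [('IPR003593', 'AAA+_ATPase'), ('IPR011709', 'DUF1605')]
print(paired("Pfam", record))      # [('PF00270', 'DEAD')]
```

Keeping the tuples together makes it straightforward to write the IDs and descriptions to their two CSV columns in the same order.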

9 Comments

could you help with the regular expression please?
@J.Carter: In your Excel screenshot you partly obscure your parsed data, so I have to guess.
I added the rest of them as they seem reasonable to me. I also merged multiple entries with commas and delimited the rows by tabs like you did. I hope you get the idea now.
Traceback (most recent call last): File "script.py", line 45, in <module> found_ids['NCBI_Taxid_ID'] = re.findall('NCBI_TaxID=(d+);', DATASET)[0] IndexError: list index out of range
NCBI_Taxid_ID, KEGG_ID, KO_ID, GO_ID, InterPro_ID InterPro_Description, Pfam_Description, PROSITE_ID and PROSITE_Description are not showing... but thank you very much sir for your precious help