Python regular expression implementation in this script

Question

I want to parse a large text file whose records are separated by a line containing only '//'. My input file looks like this:

ID   HRPA_ECOLI              Reviewed;        130 AA.
AC   P43329; P76861; P76863; P77479;
DE   RecName: Full=ATP-dependent RNA helicase HrpA;
DE            EC=3.6.4.13;
GN   Name=hrpA; OrderedLocusNames=b1413, JW5905;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
DR   RefSeq; NP_415931.4; NC_000913.3.
DR   RefSeq; WP_000139543.1; NZ_LN832404.1.
DR   ProteinModelPortal; P43329; -.
DR   KEGG; ecj:JW5905; -.
DR   KEGG; eco:b1413; -.
DR   PATRIC; 32118112; VBIEscCol129921_1476.
DR   KO; K03578; -.
DR   GO; GO:0005737; C:cytoplasm; IBA:GO_Central.
DR   GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR   Gene3D; 3.40.50.300; -; 2.
DR   InterPro; IPR003593; AAA+_ATPase.
DR   InterPro; IPR011545; DEAD/DEAH_box_helicase_dom.
DR   InterPro; IPR011709; DUF1605.
DR   Pfam; PF00270; DEAD; 1.
DR   Pfam; PF11898; DUF3418; 1.
DR   SMART; SM00382; AAA; 1.
DR   SMART; SM00487; DEXDc; 1.
DR   SMART; SM00847; HA2; 1.
DR   SMART; SM00490; HELICc; 1.
DR   SUPFAM; SSF52540; SSF52540; 1.
DR   TIGRFAMs; TIGR01967; DEAH_box_HrpA; 1.
DR   PROSITE; PS51192; HELICASE_ATP_BIND_1; 1.
DR   PROSITE; PS51194; HELICASE_CTER; 1.
PE   3: Inferred from homology;
KW   ATP-binding; Complete proteome; Helicase; Hydrolase;
KW   Nucleotide-binding; Reference proteome.
FT   CHAIN         1   1300       ATP-dependent RNA helicase HrpA.
FT                                /FTId=PRO_0000055178.
FT   DOMAIN       87    250       Helicase ATP-binding.
FT                                {ECO:0000255|PROSITE-ProRule:PRU00541}.
FT   DOMAIN      274    444       Helicase C-terminal.
SQ   SEQUENCE   1300 AA;  149028 MW;  A26601266D771638 CRC64;
     MTEQQKLTFT ALQQRLDSLM LRDRLRFSRR LHGVKKVKNP DAQQAIFQEM AKEIDQAAGK
     VLLREAARPE ITYPDNLPVS QKKQDILEAI RDHQVVIVAG ETGSGKTTQL PKICMELGRG
     IKGLIGHTQP 
//
ID   T1RK_ECOLI              Reviewed;        1170 AA.
AC   P08956; Q2M5W6;
DT   01-NOV-1988, integrated into UniProtKB/Swiss-Prot.
DT   24-NOV-2009, sequence version 3.

I also have an id.txt file where each line holds a unique id, like:

NP_415931.4
...

I want to look up each id in the input file and, if it is found, extract certain information with a regular expression (from that particular segment of the input file) and save it in an output csv file. For example, for the pattern "GO:[0-9]+", I came up with:

#!/usr/bin/env python
import re
import pdb

def peon(DATA, LIST, OUTPUT, sentinel = '\n//', pattern = re.compile('GO:[0-9]+')):

    data = DATA.read()  
    for item in LIST:
        find_me = item.strip()          
        j = 0
        while True:           
            i = data.find(find_me, j)
            if i < 0: 
                break
            j = data.find(sentinel, i)
            if j < 0: 
                j = len(data)
            result = pattern.findall(data[i:j])
        OUTPUT.write('{}\t{}\n'.format(find_me, ', '.join(result)))

def main(dataname, listname, outputname):
    with open(dataname, 'rt') as DATA:
        with open(listname, 'rt') as LIST:
            with open(outputname, 'wt') as OUTPUT:
                peon(DATA, LIST, OUTPUT)

if __name__ == '__main__':
    main('./input_file.txt', './id.txt', './output.csv')

and it gives me output like:

NP_415931.4 GO:0005737, GO:0005524

Now, the fields which I want to match are (Number<>Header<>Description):

1   RefSeq_ID              As given in id.txt file
2   AA_Length              In the line that starts with "ID" & ends with "AA."
3   Protein_Name           After "RecName: Full="
4   EC_Number              After "EC="
5   Organism               In the line that starts with "OS"
6   NCBI_Taxid_ID          After "NCBI_TaxID="
7   KEGG_ID                After "KEGG;"
8   KO_ID                  After "KO;"
9   GO_ID                  As "GO:[NUMBER]"
10  InterPro_ID            After "InterPro;"
11  InterPro_Description   After InterPro_ID, i.e., after 10
12  Pfam_ID                After "Pfam;"
13  Pfam_Description       After Pfam_ID, i.e., after 12
14  PROSITE_ID             After "PROSITE;"
15  PROSITE_Description    After PROSITE_ID, i.e., after 14

I have also attached a pic for better clarification:

I want to extract all those fields simultaneously and save them in an output csv file with specific headers. For example, I can extract "AA_Length" by changing the regex to:

pattern = re.compile('[0-9]+ AA.')

and it gives:

NP_415931.4 130 AA;

But it's not exactly the pattern that I need. Also, I'm not sure what the regex for "before/after" matching should be, or how to implement all of them in a single script.

How can I search all those patterns in a single script and save the output (with header) in a csv file?

Thank you

PS: I want the final output csv to look like:

My excel sheet is here: https://sites.google.com/site/iicbbioinformatics/share

Is the id from the list in the same position for each data set? – Robin Koch, Commented Nov 25, 2016 at 10:28

Robin Koch · Accepted Answer · 2016-11-25 15:20:27Z

Your data does not seem to have a fixed structure.

How about

reading the DATA,
splitting it at "\n//",
checking each data set for an ID from the list,
using one RegExp for each of the fields you need,
writing the fields in the appropriate order to the csv file

?

Basically:

import re, csv

with open('./input_file.txt') as dfile:
    DATA = dfile.read()

# One ID per line; drop blank lines, because '' is "in" every string
with open('./id.txt') as lfile:
    IDS = [line.strip() for line in lfile if line.strip()]

headers = ['RefSeq_ID',
           'AA_Length',
           'Protein_Name',
           'EC_Number',
           'Organism',
           'NCBI_Taxid_ID',
           'KEGG_ID',
           'KO_ID',
           'GO_ID',
           'InterPro_ID',
           'InterPro_Description',
           'Pfam_ID',
           'Pfam_Description',
           'PROSITE_ID',
           'PROSITE_Description'
           ]

# newline='' (Python 3) lets the csv module handle line endings itself
with open('./output.csv', 'w', newline='') as ofile:
    csvfile = csv.DictWriter(ofile, headers)
    csvfile.writeheader()

    # Each record ends with a line containing only '//'
    for DATASET in DATA.split('\n//'):
        found_ids = {'RefSeq_ID': ""}

        # Keep the record only if it mentions one of the wanted IDs
        for RefSeq_ID in IDS:
            if RefSeq_ID in DATASET:
                found_ids['RefSeq_ID'] = RefSeq_ID
                break
        if not found_ids['RefSeq_ID']:
            continue

        # Raw strings keep \d, \s and \. intact; multiple hits are joined with ", "
        found_ids['AA_Length'] = ", ".join(re.findall(r'^ID.+\s+(\d+) AA\.$', DATASET, re.MULTILINE))
        found_ids['Protein_Name'] = ", ".join(re.findall(r'RecName: Full=(.+);', DATASET))
        found_ids['EC_Number'] = ", ".join(re.findall(r'EC=([\d\.]+);', DATASET))
        found_ids['Organism'] = ", ".join(re.findall(r'^OS\s+(.*)\.$', DATASET, re.MULTILINE))
        found_ids['NCBI_Taxid_ID'] = ", ".join(re.findall(r'NCBI_TaxID=(\d+);', DATASET))
        found_ids['KEGG_ID'] = ", ".join(re.findall(r'KEGG; (\w+:\w+\d+);', DATASET))
        found_ids['KO_ID'] = ", ".join(re.findall(r'KO; (K\d+);', DATASET))
        found_ids['GO_ID'] = ", ".join(re.findall(r'GO; (GO:\d+);', DATASET))
        found_ids['InterPro_ID'] = ", ".join(re.findall(r'InterPro; (IPR\d+);', DATASET))
        # Anchor on the ID so the description has no leading space
        found_ids['InterPro_Description'] = ", ".join(re.findall(r'InterPro; IPR\d+; (.*)\.', DATASET))
        found_ids['Pfam_ID'] = ", ".join(re.findall(r'Pfam; (PF\d+);', DATASET))
        found_ids['Pfam_Description'] = ", ".join(re.findall(r'Pfam; PF\d+; (.*?);', DATASET))
        found_ids['PROSITE_ID'] = ", ".join(re.findall(r'PROSITE; (PS\d+);', DATASET))
        found_ids['PROSITE_Description'] = ", ".join(re.findall(r'PROSITE; PS\d+; (.*?);', DATASET))

        csvfile.writerow(found_ids)
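
The "after X" fields in the question all come down to capture groups: re.findall returns only the text inside the parentheses, so the literal prefix (for example "EC=") is matched but not reported. A minimal sketch of that idea, using a few lines copied from the sample record in the question; the variable names are only illustrative and the patterns are written as raw strings:

```python
import re

# A few lines copied from the sample record in the question
record = """\
DE   RecName: Full=ATP-dependent RNA helicase HrpA;
DE            EC=3.6.4.13;
OS   Escherichia coli (strain K12).
DR   Pfam; PF00270; DEAD; 1.
DR   Pfam; PF11898; DUF3418; 1.
"""

# One group: findall returns just the captured text after "EC="
ec_numbers = re.findall(r'EC=([\d.]+);', record)

# Two groups: findall returns (ID, description) tuples, one per line
pfam_pairs = re.findall(r'Pfam; (PF\d+); (.*?);', record)

# ^ and $ only match at line boundaries when re.MULTILINE is set
organism = re.findall(r'^OS\s+(.*)\.$', record, re.MULTILINE)

print(ec_numbers)   # ['3.6.4.13']
print(pfam_pairs)   # [('PF00270', 'DEAD'), ('PF11898', 'DUF3418')]
print(organism)     # ['Escherichia coli (strain K12)']
```

Capturing the ID and its description in one pattern keeps the two in step, which may be preferable if the ID and description columns need to line up entry by entry.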

edited Nov 25, 2016 at 15:20

answered Nov 25, 2016 at 10:34

Robin Koch


9 Comments

J.Carter Over a year ago

could you help with the regular expression please?

Robin Koch Over a year ago

@J.Carter: In your Excel screenshot you partly obscure your parsed data, so I have to guess.

Robin Koch Over a year ago

I added the rest of them as seemed reasonable to me. I also merged multiple entries with commas and delimited the rows by tabs like you did. I hope you get the idea now.

J.Carter Over a year ago

Traceback (most recent call last): File "script.py", line 45, in <module> found_ids['NCBI_Taxid_ID'] = re.findall('NCBI_TaxID=(d+);', DATASET)[0] IndexError: list index out of range

J.Carter Over a year ago

NCBI_Taxid_ID, KEGG_ID, KO_ID, GO_ID, InterPro_ID, InterPro_Description, Pfam_Description, PROSITE_ID and PROSITE_Description are not showing... but thank you very much sir for your precious help
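
The traceback quoted in the comments shows the pattern as 'NCBI_TaxID=(d+);'. Whether the backslash was missing in the actual script or was only lost when the comment was posted is not known, but if it was missing, that alone could produce both symptoms: 'd+' matches the letter d rather than digits, findall returns an empty list, and indexing it with [0] raises IndexError. A small sketch of the difference; first_or_empty is a hypothetical helper, not part of the answer:

```python
import re

# One line copied from the sample record in the question
line = "OX   NCBI_TaxID=83333;"

# 'd+' means "one or more letter d", so nothing is found
no_backslash = re.findall('NCBI_TaxID=(d+);', line)

# r'\d+' means "one or more digits"
with_backslash = re.findall(r'NCBI_TaxID=(\d+);', line)


def first_or_empty(pattern, text):
    """Hypothetical helper: first match, or '' when there is none."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else ''


print(no_backslash)                                  # []
print(with_backslash)                                # ['83333']
print(first_or_empty(r'NCBI_TaxID=(\d+);', line))    # 83333
print(repr(first_or_empty('NCBI_TaxID=(d+);', line)))  # ''
```

Joining the findall result, as the answer's final version does, sidesteps the IndexError in the same way, since joining an empty list simply gives an empty string.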
