I'm relatively new to Python and need some advice for a bioinformatics project. It's about converting certain enzyme IDs to others.
What I have already done, and what works, is fetching the HTML for a list of IDs from the Rhea database:
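import pycurl

url2 = "http://www.rhea-db.org/reaction?id=16952"

# pycurl writes raw bytes, so the file is opened in binary mode;
# the with-block closes it once the transfer is done
with open("xml_tempfile2.txt", "wb") as f_xml2:
    fetch2 = pycurl.Curl()
    fetch2.setopt(fetch2.URL, url2)
    fetch2.setopt(fetch2.WRITEDATA, f_xml2)
    fetch2.perform()
    fetch2.close()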
So the HTML is saved to a temporary text file (I know, possibly not the most elegant way to do it, but it works for me ;).
Now what I am interested in is the following part from the HTML:
<p>
<h3>Same participants, different directions</h3>
<div>
<a href="./reaction?id=16949"><span>RHEA:16949</span></a>
<span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16950"><span>RHEA:16950</span></a>
<span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16951"><span>RHEA:16951</span></a>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
</div>
</p>
I want to go through the HTML until the class "icon-arrow-right" is reached (this class name is unique in the HTML). Then I want to extract the "RHEA:XXXXXX" ID from the line above. So in this example, I want to end up with 16950.
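To make the intended logic concrete, here is a rough sketch using a regular expression. It assumes the markup always has exactly the layout shown above (the RHEA link immediately followed by the "icon-arrow-right" span); the names `sample_html` and `find_forward_rhea_id` are made up for illustration. It is brittle, which is why I'd prefer a proper parser-based solution.

```python
import re

# Sample markup copied from the snippet above (hypothetical variable name).
sample_html = """
<div>
<a href="./reaction?id=16949"><span>RHEA:16949</span></a>
<span class="icon-question">myo-inositol + NAD(+) <?> scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16950"><span>RHEA:16950</span></a>
<span class="icon-arrow-right">myo-inositol + NAD(+) => scyllo-inosose + H(+) + NADH</span>
</div><div>
<a href="./reaction?id=16951"><span>RHEA:16951</span></a>
<span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH => myo-inositol + NAD(+)</span>
</div>
"""

# Match "RHEA:<digits>" only when the link is directly followed by the
# span carrying the class "icon-arrow-right" (whitespace in between allowed).
FORWARD_ID = re.compile(
    r'RHEA:(\d+)</span></a>\s*<span class="icon-arrow-right">'
)


def find_forward_rhea_id(html_text):
    """Return the numeric RHEA ID (as a string) or None if nothing matches."""
    match = FORWARD_ID.search(html_text)
    return match.group(1) if match else None


print(find_forward_rhea_id(sample_html))  # prints 16950
```

For the real page, the same function could be called on the contents of the temporary file instead of `sample_html`.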
Is there a simple way to do this? I've already experimented with HTMLParser but couldn't get it to look for a certain class and then give me the ID from the line above.
Thank you very much in advance!