Extracting specific information from fetched HTML code using Python

Question

I'm relatively new to Python and need some advice for a bioinformatics project. It's about converting certain enzyme IDs to others.

What I have already done, and what works, is fetching the HTML code for a list of IDs from the Rhea database:

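 import pycurl

 url2 = "http://www.rhea-db.org/reaction?id=16952"

 # pycurl writes bytes, so open the temporary file in binary mode;
 # the with-block closes (and flushes) it once the transfer is done.
 with open("xml_tempfile2.txt", "wb") as f_xml2:
     fetch2 = pycurl.Curl()
     fetch2.setopt(fetch2.URL, url2)
     fetch2.setopt(fetch2.WRITEDATA, f_xml2)
     fetch2.perform()
     fetch2.close()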

So the HTML code is saved to a temporary txt file (I know, possibly not the most elegant way to do stuff, but it works for me ;).

Now what I am interested in is the following part from the HTML:

        <p>
            <h3>Same participants, different directions</h3>
            <div>
                <a href="./reaction?id=16949"><span>RHEA:16949</span></a>
                <span class="icon-question">myo-inositol + NAD(+) &lt;?&gt; scyllo-inosose + H(+) + NADH</span>
            </div><div>
                <a href="./reaction?id=16950"><span>RHEA:16950</span></a>
                <span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
            </div><div>
                <a href="./reaction?id=16951"><span>RHEA:16951</span></a>
                <span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH =&gt; myo-inositol + NAD(+)</span>
            </div>
        </p>

I want to go through the code until the class "icon-arrow-right" is reached (this expression is unique in the HTML). Then I want to extract the information of "RHEA:XXXXXX" from the line above. So in this example, I want to end up with 16950.

Is there a simple way to do this? I've already experimented with HTMLParser but couldn't get it to look for a certain class and then give me the ID from the line above.

Thank you very much in advance!

Sede · Accepted Answer · 2016-04-16 07:29:50Z


You can use an HTML parser like BeautifulSoup to do this:

>>> from bs4 import BeautifulSoup
>>> html = """ <p>
...             <h3>Same participants, different directions</h3>
...             <div>
...                 <a href="./reaction?id=16949"><span>RHEA:16949</span></a>
...                 <span class="icon-question">myo-inositol + NAD(+) &lt;?&gt; scyllo-inosose + H(+) + NADH</span>
...             </div><div>
...                 <a href="./reaction?id=16950"><span>RHEA:16950</span></a>
...                 <span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
...             </div><div>
...                 <a href="./reaction?id=16951"><span>RHEA:16951</span></a>
...                 <span class="icon-arrow-left-1">scyllo-inosose + H(+) + NADH =&gt; myo-inositol + NAD(+)</span>
...             </div>
...         </p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('span', class_='icon-arrow-right').find_previous_sibling().get_text()
'RHEA:16950'
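
Since the question asks for the bare number (16950) rather than the full 'RHEA:16950' label, the lookup above could be wrapped as follows. This is only a sketch: the function name `forward_rhea_id` and the shortened `sample` markup are made up for illustration, and it assumes the ID always appears as `RHEA:<number>` inside the `<a>` that precedes the arrow span.

```python
from bs4 import BeautifulSoup


def forward_rhea_id(html):
    """Return the number after 'RHEA:' for the 'icon-arrow-right' entry, or None."""
    soup = BeautifulSoup(html, "html.parser")
    arrow = soup.find("span", class_="icon-arrow-right")
    if arrow is None:
        return None  # class not present in this page
    # Ask explicitly for the preceding <a> sibling instead of "any" sibling.
    link = arrow.find_previous_sibling("a")
    if link is None:
        return None
    # 'RHEA:16950' -> ('RHEA', ':', '16950')
    _, _, number = link.get_text(strip=True).partition(":")
    return number or None


# Hypothetical, shortened version of the markup from the question.
sample = """
<div>
    <a href="./reaction?id=16950"><span>RHEA:16950</span></a>
    <span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
</div>
"""
print(forward_rhea_id(sample))  # prints 16950
```

To use it on the temporary file from the question, the file's contents can be read into a string and passed in as `html`.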


3 Comments

fumarat Over a year ago

Thank you very much! Works perfectly!

fumarat Over a year ago

@user310015 So I guess find_previous_sibling() works in this context because the RHEA:ID is also in a <span>?

Sede Over a year ago

@fumarat Not at all. As you can see, "RHEA:ID" is inside an <a> element. Here find_previous_sibling() returns an element that precedes the element with class "icon-arrow-right" under the same parent.
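
A small check may make the point in this comment more concrete. The snippet below is an illustrative sketch on a shortened copy of the markup (the variable names are arbitrary): the sibling of the arrow span is the `<a>` element, while the `<span>` holding the ID is a child of that `<a>`, so filtering the siblings by `"span"` finds nothing.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <a href="./reaction?id=16950"><span>RHEA:16950</span></a>
    <span class="icon-arrow-right">myo-inositol + NAD(+) =&gt; scyllo-inosose + H(+) + NADH</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
arrow = soup.find("span", class_="icon-arrow-right")

# The preceding sibling under the same <div> is the <a> element.
link = arrow.find_previous_sibling("a")
print(link.name)             # a

# The ID text lives in a <span> nested inside that <a>.
print(link.span.get_text())  # RHEA:16950

# No <span> is a preceding sibling of the arrow span itself.
print(arrow.find_previous_sibling("span"))  # None
```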
