Parsing HTML data using lxml

Question

<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul>

I want only the q's. I tried this:

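import lxml.html as lh  # assuming lh is the usual alias for lxml.html

# resp is the already-opened HTTP response for the page
doc = lh.fromstring(resp.read())
for el in doc.cssselect('div.mod-content'):
    print(el.text_content())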

This gives me the q's, but it also gives me other details on the page that have the class mod-content. How do I get only the q's?

I am using lxml.

<div id="peoplemodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">People</h3>
    </div>
    <div class="mod-content">
        <ul class="item-details" id="peopledetails">
            <li class="people-details">
                                <dl>
                    <dt>Assignee:</dt>
                    <dd id="Assign-Val">
                                <a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA </a>
                    </dd>
                </dl>
                                                <dl>
                    <dt>Reporter:</dt>
                    <dd id="Report-Val">
                                <a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
                    </dd>
                </dl>
                                <dl><dt>&nbsp;</dt><dd>&nbsp;</dd></dl>
                                <dl>
                    <dt title="Multiple Assignees">Multiple Assignees:</dt>
                    <dd id="customfield_10020-val">    <div class="shorten" id="customfield_10020-field">
                                    <span class="tinylink">        <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href="------------------">FFFFFFFFFFFFFF</a></span>,                                                 <span class="tinylink">        <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span>                        </div>
</dd>
                </dl>
                            </li>
        </ul>
                        <div id="watchers-val">
                                                <a href="----------------------------------------" id="watching-toggle" rel="858270" title="Start watching this story"><span class="icon icon-watch-off"></span><span class="action-text">Watch</span></a>


                            (<span id="watcher-data">1</span>)
                    </div>
            </div>
</div>

What "other details"? There are only q's in the snippet you shared. Also, the answer very much depends on the source of the particular website.
– Avaris, Commented Dec 13, 2011 at 7:39
I forgot to mention: this snippet is a small part of the web page, and the mod-content class is used elsewhere too, so printing outputs the other values as well.
– Vinod K, Commented Dec 13, 2011 at 8:14
As I said, it depends on the website and the content you are interested in. You need to be specific enough about the content you want. For example, if this is the only div that you want, you can select it by its id, since that is supposed to be unique.
– Avaris, Commented Dec 13, 2011 at 8:37
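
Selecting by id, as suggested in the comment above, could look roughly like the sketch below. It assumes that the id issue-description is unique on the real page; the page string is a shortened, hypothetical stand-in for the actual site.

```python
import lxml.html as lh

# Hypothetical, shortened page: two blocks share the class "mod-content",
# but only one of them carries the id "issue-description".
page = '''<html><body>
<div id="descriptionmodule" class="module toggle-wrap">
  <div id="issue-description" class="mod-content">
    <p>qqq,<br/>qqqq.</p>
    <ul class="alternate"><li>qq</li></ul>
  </div>
</div>
<div id="peoplemodule" class="module toggle-wrap">
  <div class="mod-content"><dl><dt>Assignee:</dt><dd>AAAA</dd></dl></div>
</div>
</body></html>'''

doc = lh.fromstring(page)

# get_element_by_id raises KeyError if no element has that id.
description = doc.get_element_by_id('issue-description')

# text_content() returns the text of the element and all its descendants.
text = description.text_content()
print(text)
```

Because the lookup is by id rather than by class, the Assignee block is never visited, even though it uses the same mod-content class.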

mac · Accepted Answer · 2011-12-13 08:14:56Z

First off: if you are parsing HTML, there is a high chance that humans have messed it up and that it won't validate. This is the case for the example you posted (a couple of closing </div> tags are missing). Consider switching to BeautifulSoup instead, which is specifically designed to cope with this kind of error.
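
As a small illustration of the malformed-markup problem, the sketch below feeds a snippet with one missing </div> to lxml's strict XML parser and to lxml.html. The broken string is a made-up miniature of the posted markup, not the real page; on input like this, lxml.html also recovers, so it may be worth trying before switching libraries.

```python
from lxml import etree
import lxml.html as lh

# Hypothetical miniature of the posted markup: the outer </div> is missing.
broken = '<div id="outer"><div id="issue-description"><p>qqq</p></div>'

# The strict XML parser rejects the unclosed element.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# The HTML parser closes the dangling element and carries on.
doc = lh.fromstring(broken)
recovered = doc.get_element_by_id('issue-description').text_content()

print(strict_ok)
print(recovered)
```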

That said, if your question is just about how to extract the "textual part of the HTML", or in other words how to convert HTML → plain text [as opposed to "extracting only the text contained in specific HTML containers"], this is a minimal working example:

from lxml import etree

content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''

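# Parse the snippet (it must be well-formed for etree.fromstring)
tree = etree.fromstring(content)

# Walk every text node in the document and print the non-blank ones
for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print(bit)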

It outputs:

Description
qqqqqqqqqqqqq,

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
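
If only the text inside one container is wanted, the same XPath approach can be narrowed with an attribute predicate. The following is a sketch that assumes the wanted block is the one with id issue-description; the content string is a shortened, hypothetical version of the snippet above.

```python
from lxml import etree

# Shortened, hypothetical version of the snippet from the question.
content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>first,<br/>
second</p>
        <ul><li>third</li></ul>
    </div>
</div>'''

tree = etree.fromstring(content)

# Only text nodes below the div with the given id are selected,
# so the "Description" heading is left out.
bits = [bit.strip()
        for bit in tree.xpath('//div[@id="issue-description"]//text()')
        if bit.strip()]
print(bits)
```

Stripping each text node here also removes the leading newline that follows a <br/>, which is what produces the blank lines in the output shown above.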

HTH!


2 Comments

Vinod K Over a year ago

Hi Mac, thanks for your answer. I edited my question. In that scenario, the XPath expression can be refined further to meet the necessary conditions, right? I need the text from it again. It gives an error; is that because of the structure of the page?

mac Over a year ago

@VinodK - Can you clarify your question a bit? If you are trying to match only certain tags of your document, you could use something like print(tree.find(".//h3").text) [in the example provided in my answer, this would print "Description"]... but as Avaris pointed out in the comments, it's up to you to identify the unique characteristic of the document node you want to extract...
