<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul>

I want only the q's. I tried this:

import lxml.html as lh

doc = lh.fromstring(resp.read())
for div in doc.cssselect('div.mod-content'):
    print(div.text_content())

This gives me the q's, but it also gives me other details on the page that have the class mod-content. How do I get only the q's?

I am using lxml.

<div id="peoplemodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">People</h3>
    </div>
    <div class="mod-content">
        <ul class="item-details" id="peopledetails">
            <li class="people-details">
                                <dl>
                    <dt>Assignee:</dt>
                    <dd id="Assign-Val">
                                <a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA a>
                    </dd>
                </dl>
                                                <dl>
                    <dt>Reporter:</dt>
                    <dd id="Report-Val">
                                <a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
                    </dd>
                </dl>
                                <dl><dt>&nbsp;</dt><dd>&nbsp;</dd></dl>
                                <dl>
                    <dt title="Multiple Assignees">Multiple Assignees:</dt>
                    <dd id="customfield_10020-val">    <div class="shorten" id="customfield_10020-field">
                                    <span class="tinylink">        <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href------------------">FFFFFFFFFFFFFF</a></span>,                                                 <span class="tinylink">        <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span>                        </div>
</dd>
                </dl>
                            </li>
        </ul>
                        <div id="watchers-val">
                                                <a href="----------------------------------------" id="watching-toggle" rel="858270" title="Start watching this story"><span class="icon icon-watch-off"></span><span class="action-text">Watch</span></a>


                            (<span id="watcher-data">1</span>)
                    </div>
            </div>
</div>
  • What "other details"? There are only q's in the snippet you shared. Also, the answer depends very much on the source of the particular website. Commented Dec 13, 2011 at 7:39
  • I forgot to mention: this snippet is a small part of the webpage, and the mod-content class is used elsewhere too, so printing also outputs those other values. Commented Dec 13, 2011 at 8:14
  • As I said, it depends on the website and the content you are interested in. You need to select the content with sufficient specificity. For example, if this is the only div that you want, you can select it by its id, since an id is supposed to be unique. Commented Dec 13, 2011 at 8:37
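The suggestion in the last comment, selecting by id instead of by class, could look roughly like the sketch below. The `page` string is a made-up, shortened stand-in for the real page, and `find_class` plus an XPath id test is only one of several ways to write this with lxml.

```python
import lxml.html as lh

# Hypothetical, shortened stand-in for the page: two blocks share the
# class "mod-content", but only one carries the id "issue-description".
page = '''
<div>
  <div id="issue-description" class="mod-content"><p>qqq</p><ul><li>qqqq</li></ul></div>
  <div class="mod-content"><dl><dt>Assignee:</dt><dd>AAAA</dd></dl></div>
</div>
'''.strip()

doc = lh.fromstring(page)

# Selecting by class matches both blocks, which is the problem described.
by_class = doc.find_class('mod-content')

# Selecting by id matches only the description block.
by_id = doc.xpath('//div[@id="issue-description"]')

description_text = by_id[0].text_content().strip()
print(len(by_class), len(by_id))
print(description_text)
```

If no id is available on the real page, the same idea applies to whatever else is unique about the block, for example its parent element.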

1 Answer

First off: if you are parsing HTML, there is a high chance that humans have messed it up and that it won't validate. This is the case for the example you posted (a couple of </div> tags are missing). Consider using BeautifulSoup instead, which is specifically designed to cope with this kind of error.
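As a side note on malformed markup, the sketch below contrasts lxml's strict XML parser with its HTML parser, which also recovers from missing closing tags. The `broken` fragment is a made-up example, not the page from the question, and how well recovery works on a real page depends on how badly the markup is damaged.

```python
from lxml import etree
import lxml.html as lh

# Hypothetical fragment with one closing </div> missing.
broken = '<div class="module"><div class="mod-content"><p>qqq</p></div>'

# The strict XML parser rejects the unbalanced markup.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers and closes the open element itself.
doc = lh.fromstring(broken)
text = doc.text_content()

print(xml_ok)
print(text)
```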

That said, if your question is just about how to extract the "textual part of the HTML", in other words how to convert HTML → plain text (as opposed to "extracting only the text contained in specific HTML containers"), this is a minimal working example:

from lxml import etree

content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''

tree = etree.fromstring(content)

for bit in tree.xpath('//text()'):
    if bit.strip():  # skip whitespace-only text nodes; any other test can go here
        print(bit)

It outputs:

Description
qqqqqqqqqqqqq,

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

HTH!


2 Comments

Hi Mac, thanks for your answer. I edited my question. In that scenario, the XPath expression can be modified further to meet the necessary conditions, right? I need the text from it again. It gives an error; is that because of the structure of the page?
@VinodK - Can you clarify your question a bit? If you are trying to match only certain tags of your document, you could use something like print(tree.find(".//h3").text) (which, in the example provided in my answer, would return "Description"). But as Avaris pointed out in the comments, it's up to you to identify the unique characteristic of the document leaf you want to extract.
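A minimal sketch of both ideas from this reply, matching a single tag with `find` and narrowing the `text()` query to one container. The `content` string is a shortened version of the answer's example, and the id test assumes the description block keeps its `issue-description` id on the real page.

```python
from lxml import etree

# Shortened, well-formed version of the example from the answer.
content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqq,<br/>
qqqq.</p>
        <ul class="alternate" type="square">
            <li>qqqqq</li>
        </ul>
    </div>
</div>'''

tree = etree.fromstring(content)

# Match a single tag: the first <h3> anywhere below the root.
heading = tree.find(".//h3").text

# Restrict the text() query to the container with the unique id,
# then drop whitespace-only nodes.
bits = [
    bit.strip()
    for bit in tree.xpath('//div[@id="issue-description"]//text()')
    if bit.strip()
]

print(heading)
print(bits)
```

The heading is left out of `bits` because it sits in the sibling mod-header block, outside the selected container.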
