1

I have the following HTML:

<h2>blah</h2>
html content to extract 
(here can come tags, nested structures too, but no top-level h2)
<h2>other blah</h2>

Can I extract the content between the two headings without using string.split("<h2>") in Python?
(Say, with BeautifulSoup or with some other library?)

3 Answers

1

Here is some test code using HTQL from http://htql.net:

sample="""<h2>blah</h2>
        html content to extract 
        <div>test</div>
        <h2>other blah<h2>
    """

import htql
htql.query(sample, "<h2 sep excl>2")
# [('\n        html content to extract \n        <div>test</div>\n        ',)]

htql.query(sample, "<h2 sep> {a=<h2>:tx; b=<h2 sep excl>2 | a='blah'} ")
# [('blah', '\n        html content to extract \n        <div>test</div>\n        ')]

1

With BeautifulSoup, use the .next_siblings iterable to get to text following a tag:

>>> from bs4 import BeautifulSoup, NavigableString
>>> from itertools import takewhile
>>> sample = '<h2>blah</h2>\nhtml content to extract\n<h2>other blah</h2>'
>>> soup = BeautifulSoup(sample, 'html.parser')  # name the parser explicitly
>>> print(''.join(takewhile(lambda e: isinstance(e, NavigableString), soup.h2.next_siblings)))

html content to extract

This collects the consecutive text nodes that directly follow the soup.h2 element, stopping at the first sibling that is a tag, and joins them into one string.
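If the content between the headings can contain tags, as the question allows, the NavigableString test stops at the first of them. A minimal sketch of a variant that stops only at the next sibling tag with the same name follows; the helper name `content_until_next` and the extended sample are assumptions made for illustration, and the `html.parser` backend is assumed.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag


def content_until_next(tag):
    """Return the markup between `tag` and its next sibling tag of the same name."""
    def is_not_same_tag(elem):
        # Text nodes and differently named tags are kept; a same-named tag ends the run.
        return not (isinstance(elem, Tag) and elem.name == tag.name)

    return ''.join(str(e) for e in takewhile(is_not_same_tag, tag.next_siblings))


sample = '<h2>blah</h2>\nhtml content to extract\n<div>test</div>\n<h2>other blah</h2>'
soup = BeautifulSoup(sample, 'html.parser')
chunk = content_until_next(soup.h2)
print(chunk)
```

Here `str()` serializes nested tags back to markup, so the `<div>` is part of the extracted chunk.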

2 Comments

+1, ok, so basically I do a next_sibling iteration until I reach the next h2?
Exactly; the takewhile() call here loops over siblings as long as they are instances of NavigableString, and the next sibling that is not a string is the H2 tag.
0

Let me share a slightly more robust solution:

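import bs4

def get_chunk_after_tag(tag):
    """Return the markup between tag and the next sibling tag of the same name.

    tag is a tag element in a bs4 soup.
    """
    result = ''
    for elem in tag.next_siblings:
        # Stop at the next sibling tag with the same name (e.g. the next <h2>).
        if isinstance(elem, bs4.Tag) and elem.name == tag.name:
            break
        # Text nodes and nested tags alike are serialized back to markup.
        result += str(elem)
    return result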

This extracts everything from one <hX> tag up to the next <hX> tag of the same name. It is easily modified to extract the content between one tag and a different tag.
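As a rough illustration of that modification, here is a sketch in which the stopping tag is passed in by name. The function name `get_chunk_until`, its `stop_name` parameter, and the sample markup are hypothetical and not part of the answer above; the `html.parser` backend is assumed.

```python
import bs4


def get_chunk_until(tag, stop_name):
    """Return the markup after `tag` up to the next sibling tag named `stop_name`."""
    parts = []
    for elem in tag.next_siblings:
        # Stop at the first sibling tag whose name matches the requested one.
        if isinstance(elem, bs4.Tag) and elem.name == stop_name:
            break
        parts.append(str(elem))
    return ''.join(parts)


html = '<h2>blah</h2>\nintro <b>text</b>\n<h3>sub</h3>\nmore\n<h2>other blah</h2>'
soup = bs4.BeautifulSoup(html, 'html.parser')

up_to_h3 = get_chunk_until(soup.h2, 'h3')   # stops at the <h3>
up_to_h2 = get_chunk_until(soup.h2, 'h2')   # same behavior as the original function
print(repr(up_to_h3))
print(repr(up_to_h2))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which only matters for long sibling runs.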
