0

I have HTML data and I want to get all the text between the <p> tags and put it into dataframes for further processing.

But I only want the text in the <p> tags that are between these tags:

            <div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>

Using BeautifulSoup I can get the text between all the <p> tags easily enough. But as I said, I don't want it unless it is between those tags.

2
  • Can you please post a minimal reproducible example? Commented Jan 5, 2019 at 1:40
  • What code with BeautifulSoup have you tried? Do you only want text from elements with the specific class "someclass"? Commented Jan 5, 2019 at 1:46

3 Answers

2

If you want only the text inside tags that have a specific class, BeautifulSoup lets you filter on that class with the attrs argument of find_all:

html = '''<div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all('div', attrs={'class': 'someclass'})

for tag in tags:
    print(tag.text.strip())

output:

some text
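
Note that `tag.text` on the div returns the text of every descendant, not only the `<p>` tags. If the div may hold other elements, a minimal sketch that keeps just the paragraph text could look like this (the sample HTML and the name `paragraphs` are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative HTML: one matching div with mixed content, plus an unrelated <p>
html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>
          </div>
          <p>outside the div</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Collect only the <p> text found inside div.someclass
paragraphs = [
    p.get_text(strip=True)
    for div in soup.find_all('div', attrs={'class': 'someclass'})
    for p in div.find_all('p')
]
print(paragraphs)
```

The result is a plain list of strings, which is easy to hand to whatever does the further processing.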

1

In case you need a table-specific solution, I would try something like this (daveedwards' answer is better if you don't):

from bs4 import BeautifulSoup

# Assumes `browser` is an existing Selenium WebDriver with the page already loaded
inner_html = browser.execute_script("return document.body.innerHTML")
# Pass the string directly; str(bytes) would add a literal b'...' wrapper in Python 3
soup = BeautifulSoup(inner_html, 'lxml')  # the 'lxml' parser requires the lxml package

# Identify the table that will contain your <div> tags by its class
table = soup.find('table', attrs={'class': 'class_name_of_table_here'})
table_body = table.find('tbody')
divs = table_body.find_all('div', attrs={'class': 'someclass'})

# Print the text of every matching div, not just the last one
for div in divs:
    print(div.text.strip())
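
One thing to watch with this approach: `find` returns `None` when nothing matches, so a missing table or `<tbody>` would raise an `AttributeError` on the next call. A self-contained sketch with a static page and explicit checks, where the HTML and the class name `results` are invented for the example:

```python
from bs4 import BeautifulSoup

# Illustrative page: one div inside the table, one outside it
html = '''<table class="results">
  <tbody>
    <tr><td><div class="someclass"><p>some text</p></div></td></tr>
  </tbody>
</table>
<div class="someclass"><p>outside the table</p></div>'''

soup = BeautifulSoup(html, 'html.parser')

texts = []
table = soup.find('table', attrs={'class': 'results'})
if table is not None:
    # Fall back to the table itself if the markup has no <tbody>
    body = table.find('tbody')
    if body is None:
        body = table
    texts = [
        div.get_text(strip=True)
        for div in body.find_all('div', attrs={'class': 'someclass'})
    ]
print(texts)
```

Only the div inside the table is picked up; the one outside it is ignored.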


0

If you want to select a <p> that sits inside a div with the class someclass, you can use a CSS selector:

html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>   
          </div>
          <div class="someclass" itemprop="text">
            <div>not this text</div>   
          </div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# select_one() returns the first match; use select() to get all matches
p = soup.select_one('div.someclass p')
print(p.text)
# some text
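
To cover the dataframe part of the question, the same selector can feed pandas. This is only a sketch: it assumes pandas is what is meant by "dataframes", and the sample HTML and the column name `text` are arbitrary choices for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative HTML: two paragraphs in the wanted div, one in another div
html = '''<div class="someclass" itemprop="text">
            <p>first paragraph</p>
            <p>second paragraph</p>
          </div>
          <div class="otherclass">
            <p>ignored</p>
          </div>'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns every <p> that is a descendant of div.someclass
texts = [p.get_text(strip=True) for p in soup.select('div.someclass p')]

# One row per paragraph; the column name is arbitrary
df = pd.DataFrame({'text': texts})
print(df)
```

Building the list first and constructing the DataFrame once tends to be simpler than appending rows inside the loop.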
