I want to extract the text content inside an a-tag element. The code looks like this:

<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>

In the past these a-tag elements didn't have a "data-" attribute but a normal "id" attribute, which was simple to select on. Now I have no idea how this should work. I tried this, but it doesn't do the job:

self.article_title = item.select_one('a', data_autid='article-url').text.strip()

Any idea what I could do?

2 Answers

You can use an [attr=value] CSS Selector:

Represents elements with an attribute name of attr whose value is exactly value.


To use a CSS Selector, use the .select_one() method instead of find().

In your example:

from bs4 import BeautifulSoup

html = """<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one('a[data-autid="article-url"]').text)
# Output: HERE STANDS THE TEXT I WANT TO EXTRACT

Or, if you prefer find(), pass the attribute through attrs, since "data-autid" is not a valid Python keyword argument name:

print(soup.find("a", attrs={"data-autid": "article-url"}).text)
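Both select_one() and find() return None when nothing matches, so it can help to check the result before reading .text. A minimal sketch, assuming a hypothetical helper named extract_title (not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup


def extract_title(markup):
    """Return the text of the first <a data-autid="article-url">, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    link = soup.select_one('a[data-autid="article-url"]')
    if link is None:
        # Nothing matched: the element is not in the HTML that was parsed.
        return None
    return link.get_text(strip=True)


html = '<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>'
print(extract_title(html))
print(extract_title("<a id='other'>no match</a>"))
```

If this returns None on the real page, the tag is probably not present in the HTML you downloaded.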

2 Comments

Sadly, neither option works, and I don't know why. There is no error; the variable just ends up with no content.
@NiklasKlotz The page is probably loaded dynamically. In that case you should use the selenium module to render it instead.
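For a dynamically loaded page, one possible approach is to let a browser render it and then hand the resulting HTML to BeautifulSoup. This is only a sketch under assumptions: the function names fetch_rendered_html and article_titles are made up for illustration, a local Chrome installation is assumed, and the 10-second timeout is arbitrary.

```python
from bs4 import BeautifulSoup

SELECTOR = 'a[data-autid="article-url"]'


def fetch_rendered_html(url, timeout=10):
    """Load url in a browser and return the HTML after scripts have run."""
    # Imported lazily so article_titles() works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Wait until at least one matching link exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()


def article_titles(page_html):
    """Return the stripped text of every matching link in page_html."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select(SELECTOR)]
```

Usage would look like article_titles(fetch_rendered_html('yoururl')), where 'yoururl' is a placeholder for the real address.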

You can try this:

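from lxml import html
import requests

# Don't name the response `html`, or it shadows the lxml module imported above
response = requests.get('yoururl')
tree = html.fromstring(response.content)
# xpath() returns a list of the matching text nodes
yourtext = tree.xpath('//a[@data-autid="article-url"]/text()')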

1 Comment

Why use lxml? The OP tagged the question BeautifulSoup.
