Scrape content of element with data- attribute - Python BeautifulSoup

Question

I want to extract the text content of an a-tag element. The HTML looks like this:

<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>

In the past these a-tag elements didn't have a "data-" attribute but a normal "id" attribute, which was super simple to extract. Now I have no idea how this should work. I tried this, but it doesn't do the job:

self.article_title = item.select_one('a', data_autid='article-url').text.strip()

Any idea what I could do?

MendelG · Accepted Answer · 2021-05-03 17:01:05Z

3

You can use an [attr=value] CSS Selector:

Represents elements with an attribute name of attr whose value is exactly value.

To use a CSS Selector, use the .select_one() method instead of find().

In your example:

from bs4 import BeautifulSoup

html = """<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one('a[data-autid="article-url"]').text)
# Output: HERE STANDS THE TEXT I WANT TO EXTRACT

Or: If you want to use find():

print(soup.find("a", attrs={"data-autid": "article-url"}).text)
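Note that the attribute condition goes inside the selector string for select_one(), and into the attrs dictionary for find(), because a name containing a hyphen such as data-autid cannot be written as a Python keyword argument. Both calls return None when nothing matches, so calling .text on the result directly raises an AttributeError. Below is a minimal sketch of a None-safe version; the helper name extract_title and the sample markup are made up for illustration and are not part of the original code:

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only
html = '<div><a data-autid="article-url" href="linkToTheWebsite">  Some title  </a></div>'
soup = BeautifulSoup(html, "html.parser")


def extract_title(item):
    """Return the stripped link text, or None if no matching <a> exists."""
    # Hypothetical helper: the attribute filter lives inside the CSS selector
    link = item.select_one('a[data-autid="article-url"]')
    if link is None:
        return None
    return link.get_text(strip=True)


title = extract_title(soup)
missing = extract_title(BeautifulSoup("<p>no link here</p>", "html.parser"))
print(title)
print(missing)
```

If the helper returns None on the real page even though the element is visible in the browser, the element is most likely not present in the HTML that was downloaded.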

edited May 3, 2021 at 17:01

answered May 3, 2021 at 16:54

MendelG


2 Comments

Niklas Klotz Over a year ago

Sadly, neither of the two options works. I don't know why. It doesn't give an error or anything; the variable just ends up with no content.

MendelG Over a year ago

@NiklasKlotz The page is probably loaded dynamically. You should use a module called selenium instead.

Matteo Bianchi · Accepted Answer · 2021-05-03 17:22:42Z

0

You can try this:

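from lxml import html
import requests

# Store the response under its own name so it does not shadow the lxml "html" module
response = requests.get('yoururl')
tree = html.fromstring(response.content)
# xpath() returns a list of the matching text nodes
yourtext = tree.xpath('//a[@data-autid="article-url"]/text()')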
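Since xpath() returns a list rather than a single string, the result usually needs one more step. The following is a small sketch on an inline HTML string instead of a live request; the sample markup and the variable names are assumptions for illustration:

```python
from lxml import html as lxml_html

# Sample markup, assumed for illustration only
page = '<div><a data-autid="article-url" href="linkToTheWebsite"> Some title </a></div>'
tree = lxml_html.fromstring(page)

# xpath() yields a list of text nodes; it is empty when nothing matches
matches = tree.xpath('//a[@data-autid="article-url"]/text()')
title = matches[0].strip() if matches else None
print(title)
```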

answered May 3, 2021 at 17:22

Matteo Bianchi


1 Comment

MendelG Over a year ago

Why use lxml? The OP has tagged the question BeautifulSoup.
