Dynamically extract text from webpage using Python BeautifulSoup

Question

I'm trying to extract the player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:

# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

player_id = 'malcolm-brogdon-1'

url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
# Send a browser-like User-Agent with the request
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")

# Take the text that follows the first <strong> inside the first <p>
pos = page_soup.p.find("strong").next_sibling.strip()
print(pos)

However, I want to be able to do this in a more dynamic way (that is, to locate "Position:" and then find what comes after it). For some other players the webpage is structured slightly differently, and my current code wouldn't return the position (e.g. Cat Barber).

I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.

Andrej Kesely · Accepted Answer · 2020-08-06 04:38:25Z

You can select the element that contains the text "Position:" and then the next text sibling:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

pos = soup.select_one('strong:contains("Position")').find_next_sibling(text=True).strip()
print(pos)

Prints:

Guard
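
In newer releases of Soup Sieve (the selector library behind select_one), the :contains() pseudo-class is deprecated in favour of :-soup-contains(). Below is a minimal sketch of the same lookup with the newer spelling. It runs against a made-up inline HTML snippet rather than the live page (the snippet is an assumption for illustration), and it assumes Soup Sieve 2.1 or later is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the player's page
html = "<p><strong>Position:</strong> Guard</p>"
soup = BeautifulSoup(html, "html.parser")

# Select the <strong> whose text contains "Position" ...
label = soup.select_one('strong:-soup-contains("Position")')
# ... then take the text node that follows it
pos = label.find_next_sibling(string=True).strip()
print(pos)
```

Here string=True plays the same role as text=True in the code above; string is the newer name for that argument.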

EDIT: Another version:

import requests
from bs4 import BeautifulSoup


url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

pos = (
    # t can be None for tags without a single string child, so guard first
    soup.find("strong", text=lambda t: t and "Position" in t)
    .find_next_sibling(text=True)
    .strip()
)
print(pos)
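
Since the goal is to run this over many players, it may help to wrap the lookup in a function that returns None instead of raising when a page has no "Position" label. The following is only a sketch: extract_position and SAMPLE_HTML are hypothetical names, and the markup is an assumption about how such a page might look (with whitespace around the label, which in this sample is enough to make an exact match on "Position:" fail):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page layout may differ
SAMPLE_HTML = """
<div id="meta">
  <p><strong>Anthony Barber</strong></p>
  <p>
    <strong>
    Position:
    </strong>
    Guard
  </p>
</div>
"""


def extract_position(html):
    """Return the text after the "Position" label, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Substring test, so surrounding whitespace in the label does not matter
    label = soup.find(
        "strong", string=lambda s: s is not None and "Position" in s
    )
    if label is None:
        return None
    value = label.find_next_sibling(string=True)
    return value.strip() if value is not None else None


# In this sample the exact match finds nothing, because the label's
# string includes the surrounding newlines and spaces
exact = BeautifulSoup(SAMPLE_HTML, "html.parser").find("strong", string="Position:")

print(exact)
print(extract_position(SAMPLE_HTML))
```

For the live site, the html argument would be the downloaded page content, e.g. requests.get(url).content as in the code above.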

edited Aug 6, 2020 at 4:38

answered Aug 6, 2020 at 4:29

4 Comments

Christine Over a year ago

When I run this code I get the following error: NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type. Any idea why?

Andrej Kesely Over a year ago

@Christine you are using an old version of BeautifulSoup. Update to the latest version.

Andrej Kesely Over a year ago

@Christine I also added another version you can try (it may work with an old version of bs4).

Christine Over a year ago

I updated my version because that's probably just a good idea in general. It works great now! Thanks so much. I'll go ahead and mark this as the answer.
