I am currently learning data scraping using the BeautifulSoup package. At the moment, I am trying to get a list of the movie franchises from the Box Office Mojo website (https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab).
The main problem is that I can't seem to access or extract the data within the <main> tag. Below is the code I am using.
import requests
from bs4 import BeautifulSoup
listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')
s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')
assert s1 is not None
assert s2 is not None
The script does find something for 's1', but it is not what I expect: the result should contain, at the top, a div with the class "a-section mojo-body aok-relative". As a result, 's2' is None.
My questions are:
- What am I doing wrong? How can I extract data inside the <main> tag?
- I have a feeling that creating a soup object for each layer is not very efficient. What is the standard way to extract data buried within several layers of nested HTML tags?
Edit: I meant to write s0.find('main') instead of s0.find(id=''). The former returns the same result as the latter, though, so the mistake doesn't change the outcome.
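To make the second question concrete, here is a minimal sketch of the kind of single-step lookup I have in mind, using a CSS selector instead of one find() call per layer. It runs against a hypothetical inline HTML snippet (SAMPLE_HTML, with made-up franchise names and links), not the live page, so the real markup may well differ. The helper name franchise_names is also just an illustration. It also shows that "a-section mojo-body aok-relative" is a class value, so searching for it with id= finds nothing in this snippet.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; the live page may be structured differently.
SAMPLE_HTML = """
<div id="a-page">
  <main>
    <div class="a-section mojo-body aok-relative">
      <table>
        <tr><th>Franchise</th></tr>
        <tr><td><a href="/franchise/fr0000001/">Example Franchise A</a></td></tr>
        <tr><td><a href="/franchise/fr0000002/">Example Franchise B</a></td></tr>
      </table>
    </div>
  </main>
</div>
"""


def franchise_names(html):
    """Return the link texts found inside the mojo-body div under <main>."""
    soup = BeautifulSoup(html, "html.parser")
    # One CSS selector walks all the layers at once:
    # div with id "a-page" -> <main> descendant -> div carrying the class "mojo-body".
    body = soup.select_one("div#a-page main div.mojo-body")
    if body is None:
        return []
    return [a.get_text(strip=True) for a in body.select("td a")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Searching by id finds nothing here, because the string is a class value, not an id.
by_id = soup.find("div", id="a-section mojo-body aok-relative")
# Searching by a single class name matches the div.
by_class = soup.find("div", class_="mojo-body")

names = franchise_names(SAMPLE_HTML)
print(by_id is None, by_class is not None)
print(names)
```

If the same selector returns None on the real response, that would suggest the downloaded HTML differs from what the browser inspector shows, which is part of what I am trying to understand.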