I am currently learning data scraping using the BeautifulSoup package. At the moment, I am trying to get a list of the movie franchises from the Box Office Mojo website (https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab).

The main problem is that I can't seem to access or extract the data within the <main> tag. Below is the code I am using.

import requests
from bs4 import BeautifulSoup

listOfFranchiseLink = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"

r = requests.get(listOfFranchiseLink)
soup = BeautifulSoup(r.content, 'html.parser')

s0 = soup.find('div', id='a-page')
s1 = s0.find(id='')
s2 = s1.find('div', id='a-section mojo-body aok-relative')

assert s1 is not None
assert s2 is not None

While the script does find something for 's1', it is not what I expect, which is an element whose first child is a div with the class "a-section mojo-body aok-relative". As a result, 's2' is None.

My questions are:

  1. What am I doing wrong? How can I extract data inside the <main> tag?
  2. I have a feeling creating a soup object for each layer is not very efficient. What is the more standard way to extract data buried within layers of different HTML tags?

Edit: Meant to write s0.find('main') instead of s0.find(id=''). But the former returned the same result as the latter, so it didn't really matter.

  • What do you want to scrape? the entire table? Commented Jun 22, 2022 at 14:47
  • Yes, at the moment, I am trying to scrape the table that contains the names of the movie franchises on the website. Commented Jun 22, 2022 at 15:09

1 Answer

s2 is None because s1 is not the element you expect. It is this:

<script data-a-state='{"key":"a-wlab-states"}' type="a-state">{}</script>

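so searching it for id='a-section mojo-body aok-relative' yields nothing. Hence the second assert fails. Note also that "a-section mojo-body aok-relative" is a class value, not an id, so it would have to be matched with class_= rather than id=.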
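On the second question, here is a minimal sketch of the more usual approach: parse once and reach the nested element with a single CSS selector instead of chaining find() calls. The HTML below is a made-up, simplified stand-in for the real page (the franchise names and links are placeholders), so treat the selector as an assumption to check against the actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
html = """
<div id="a-page">
  <script type="a-state">{}</script>
  <main>
    <div class="a-section mojo-body aok-relative">
      <table>
        <tr><th>Franchise</th></tr>
        <tr><td><a href="/franchise/a/">Franchise A</a></td></tr>
        <tr><td><a href="/franchise/b/">Franchise B</a></td></tr>
      </table>
    </div>
  </main>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching the class string as an id finds nothing: no element has that id.
by_id = soup.find("div", id="a-section mojo-body aok-relative")

# Matching it as a class works (class_ accepts the exact attribute string).
by_class = soup.find("div", class_="a-section mojo-body aok-relative")

# One CSS selector walks the nesting in a single call.
names = [a.get_text(strip=True) for a in soup.select("main div.mojo-body td a")]
print(by_id, by_class is not None, names)
```

Selecting on a single distinctive class such as mojo-body tends to be less brittle than matching the full class string, since the order and number of classes can change.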

If you want to scrape the table, you can go with just pandas and requests, like this:

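import io

import pandas as pd
import requests

url = "https://www.boxofficemojo.com/franchise/?ref_=bo_nb_bns_secondarytab"
html = requests.get(url).text

# read_html returns one DataFrame per <table> on the page; take the first.
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning
# that newer pandas versions emit.
df = pd.read_html(io.StringIO(html), flavor="lxml")[0]
print(df)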

To get this:

                           Franchise  ... Lifetime Gross
0          Marvel Cinematic Universe  ...   $858,373,000
1                          Star Wars  ...   $936,662,225
2    Disney Live Action Reimaginings  ...   $543,638,043
3                         Spider-Man  ...   $804,789,334
4     J.K. Rowling's Wizarding World  ...   $381,011,219
..                               ...  ...            ...
287                 Ip Man Franchise  ...     $2,679,437
288                   Chal Mera Putt  ...       $644,000
289                           Shiloh  ...     $1,007,822
290                       Evangelion  ...       $174,945
291                            V/H/S  ...       $100,345

[292 rows x 5 columns]
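The gross figures come back as strings such as "$858,373,000". If you need to sort or sum them, one possible follow-up is to strip the currency formatting and cast to integers. This is only a sketch: it uses two rows copied from the output above rather than the live page, and the new column name "Lifetime Gross (USD)" is my own choice, not something the site provides.

```python
import pandas as pd

# Two rows copied from the printed output above; the real frame has 292 rows.
df = pd.DataFrame(
    {
        "Franchise": ["Marvel Cinematic Universe", "Star Wars"],
        "Lifetime Gross": ["$858,373,000", "$936,662,225"],
    }
)

# Remove "$" and "," and convert the remaining digits to integers.
# "Lifetime Gross (USD)" is a hypothetical column name for illustration.
df["Lifetime Gross (USD)"] = (
    df["Lifetime Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")
)
print(df)
```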

3 Comments

you beat me while I was installing pandas. :)
@MendelG, I keep a separate venv just for SO questions with pandas always there. :)
Thank you! This is a much more suitable extraction for my use.
