Webscraping from React web application after componentDidMount

Question

I'm currently using Python3 as a way to learn webscraping and ran into a curious issue. For context, I'm trying to scrape some data off of https://www.cnn.com/ and retrieve various news headlines. I'm using the requests and BeautifulSoup libraries.

I wasn't getting anything substantive in my responses. Upon sending a simpler request:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.cnn.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

The response is mostly CSS and some JavaScript. Only at the bottom do I see a div, which I assume React renders into. The problem is that I can't get the data this way. I think CNN fills in this data using something like useEffect or componentDidMount, so it isn't present in the initial HTML. That makes no difference to a human user, but it causes problems here.

What can I do to circumvent this issue?

Bertrand Martel · Accepted Answer · 2020-06-13 02:41:10Z

In Chrome's developer tools, the Network tab shows a number of requests ending with /zone-manager.izl.

The content is JSON with an html field that contains HTML markup (including the headlines we are looking for).
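To make that response shape concrete, here is a minimal offline sketch that decodes such a payload and pulls the heading text out of its html field. The sample payload, the HeadlineParser class and the headlines_from_payload function are made up for illustration; only the "html" key comes from the response described above. It uses the standard library's json and html.parser modules rather than BeautifulSoup so that it runs without a network connection or extra packages.

```python
import json
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside <h2> and <h3> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # text chunks of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in ("h2", "h3") and self._current is not None:
            text = "".join(self._current).strip()
            if text:
                self.headlines.append(text)
            self._current = None


def headlines_from_payload(raw_json):
    """Decode a zone response and return the heading texts in its "html" field."""
    payload = json.loads(raw_json)
    parser = HeadlineParser()
    parser.feed(payload.get("html", ""))
    parser.close()
    return parser.headlines


# Hypothetical sample, shaped like the response described above.
sample = json.dumps({
    "html": '<section><h2 class="banner">Main story</h2>'
            '<ul><li><h3><a href="/a">First item</a></h3></li>'
            '<li><h3><a href="/b">Second item</a></h3></li></ul></section>'
})

print(headlines_from_payload(sample))
# ['Main story', 'First item', 'Second item']
```

The point is that the HTML has to be taken out of the decoded JSON before it is handed to a parser; feeding the raw response body to an HTML parser would leave the JSON escaping in the way.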

The content is organized in 4 zones, with 2 URL formats. Here is sample code to fetch all of them:

import requests

# The two URL patterns seen in the Network tab; the zone number is appended below.
pageType1 = "_intl-homepage-zone-injection/index.html:intl_homepage-injection-zone"
pageType2 = "index.html:intl_homepage1-zone"

# Zones are numbered 1 through 4.
for i in range(1, 5):
    for pageType in (pageType1, pageType2):
        r = requests.get(f"https://edition.cnn.com/data/ocs/section/{pageType}-{i}/views/zones/common/zone-manager.izl")
        # Each response is JSON; the markup is in its "html" field.
        print(r.json()["html"])
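Since these endpoints are undocumented and may change or disappear, a more defensive variant can be useful. The sketch below builds the same eight URLs in the same order as the loop above and wraps the fetch so that a failed request or an unexpected body is skipped instead of raising. The names zone_urls and fetch_zone_html and the 10-second timeout are assumptions for illustration, and it uses the standard library's urllib instead of requests so it has no dependencies.

```python
import json
import urllib.request

BASE = "https://edition.cnn.com/data/ocs/section"
SUFFIX = "views/zones/common/zone-manager.izl"
PAGE_TYPES = (
    "_intl-homepage-zone-injection/index.html:intl_homepage-injection-zone",
    "index.html:intl_homepage1-zone",
)


def zone_urls(zones=range(1, 5)):
    """Build the zone URLs: for each zone number, one URL per page type."""
    return [f"{BASE}/{page}-{i}/{SUFFIX}" for i in zones for page in PAGE_TYPES]


def fetch_zone_html(url, timeout=10):
    """Return the "html" field of one zone, or None if anything goes wrong."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("html")
    except (OSError, ValueError, AttributeError) as exc:
        # OSError covers network and HTTP errors, ValueError covers bad URLs
        # and invalid JSON, AttributeError covers JSON that is not an object.
        print(f"skipping {url}: {exc}")
        return None


# Only list the URLs here; no request is sent until fetch_zone_html is called.
for url in zone_urls():
    print(url)
```

Calling fetch_zone_html(url) on each entry then yields either an HTML string ready for parsing or None for zones that could not be read.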

The URL that returns the main headlines appears to be:

https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl

Then you can use BeautifulSoup or any other HTML parser to extract your data.

For instance, to get the h2 and h3 tags (the headlines):

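import requests
from bs4 import BeautifulSoup

r = requests.get("https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl")
# The response body is JSON, so parse the markup from its "html" field rather than the raw text.
soup = BeautifulSoup(r.json()["html"], 'html.parser')

# The single h2 is the lead headline; the h3 tags are the remaining ones.
print(soup.find("h2").text)
print([t.text for t in soup.find_all("h3")])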

Output:

Military leaders take a stand as Trump stays silent
['The US military -- which Trump often uses to bolster himself as a commander in chief -- is moving on from the President on racial inequality', 'Derek Chauvin eligible for $1M pension', 'Live Protests continue to grow across the US', "analysis Floyd protests have a plot twist I didn't see coming", "Fox News anchor calls out Trump for saying he's done more for African Americans than any president", 'What if the next Donald Trump is, well, Donald Trump?', "Cuomo: Proof of systemic racism is in Trump's Cabinet", "Videos raise question about in-custody death deemed an 'accident' by officials", 'Woman caught on video harassing Asian American exercising in park', 'The Tyrion Lannister lookalike dreaming of Bollywood stardom', 'New book about Melania Trump says she renegotiated her prenuptial agreement', 'Young Americans are having less sex', "Kareem Abdul-Jabbar's son arrested for allegedly stabbing neighbor", 'Outrage over single mother who died after waiting days for bus home during lockdown', 'Face masks are best way to reduce coronavirus transmission, study finds', 'Stunning images show how virus is overrunning hospitals', 'Achaeologist jailed for faking finds', 'Poland invaded Czech Republic last month, says it was just a misunderstanding']

Very cool. I didn't think of viewing network requests in order to find the relevant data. I'll definitely keep this in mind for future webscraping activities :)
