I'm using Python 3 to learn web scraping and ran into a curious issue. For context, I'm trying to scrape https://www.cnn.com/ to retrieve various news headlines, using the requests and BeautifulSoup libraries.

I wasn't getting anything substantive in my responses. Upon sending a simpler request:

import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.cnn.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

I'm met with a bunch of what looks like CSS and some JavaScript. Only at the bottom do I see a div, which I assume React renders into. The problem is that I can't get the data this way. I think CNN fills in this data after the page loads, using something like useEffect or componentDidMount, so it doesn't appear in the initial HTML. This isn't a concern for a human user, but it causes problems here.

What can I do to circumvent this issue?

1 Answer

In the Chrome developer console, if you check the Network tab, you will see a number of requests ending with /zone-manager.izl:

(Screenshot: headline requests in the Network tab)

The response is JSON with an html field that contains HTML content (including the headlines we are looking for).
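
As a minimal sketch of that structure, the snippet below decodes such a payload with the standard json module and returns its html field. The helper name extract_html and the sample payload are made up for illustration; a real response will be much larger and may carry other fields.

```python
import json


def extract_html(payload_text):
    """Return the "html" field of a JSON payload, or "" if it is missing.

    extract_html is a hypothetical helper, not part of any library.
    """
    data = json.loads(payload_text)
    if not isinstance(data, dict):
        return ""
    return data.get("html", "")


# Made-up payload that only mimics the shape described above.
sample = '{"html": "<h2>Example headline</h2>"}'
print(extract_html(sample))
```

With requests, r.json()["html"] does the same thing in one step; the helper only adds a fallback for payloads without that field.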

The content is organized into 4 zones with 2 URL formats. Here is sample code to fetch all of them:

import requests

pageType1 = "_intl-homepage-zone-injection/index.html:intl_homepage-injection-zone"
pageType2 = "index.html:intl_homepage1-zone"

for i in range(1,5):
    r = requests.get(f"https://edition.cnn.com/data/ocs/section/{pageType1}-{i}/views/zones/common/zone-manager.izl")
    print(r.json()["html"])
    r = requests.get(f"https://edition.cnn.com/data/ocs/section/{pageType2}-{i}/views/zones/common/zone-manager.izl")
    print(r.json()["html"])
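
If you would rather build the URLs once and fetch them separately, the same pattern can be sketched as a small helper. The function name zone_urls and the constant names are hypothetical; the URL pieces are copied from the code above, and the zone count of 4 is only what was observed here.

```python
BASE = "https://edition.cnn.com/data/ocs/section"
SUFFIX = "views/zones/common/zone-manager.izl"
PAGE_TYPES = (
    "_intl-homepage-zone-injection/index.html:intl_homepage-injection-zone",
    "index.html:intl_homepage1-zone",
)


def zone_urls(zones=range(1, 5)):
    """Return one URL per (zone number, page type) pair, in request order."""
    return [
        f"{BASE}/{page_type}-{i}/{SUFFIX}"
        for i in zones
        for page_type in PAGE_TYPES
    ]


for url in zone_urls():
    print(url)
```

Each URL can then be passed to requests.get(url, timeout=10), with a status check before calling .json() on the response.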

It seems the URL that returns the headlines is:

https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl

Then you can use BeautifulSoup or any other HTML parser to extract your data.

For instance, to get the h2 and h3 tags (i.e. the headlines):

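import requests
from bs4 import BeautifulSoup

r = requests.get("https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl")
# The response is JSON; the markup to parse lives in its "html" field
soup = BeautifulSoup(r.json()["html"], 'html.parser')

# The first h2 is the main headline; the h3 tags hold the other headlines
print(soup.find("h2").text)
print([t.text for t in soup.find_all("h3")])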

Output :

Military leaders take a stand as Trump stays silent
['The US military -- which Trump often uses to bolster himself as a commander in chief -- is moving on from the President on racial inequality', 'Derek Chauvin eligible for $1M pension', 'Live Protests continue to grow across the US', "analysis Floyd protests have a plot twist I didn't see coming", "Fox News anchor calls out Trump for saying he's done more for African Americans than any president", 'What if the next Donald Trump is, well, Donald Trump?', "Cuomo: Proof of systemic racism is in Trump's Cabinet", "Videos raise question about in-custody death deemed an 'accident' by officials", 'Woman caught on video harassing Asian American exercising in park', 'The Tyrion Lannister lookalike dreaming of Bollywood stardom', 'New book about Melania Trump says she renegotiated her prenuptial agreement', 'Young Americans are having less sex', "Kareem Abdul-Jabbar's son arrested for allegedly stabbing neighbor", 'Outrage over single mother who died after waiting days for bus home during lockdown', 'Face masks are best way to reduce coronavirus transmission, study finds', 'Stunning images show how virus is overrunning hospitals', 'Achaeologist jailed for faking finds', 'Poland invaded Czech Republic last month, says it was just a misunderstanding']
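
Since any HTML parser will do, here is a rough equivalent that uses only the standard library's html.parser instead of BeautifulSoup. The names HeadlineParser and extract_headlines are invented for this sketch, and the sample markup is made up rather than taken from CNN.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2> and <h3> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # text pieces of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag in ("h2", "h3") and self._current is not None:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._current).split())
            if text:
                self.headlines.append(text)
            self._current = None


def extract_headlines(html):
    """Return the h2/h3 texts found in an HTML string, in document order."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


# Made-up markup standing in for the "html" field of the JSON response.
sample = "<div><h2>Top <span>story</span></h2><p>Body</p><h3>Second story</h3></div>"
print(extract_headlines(sample))
```

Unlike the BeautifulSoup version, this returns h2 and h3 texts in a single list, so split them by tag yourself if you need the main headline separately.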

1 Comment

Very cool. I didn't think of viewing network requests in order to find the relevant data. I'll definitely keep this in mind for future webscraping activities :)
