
I'm new to Python and would like your advice on an issue I've encountered recently. I'm doing a small project where I tried to scrape a comic website to download a chapter (the pictures). However, when I printed the page content for testing (because BeautifulSoup's select() returned no results), it only showed a single line of HTML:

'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'

Any help would be really appreciated.
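For context, that one-line response is a simple JavaScript cookie challenge: the server expects the client to set the VinaHost-Shield cookie and request the page again. As a sketch (assuming the cookie value can be parsed out with a regex, and that the site accepts it without a real browser, which is not guaranteed), you could extract the pair and retry:

```python
import re

def parse_shield_cookie(body):
    """Extract the name/value pair from a document.cookie challenge line."""
    m = re.search(r'document\.cookie="([^=]+)=([0-9a-f]+)"', body)
    return (m.group(1), m.group(2)) if m else None

# The challenge text from the question:
challenge = 'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'
print(parse_shield_cookie(challenge))

# Retrying would then look something like:
#   name, value = parse_shield_cookie(res.text)
#   res = session.get(url, cookies={name: value})
```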

from requests_html import HTMLSession
session = HTMLSession()

res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.content)

I also tried this, but the result was the same.

import requests, bs4

url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)

Update: I installed Docker and Splash (on Windows 11) and it worked. I've included the updated code below. Thanks Franz and others for your help.

import os
import requests, bs4

os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"

# Ask the local Splash instance to render the page, waiting 5 s for the JS.
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
    imageLink = "https:" + element["data-cdn"]
    res = requests.get(imageLink, stream=True)
    res.raise_for_status()
    with open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb") as imageFile:
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
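For reference, the local Splash service that the code above talks to on port 8050 can be started with the official Docker image:

```shell
# Official Splash image; exposes the HTTP render API on port 8050.
docker run --rm -p 8050:8050 scrapinghub/splash
```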
  • Your problem is not in the parser but at page-download time. Read my answer for details. Good luck. Commented Sep 21, 2022 at 20:30

2 Answers

import urllib.request

# Fetch the page and print the raw HTML bytes.
request_url = urllib.request.urlopen('https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html')
print(request_url.read())

It will return the HTML code of the page. By the way, that HTML loads several images; you need to use a regex to track down those img URLs and download them.
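As a minimal sketch of that regex idea, run on a sample snippet here because the live page sits behind the cookie shield (the lazy class and data-cdn attribute come from the asker's working code; the sample URLs are made up):

```python
import re

# Sample of the markup the asker's working code targets (structure assumed).
html = '''
<img class="lazy" data-cdn="//cdn.example.com/one-piece/1060/001.jpg">
<img class="lazy" data-cdn="//cdn.example.com/one-piece/1060/002.jpg">
'''

# Pull every data-cdn URL out of the page and make it absolute.
links = ["https:" + m for m in re.findall(r'data-cdn="([^"]+)"', html)]
print(links)
```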


1 Comment

Thanks for your response. I tried it, but it did not work for me. I used Franz's guide and it worked.

This response means the site requires a JavaScript render that reloads the page using this cookie. To get the content, some workaround must be added.


I commonly use the Splash render engine from Scrapinghub; putting a sleep in the page renders all the content correctly. Other tools that render the same way are Selenium for Python or Puppeteer in JS.

Links for Splash and Puppeteer


1 Comment

Thanks a lot, Franz, it worked for me. I installed Docker and Splash, and the rest is history.
