
I'm new to Python and would like your advice on an issue I've encountered recently. I'm doing a small project where I tried to scrape a comic website to download a chapter (the pictures). However, when I printed the page content for testing (because BeautifulSoup's select() returned no results), it only showed a single line of HTML:

'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'

Any help would be really appreciated.
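For context, that one-line response is a simple JavaScript cookie challenge: the server expects the client to set the VinaHost-Shield cookie and request the page again. As a sketch (assuming the cookie value can be parsed out with a regex, and that the site accepts it without a real browser, which is not guaranteed), you could extract the pair and retry:

```python
import re

def parse_shield_cookie(body):
    """Extract the name/value pair from a document.cookie challenge line."""
    m = re.search(r'document\.cookie="([^=]+)=([0-9a-f]+)"', body)
    return (m.group(1), m.group(2)) if m else None

# The challenge text from the question:
challenge = 'document.cookie="VinaHost-Shield=a7a00919549a80aa44d5e1df8a26ae20"+"; path=/";window.location.reload(true);'
print(parse_shield_cookie(challenge))

# Retrying would then look something like:
#   name, value = parse_shield_cookie(res.text)
#   res = session.get(url, cookies={name: value})
```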

from requests_html import HTMLSession
session = HTMLSession()

res = session.get("https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html")
res.html.render()
print(res.content)

I also tried this, but the result was the same.

import requests, bs4

url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"
res = requests.get(url, headers={"User-Agent": "Requests"})
res.raise_for_status()
# soup = bs4.BeautifulSoup(res.text, "html.parser")
# onePiece = soup.select(".page-chapter")
print(res.content)

Update: I installed Docker and Splash (on Windows 11) and it worked. I've included the updated code below. Thanks Franz and others for your help.

import os
import requests, bs4

os.makedirs("OnePiece", exist_ok=True)
url = "https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html"

# Ask the local Splash instance to render the page, waiting 5 s for the JS.
res = requests.get("http://localhost:8050/render.html", params={"url": url, "wait": 5})
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
onePiece = soup.find_all("img", class_="lazy")
for element in onePiece:
    imageLink = "https:" + element["data-cdn"]
    res = requests.get(imageLink, stream=True)
    res.raise_for_status()
    with open(os.path.join("OnePiece", os.path.basename(imageLink)), "wb") as imageFile:
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
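For reference, the local Splash service that the code above talks to on port 8050 can be started with the official Docker image:

```shell
# Official Splash image; exposes the HTTP render API on port 8050.
docker run --rm -p 8050:8050 scrapinghub/splash
```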
  • Your problem is not in the parser but at page-download time. Read my answer for details. Good luck. Commented Sep 21, 2022 at 20:30

2 Answers

import urllib.request

# Fetch the page and print the raw HTML bytes.
request_url = urllib.request.urlopen('https://truyenqqpro.com/truyen-tranh/dao-hai-tac-128-chap-1060.html')
print(request_url.read())

It will return the HTML code of the page. By the way, that HTML loads several images; you need to use a regex to track down those img URLs and download them.
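As a minimal sketch of that regex idea, run on a sample snippet here because the live page sits behind the cookie shield (the lazy class and data-cdn attribute come from the asker's working code; the sample URLs are made up):

```python
import re

# Sample of the markup the asker's working code targets (structure assumed).
html = '''
<img class="lazy" data-cdn="//cdn.example.com/one-piece/1060/001.jpg">
<img class="lazy" data-cdn="//cdn.example.com/one-piece/1060/002.jpg">
'''

# Pull every data-cdn URL out of the page and make it absolute.
links = ["https:" + m for m in re.findall(r'data-cdn="([^"]+)"', html)]
print(links)
```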


1 Comment

Thanks for your response. I tried it, but it did not work for me. I used Franz's guide and it worked.

This response means the site requires a JavaScript render that reloads the page using this cookie. To get the content, some workaround must be added.


I commonly use the Splash render engine from Scrapinghub; putting a sleep in the page renders all the content correctly. Other tools that render the same way are Selenium for Python or Puppeteer in JS.

Links for Splash and Puppeteer


1 Comment

Thanks a lot, Franz, it worked for me. I installed Docker and Splash, and the rest is history.
