
Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here.

Now I would like to go further and follow each URL link.

For each URL, I will need to open the page and save the linked file in PDF format.

How could I do that in Python? Any help would be greatly appreciated.

Code for reference:

import shutil
from bs4 import BeautifulSoup
import requests

url = 'xxx'  # page-template URL elided by the asker
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        # NOTE: link.text is the same for every link, so each download
        # overwrites the previous file (see the comments below).
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
  • The link.text is probably "查看PDF原文" ("view the original PDF"). Do you really want to use that as the file name? Otherwise each PDF you download will overwrite the previous one. Commented Dec 7, 2020 at 7:07
  • It would be ideal to name them by the title of each transaction, or by the last part of the URL. Commented Dec 7, 2020 at 7:33
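As the second comment suggests, one way to avoid the overwrites is to derive a unique file name from the last part of each URL. A minimal sketch (the helper name is hypothetical, and it assumes detail-page URLs end in a unique segment like "AN201909041348533085,….html"):

```python
from urllib.parse import urlparse

def filename_from_url(url):
    # Take the last path segment of the detail-page URL
    # and swap its extension for .pdf.
    last_segment = urlparse(url).path.rsplit('/', 1)[-1]
    stem = last_segment.rsplit('.', 1)[0]
    return stem + '.pdf'

print(filename_from_url(
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,x.html"
))
```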

2 Answers


An example of downloading a PDF file from one of the links in your uploaded Excel file:

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files,
# collect the URLs in a list and loop over them.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb") as f:  # file names can't contain ':', so replace it with '-'
    f.write(data)

And it downloads successfully.
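The ':' replacement above handles only one forbidden character; Windows disallows several others in file names. A more general sanitizer (a sketch, not part of the original answer):

```python
import re

def safe_filename(title):
    # Replace every character Windows disallows in file names with '-'.
    return re.sub(r'[\\/:*?"<>|]', '-', title).strip()

print(safe_filename('公告: 2019年度报告?'))
```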




Here's a slightly different approach. You don't have to open the URLs from the Excel file at all, because you can build the .pdf source URLs yourself.

For example:

import requests

urls = [
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2JWEwJTk2JWU5JTljJTllJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/872955/AN201912101371726768,JWU0JWI4JWFkJWU5JTgzJWJkJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/832816/AN202008171399155565,JWU3JWI0JWEyJWU1JTg1JThiJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/831971/AN201505220009713696,JWU1JWJjJTgwJWU1JTg1JTgzJWU3JTg5JWE5JWU0JWI4JTlh.html",
]

for url in urls:
    # The notice id is the part of the last path segment before the comma.
    file_id, _ = url.split('/')[-1].split(',')
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    print(f"Fetching {pdf_file_url}...")
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(requests.get(pdf_file_url).content)
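The id-extraction step can be factored into a small helper so it is easy to test on its own. A sketch, assuming the `H2_{id}_1.pdf` pattern from the loop above holds (it may not for every notice):

```python
def pdf_url_for(detail_url):
    # The notice id is the part of the last path segment before the comma.
    file_id = detail_url.split('/')[-1].split(',')[0]
    return f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"

print(pdf_url_for(
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2.html"
))
```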

1 Comment

Thanks! Would you mind adding your code to your answer on my last question as well? It would be very helpful.
