
Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here.

Now I would like to go further and follow each URL link.

For each URL, I will need to open the page and save the linked file in PDF format.

How could I do that in Python? Any help would be greatly appreciated.

Code for reference:

import shutil
from bs4 import BeautifulSoup
import requests

url = 'xxx'  # page-template URL elided by the asker
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        # NOTE: link.text is the same for every link, so each download
        # overwrites the previous file (see the comments below).
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
  • The link.text is probably "查看PDF原文" ("view the original PDF"). Do you really want to use that as the file name? Otherwise each PDF you download will overwrite the previous one. Commented Dec 7, 2020 at 7:07
  • It would be ideal to name them by the title of each transaction, or by the last part of the URL. Commented Dec 7, 2020 at 7:33
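As the second comment suggests, one way to avoid the overwrites is to derive a unique file name from the last part of each URL. A minimal sketch (the helper name is hypothetical, and it assumes detail-page URLs end in a unique segment like "AN201909041348533085,….html"):

```python
from urllib.parse import urlparse

def filename_from_url(url):
    # Take the last path segment of the detail-page URL
    # and swap its extension for .pdf.
    last_segment = urlparse(url).path.rsplit('/', 1)[-1]
    stem = last_segment.rsplit('.', 1)[0]
    return stem + '.pdf'

print(filename_from_url(
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,x.html"
))
```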

2 Answers


An example of downloading a PDF file from one of the links in your uploaded Excel file:

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files,
# collect the URLs in a list and loop over them.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb") as f:  # file names can't contain ':', so replace it with '-'
    f.write(data)

And it downloads successfully.
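The ':' replacement above handles only one forbidden character; Windows disallows several others in file names. A more general sanitizer (a sketch, not part of the original answer):

```python
import re

def safe_filename(title):
    # Replace every character Windows disallows in file names with '-'.
    return re.sub(r'[\\/:*?"<>|]', '-', title).strip()

print(safe_filename('公告: 2019年度报告?'))
```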




Here's a slightly different approach. You don't have to open the URLs from the Excel file at all, because you can build the .pdf source URLs yourself.

For example:

import requests

urls = [
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2JWEwJTk2JWU5JTljJTllJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/872955/AN201912101371726768,JWU0JWI4JWFkJWU5JTgzJWJkJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/832816/AN202008171399155565,JWU3JWI0JWEyJWU1JTg1JThiJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/831971/AN201505220009713696,JWU1JWJjJTgwJWU1JTg1JTgzJWU3JTg5JWE5JWU0JWI4JTlh.html",
]

for url in urls:
    # The notice id is the part of the last path segment before the comma.
    file_id, _ = url.split('/')[-1].split(',')
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    print(f"Fetching {pdf_file_url}...")
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(requests.get(pdf_file_url).content)
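The id-extraction step can be factored into a small helper so it is easy to test on its own. A sketch, assuming the `H2_{id}_1.pdf` pattern from the loop above holds (it may not for every notice):

```python
def pdf_url_for(detail_url):
    # The notice id is the part of the last path segment before the comma.
    file_id = detail_url.split('/')[-1].split(',')[0]
    return f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"

print(pdf_url_for(
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2.html"
))
```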

1 Comment

Thanks! Would you mind adding your code to your answer on my last question as well? It would be very helpful.
