Loop pages and save contents in Excel file from website in Python

Question

I'm trying to loop pages from this link and extract the interesting part.

Please see the contents in the red circle in the image below.

Here's what I've tried:

url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)

xpath for each element (might be helpful for those that don't read Chinese):

/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span  --> 【润华物业】
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/a --> 润华物业:关于公司购买理财产品的公告
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/label --> 2017-04-24
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/span --> 公告编号：2017-019 证券代码：836007 证券简称：润华物业  主办券商：国联证券
/html/body/div[3]/div/div[2]/div[2]/div[3]/a --> http://data.eastmoney.com/notices/detail/836007/AN201704250530124271,JWU2JWI2JWE2JWU1JThkJThlJWU3JTg5JWE5JWU0JWI4JTlh.html

I need to save the output to an Excel file. How could I do that in Python? Many thanks.

marc_s · Accepted Answer · 2020-12-26 20:44:58Z

1

BeautifulSoup won't see this stuff, as it's rendered dynamically by JS, but there's an API endpoint you can query to get what you're after.

Here's how:

import requests
import pandas as pd


def clean_up(text: str) -> str:
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')


def get_data(page_number: int) -> dict:
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page_number}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page_number}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    return requests.get(url, headers=headers).json()


def parse_response(response: dict) -> list:
    for item in response["Data"]:
        title = clean_up(item['NoticeTitle'])
        date = item['NoticeDate']
        url = item['Url']
        notice_content = clean_up(" ".join(item['NoticeContent'].split()))
        company_name = item['SecurityFullName']
        print(f"{company_name} - {title} - {date}")
        yield [title, url, date, company_name, notice_content]


def save_results(parsed_response: list):
    df = pd.DataFrame(
        parsed_response,
        columns=['title', 'url', 'date', 'company_name', 'content'],
    )
    df.to_excel("test_output.xlsx", index=False)


if __name__ == "__main__":
    output = []
    for page in range(1, 11):
        for parsed_row in parse_response(get_data(page)):
            output.append(parsed_row)

    save_results(output)

This outputs:

栖霞物业购买资产的公告 - 2019-09-03 16:00:00 - 871792
索克物业购买资产的公告 - 2020-08-17 00:00:00 - 832816
中都物业购买股权的公告 - 2019-12-09 16:00:00 - 872955
开元物业:开元物业购买银行理财产品的公告 - 2015-05-21 16:00:00 - 831971
开元物业:开元物业购买银行理财产品的公告 - 2015-04-12 16:00:00 - 831971
盛全物业:拟购买房产的公告 - 2017-10-30 16:00:00 - 834070
润华物业购买资产暨关联交易公告 - 2016-08-23 16:00:00 - 836007
润华物业购买资产暨关联交易公告 - 2017-08-14 16:00:00 - 836007
萃华珠宝:关于拟购买物业并签署购买意向协议的公告 - 2017-07-10 16:00:00 - 002731
赛意信息:关于购买办公物业的公告 - 2020-12-02 00:00:00 - 300687

And saves this to a .csv file that can be easily handled by excel.

PS. I don't know Chinese (?) so you'd have to look into the response contents and pick more stuff out.

edited Dec 26, 2020 at 20:44

marc_s

760k186 gold badges1.4k silver badges1.5k bronze badges

answered Dec 2, 2020 at 10:56

baduker

20.2k9 gold badges43 silver badges63 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

chitown88 Over a year ago

I was looking for that but couldn't find it. Where did you find the api url?

baduker Over a year ago

@chitown88 it's in the Developer Tool -> Network -> Other and it's the one with GetSearchList initiated by jquery. If you drop the jquery param from the url, you'll get a pure JSON response.

chitown88 Over a year ago

ah ok. I see. I didn't check under Other

baduker Over a year ago

I've updated the answer. You can set up a for loop to "paginate" through the pages, if you wish.

baduker Over a year ago

@ahbon I've updated the answer. This -> str indicates type of the return value.

|

marc_s · Accepted Answer · 2020-12-26 20:44:51Z

Updated answer based on @baduker's solution, but not working out for loop pages.

import requests
import pandas as pd

for page in range(10):
    url = "http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": "http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    
    response = requests.get(url, headers=headers).json()
    
    output_data = []
    for item in response["Data"]:
        # print(item)
        # print('*' * 40)
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        url = item['Url']
        date = item['NoticeDate'].split(' ')[0]
        company_name = item['SecurityFullName']
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        # url_code = item['Url'].split('/')[5]
        output_data.append([title, url, date, company_name, content])

names = ['title', 'url', 'date', 'company_name', 'content']
df = pd.DataFrame(output_data, columns = names)
df.to_excel('test.xlsx', index = False)

Collectives™ on Stack Overflow

Loop pages and save contents in Excel file from website in Python

2 Answers 2

9 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

9 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related