
I'm trying to scrape an AJAX-loaded part of a webpage without executing the JavaScript. Using the Chrome dev tools, I found that the AJAX container pulls its content from a URL through a POST request, so I want to duplicate that request with the Python requests package. But strangely, using the header information given by Chrome, I always get a 400 error, and the same happens with the curl command copied from Chrome. So I'm wondering whether someone could kindly share some insights.

The website I'm interested in is here. In Chrome: Ctrl+Shift+I → Network → XHR, and the part I want is 'content'. The script I'm using is:

headers = {"authority": "cafe.bithumb.com",
    "path": "/boards/43/contents",
    "method": "POST",
    "origin":"https://cafe.bithumb.com",
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "accept-encoding":"gzip, deflate, br",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "accept":"application/json, text/javascript, */*; q=0.01",
    "referer":"https://cafe.bithumb.com/view/boards/43",
    "x-requested-with":"XMLHttpRequest",
    "scheme": "https",
    "content-length":"1107"}
s=requests.Session()
s.headers.update(headers)
r = s.post('https://cafe.bithumb.com/boards/43/contents')

1 Answer


You just need to compare the two POST payloads; you will find they are almost the same except for a few parameters (draw=page ... start=xx). That means you can scrape the AJAX data just by modifying draw and start.
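For reference, here is a minimal sketch of how two captured request bodies can be diffed to spot those parameters. The two payload strings below are placeholders for whatever you copy out of the dev tools or Fiddler:

from urllib.parse import parse_qs

# Hypothetical example: paste the two request bodies captured from the
# browser or Fiddler here; these short strings are placeholders only.
payload_page1 = "draw=1&start=0&length=30&search[value]="
payload_page2 = "draw=2&start=30&length=30&search[value]="

params1 = parse_qs(payload_page1, keep_blank_values=True)
params2 = parse_qs(payload_page2, keep_blank_values=True)

# Print only the keys whose values differ between the two requests.
for key in sorted(set(params1) | set(params2)):
    if params1.get(key) != params2.get(key):
        print(key, params1.get(key), "->", params2.get(key))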

Edit: the data was transformed into a dictionary, so we do not need to urlencode it, and we don't need cookies (I tested).

import requests
import json

headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Origin": "https://cafe.bithumb.com",
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "DNT": "1",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Referer": "https://cafe.bithumb.com/view/boards/43",
        "Accept-Encoding": "gzip, deflate, br"
    }

string = """columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=2&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=false&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=3&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=false&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=4&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=false&columns[4][search][value]=&columns[4][search][regex]=false&start=30&length=30&search[value]=&search[regex]=false"""


article_root = "https://cafe.bithumb.com/view/board-contents/{}"

for page in range(1,4):
    with requests.Session() as s:
        s.headers.update(headers)

        data = {"draw":page}
        data.update( { ele[:ele.find("=")]:ele[ele.find("=")+1:] for ele in string.split("&") } )
        data["start"] = 30 * (page - 1)

        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data, verify=False)  # verify=False is only needed while proxying through Fiddler

        json_data = json.loads(r.text).get("data")  # parse the JSON string into a dict so the rows are easier to extract
        for each in json_data:
            url = article_root.format(each[0])
            print(url)
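The long string above is just the DataTables column spec that the page posts with every request. If you prefer not to carry it around as one string, the same dictionary can be built programmatically; here is a sketch that should be equivalent to the parsed string:

def build_payload(page, page_size=30):
    # Rebuild the DataTables-style form data that the listing endpoint expects.
    data = {"draw": page,
            "start": page_size * (page - 1),
            "length": page_size,
            "search[value]": "",
            "search[regex]": "false"}
    for i in range(5):
        data.update({
            "columns[{}][data]".format(i): i,
            "columns[{}][name]".format(i): "",
            "columns[{}][searchable]".format(i): "true",
            "columns[{}][orderable]".format(i): "false",
            "columns[{}][search][value]".format(i): "",
            "columns[{}][search][regex]".format(i): "false",
        })
    return data

# Usage: r = s.post('https://cafe.bithumb.com/boards/43/contents', data=build_payload(page))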

3 Comments

Thanks @kcorlidy, your code works like a charm. I'm new to this, could you please elaborate: 1. How do I compare the two POST payloads? The code in my question was trying to replicate what I got from the Chrome dev tools, but I was not able to intercept the original request. 2. Where did you get data, please? 3. Your code fetches the content of the 2nd page; I've tried page = 0 but it does not work, how should I change the code? 4. It seems to take 3-4 sec to get the response compared to ~0.5 s within Chrome, is that normal? Sorry for the barrage of questions and thanks very much!
@Lampard 1 and 2: use Fiddler. 3: there is no page zero, page > 0. 4: it is normal; it takes 3-4 sec because you are setting up a new connection each time. I will edit my answer to show you what to do.
@Lampard I'm sorry, I hadn't accessed page 2 at first. I found the response depends on draw and start. I have edited my answer and compared it with the browser's response (it is the same).
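Regarding comment 4, part of those 3-4 seconds comes from opening a new Session (and therefore a new TLS connection) on every iteration of the loop in the answer. Here is a sketch of the same loop with the session hoisted outside it so the connection is reused; headers, string and article_root are the variables defined in the answer above:

import requests

with requests.Session() as s:
    s.headers.update(headers)  # same headers dict as in the answer
    for page in range(1, 4):
        data = {"draw": page}
        # same form string as above, split into key/value pairs
        data.update({ele[:ele.find("=")]: ele[ele.find("=") + 1:]
                     for ele in string.split("&")})
        data["start"] = 30 * (page - 1)
        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data)
        for each in r.json().get("data"):
            print(article_root.format(each[0]))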
