
I'm trying to scrape an AJAX-loaded part of a webpage without executing the JavaScript. Using the Chrome dev tools, I found that the AJAX container pulls its content from a URL through a POST request, so I want to duplicate that request with the Python requests package. But strangely, using the header information given by Chrome, I always get a 400 error, and the same happens with the curl command copied from Chrome. So I'm wondering whether someone could kindly share some insights.

The website I'm interested in is here. In Chrome: Ctrl+Shift+I → Network → XHR, and the part I want is 'content'. The script I'm using is:

headers = {"authority": "cafe.bithumb.com",
    "path": "/boards/43/contents",
    "method": "POST",
    "origin":"https://cafe.bithumb.com",
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "accept-encoding":"gzip, deflate, br",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "accept":"application/json, text/javascript, */*; q=0.01",
    "referer":"https://cafe.bithumb.com/view/boards/43",
    "x-requested-with":"XMLHttpRequest",
    "scheme": "https",
    "content-length":"1107"}
s=requests.Session()
s.headers.update(headers)
r = s.post('https://cafe.bithumb.com/boards/43/contents')

1 Answer


You just need to compare the two POST payloads; you will find they are almost the same except for a few parameters (draw=page ... start=xx). That means you can scrape the AJAX data just by modifying draw and start.
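For reference, here is a minimal sketch of how two captured request bodies can be diffed to spot those parameters. The two payload strings below are placeholders for whatever you copy out of the dev tools or Fiddler:

from urllib.parse import parse_qs

# Hypothetical example: paste the two request bodies captured from the
# browser or Fiddler here; these short strings are placeholders only.
payload_page1 = "draw=1&start=0&length=30&search[value]="
payload_page2 = "draw=2&start=30&length=30&search[value]="

params1 = parse_qs(payload_page1, keep_blank_values=True)
params2 = parse_qs(payload_page2, keep_blank_values=True)

# Print only the keys whose values differ between the two requests.
for key in sorted(set(params1) | set(params2)):
    if params1.get(key) != params2.get(key):
        print(key, params1.get(key), "->", params2.get(key))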

Edit: the data was transformed into a dictionary, so we do not need to urlencode it, and we don't need cookies (I tested).

import requests
import json

headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Origin": "https://cafe.bithumb.com",
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "DNT": "1",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Referer": "https://cafe.bithumb.com/view/boards/43",
        "Accept-Encoding": "gzip, deflate, br"
    }

string = """columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=2&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=false&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=3&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=false&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=4&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=false&columns[4][search][value]=&columns[4][search][regex]=false&start=30&length=30&search[value]=&search[regex]=false"""


article_root = "https://cafe.bithumb.com/view/board-contents/{}"

for page in range(1,4):
    with requests.Session() as s:
        s.headers.update(headers)

        data = {"draw":page}
        data.update( { ele[:ele.find("=")]:ele[ele.find("=")+1:] for ele in string.split("&") } )
        data["start"] = 30 * (page - 1)

        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data, verify=False)  # verify=False is only needed while proxying through Fiddler

        json_data = json.loads(r.text).get("data")  # parse the JSON string into a dict so the rows are easier to extract
        for each in json_data:
            url = article_root.format(each[0])
            print(url)
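The long string above is just the DataTables column spec that the page posts with every request. If you prefer not to carry it around as one string, the same dictionary can be built programmatically; here is a sketch that should be equivalent to the parsed string:

def build_payload(page, page_size=30):
    # Rebuild the DataTables-style form data that the listing endpoint expects.
    data = {"draw": page,
            "start": page_size * (page - 1),
            "length": page_size,
            "search[value]": "",
            "search[regex]": "false"}
    for i in range(5):
        data.update({
            "columns[{}][data]".format(i): i,
            "columns[{}][name]".format(i): "",
            "columns[{}][searchable]".format(i): "true",
            "columns[{}][orderable]".format(i): "false",
            "columns[{}][search][value]".format(i): "",
            "columns[{}][search][regex]".format(i): "false",
        })
    return data

# Usage: r = s.post('https://cafe.bithumb.com/boards/43/contents', data=build_payload(page))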

3 Comments

Thanks @kcorlidy, your code works like a charm. I'm new to this, could you please elaborate: 1. How do I compare the two POST payloads? The code in my question was trying to replicate what I got from the Chrome dev tools, but I was not able to intercept the original request. 2. Where did you get data, please? 3. Your code fetches the content of the 2nd page; I've tried page = 0 but it does not work, how should I change the code? 4. It seems to take 3-4 sec to get the response compared to ~0.5 s within Chrome, is that normal? Sorry for the barrage of questions and thanks very much!
@Lampard 1 and 2: use Fiddler. 3: there is no page zero, page > 0. 4: it is normal; it takes 3-4 sec because you are setting up a new connection each time. I will edit my answer to show you what to do.
@Lampard I'm sorry, I hadn't accessed page 2 at first. I found the response depends on draw and start. I have edited my answer and compared it with the browser's response (it is the same).
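Regarding comment 4, part of those 3-4 seconds comes from opening a new Session (and therefore a new TLS connection) on every iteration of the loop in the answer. Here is a sketch of the same loop with the session hoisted outside it so the connection is reused; headers, string and article_root are the variables defined in the answer above:

import requests

with requests.Session() as s:
    s.headers.update(headers)  # same headers dict as in the answer
    for page in range(1, 4):
        data = {"draw": page}
        # same form string as above, split into key/value pairs
        data.update({ele[:ele.find("=")]: ele[ele.find("=") + 1:]
                     for ele in string.split("&")})
        data["start"] = 30 * (page - 1)
        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data)
        for each in r.json().get("data"):
            print(article_root.format(each[0]))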
