Scraping AJAX loaded content with python?

Question

I have a function that is called when I click a button. It is shown below:

var min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1";
function loadMoreNews(){
  $("#load-more-btn").hide();
  $("#load-more-gif").show();
  $.post("/en/ajax/more_news",{'category':'','news_offset':min_news_id},function(data){
      data = JSON.parse(data);
      min_news_id = data.min_news_id||min_news_id;
      $(".card-stack").append(data.html);
  })
  .fail(function(){alert("Error : unable to load more news");})
  .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();

I don't have much experience with JavaScript, but I assume it is returning some JSON data from some sort of API at "/en/ajax/more_news".

Is there a way I could call this API directly and get the JSON data from my Python script? If yes, how?

If not, how do I scrape the content that is being generated?

Use urllib2 to retrieve the data from the API, and json.loads to parse the JSON into a Python dictionary.
– Barmar, Commented Jul 9, 2016 at 7:10
@Barmar What exactly do I need to send? Are you suggesting something like this? r = requests.post('http://inshorts.com/en/ajax/more_news', json={'category':'','news_offset':min_news_id})
– A. Sam, Commented Jul 9, 2016 at 7:16
Yeah, that's pretty much it. Then use r.json() to parse the JSON response; the 'html' key of the result will contain the HTML from the response.
– Barmar, Commented Jul 9, 2016 at 7:17
@Barmar I tried, but it just redirected me to the home page:
import json
import requests
min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1"
r = requests.post('http://inshorts.com/en/ajax/more_news', json={'category':'','news_offset':min_news_id})
print(r.url)
– A. Sam, Commented Jul 9, 2016 at 7:32
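
One difference worth checking between the two requests: jQuery's $.post with a plain object sends a form-encoded body by default, while the json= argument of requests.post sends a JSON body (data= is the form-encoded counterpart). The sketch below only shows the two encodings using the standard library, with the payload copied from the question. Whether the body encoding is what caused the redirect is an assumption; the thread does not confirm it.

```python
import json
from urllib.parse import urlencode

# Payload copied from the JavaScript in the question.
payload = {"category": "", "news_offset": "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1"}

# Form-encoded body: what $.post sends by default, and what requests sends with data=...
form_body = urlencode(payload)

# JSON body: what requests sends with json=...
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```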

Padraic Cunningham · Accepted Answer · 2016-07-09 10:57:13Z

You need to post the news id that you see inside the script to https://www.inshorts.com/en/ajax/more_news. Here is an example using requests:

from bs4 import BeautifulSoup
import requests
import re

# pattern to extract the value of min_news_id from the inline script
patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    # fetch the page that embeds min_news_id in an inline script
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content, "html.parser")
    new_id_scr = soup.find("script", text=re.compile(r"var\s+min_news_id"))
    print(new_id_scr.text)
    # group(1) is the captured id; group() would return the whole "var ..." match
    news_id = patt.search(new_id_scr.text).group(1)
    js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
    print(js.json())

The response contains all the HTML; you just have to access js.json()["html"].
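
As a rough illustration of the response shape: per the JavaScript in the question, the endpoint returns a JSON object with an html field and a min_news_id field. The sketch below parses a made-up sample body with the standard library. The sample values are hypothetical; with requests, the same dictionary would come from js.json().

```python
import json

# Hypothetical response body, shaped like the fields the page's JavaScript reads.
sample_body = '{"min_news_id": "abc123-1", "html": "<div class=\\"news-card\\">Example</div>"}'

payload = json.loads(sample_body)          # same result js.json() gives on a real response
html_fragment = payload["html"]            # markup for the next batch of cards
next_offset = payload.get("min_news_id")   # offset to send with the next request

print(next_offset)
print(html_fragment)
```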

edited Jul 9, 2016 at 10:57

answered Jul 9, 2016 at 10:49

Padraic Cunningham

5 Comments

Salman Mohammad Over a year ago

It gives an empty result. Output: {'html': '\n\n'}

Salman Mohammad Over a year ago

You need to change the code like this:

news_id = news_id.split('"')
js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset":news_id[1]})

In your code, news_id holds var min_news_id = "vxy8k83f-1", so I just extract the news id value from it. Now it works properly.

Padraic Cunningham Over a year ago

@SalmanMohammad, just use patt.search(new_id_scr.text).group(1)

Salman Mohammad Over a year ago

Yes, patt.search(new_id_scr.text).group(1) works. It gives the plain news id, like vxy8k83f-1.
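
To make the difference concrete, here is a small offline sketch using the same pattern and the id quoted in this thread. The script text is a stand-in for what the page embeds, not a copy of it.

```python
import re

# Same pattern as in the answer, written as a raw string.
patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')

# Stand-in for the inline script text found on the page.
script_text = 'var min_news_id = "vxy8k83f-1";'

match = patt.search(script_text)
whole = match.group()     # entire matched text, including the variable name
only_id = match.group(1)  # just the first capture group, i.e. the id

print(whole)
print(only_id)
```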

Salman Mohammad Over a year ago

How far will it fetch data from the load more results? Will it get only the one page that comes after clicking the load more button, or will it iteratively get results from the load more option?
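
On one page versus many: a single post returns one batch, but each response carries a new min_news_id, so posting it back as news_offset should fetch the following batch, the same way the page's own JavaScript updates the variable. Here is a minimal sketch of that loop. The network call is replaced by a hypothetical fetch_page function backed by fake data so that it runs offline, and both stop conditions are assumptions.

```python
from typing import Callable, Dict, Iterator


def iter_pages(fetch_page: Callable[[str], Dict[str, str]],
               first_offset: str,
               max_pages: int = 10) -> Iterator[str]:
    """Yield the html of each batch, feeding min_news_id back as the next offset."""
    offset = first_offset
    for _ in range(max_pages):
        payload = fetch_page(offset)
        html = payload.get("html", "")
        if not html.strip():
            break  # assumed stop condition: an empty batch
        yield html
        new_offset = payload.get("min_news_id")
        if not new_offset or new_offset == offset:
            break  # assumed stop condition: the offset did not advance
        offset = new_offset


# Hypothetical stand-in for s.post(...).json(), backed by fake data.
FAKE_PAGES = {
    "id-0": {"min_news_id": "id-1", "html": "<div>batch 1</div>"},
    "id-1": {"min_news_id": "id-2", "html": "<div>batch 2</div>"},
    "id-2": {"html": "\n\n"},
}


def fake_fetch(offset: str) -> Dict[str, str]:
    return FAKE_PAGES[offset]


pages = list(iter_pages(fake_fetch, "id-0"))
print(pages)
```

With requests, fetch_page could be a small function that calls s.post on the more_news URL with data={"news_offset": offset} and returns the result of .json().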

Salman Mohammad · Accepted Answer · 2017-06-10 05:22:12Z

Here is a script that will automatically loop through all the pages on inshorts.com:

from bs4 import BeautifulSoup
from newspaper import Article
import requests
import re

patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    # read the initial offset from the inline script on the first page
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content, "lxml")
    new_id_scr = soup.find("script", text=re.compile(r"var\s+min_news_id"))
    news_id = patt.search(new_id_scr.text).group(1)

    while True:
        js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
        payload = js.json()
        soup = BeautifulSoup(payload["html"], "lxml")
        cards = soup.find_all("div", {"class": "news-card"})
        if not cards:
            break  # no cards returned, so stop instead of looping forever
        # the response carries the offset for the next batch
        news_id = payload.get("min_news_id") or news_id
        for tag in cards:
            main_text = tag.find("div", {"itemprop": "articleBody"})
            summ_text = main_text.text.replace("\n", " ")
            result = tag.find("a", {"class": "source"})
            art_url = result.get('href')
            if 'www.youtube.com' in art_url:
                print("Nothing")
            else:
                # drop the last character of the href before downloading
                art_url = art_url[:-1]
                article = Article(art_url)
                article.download()
                if article.is_downloaded:
                    article.parse()
                    article_text = article.text.replace("\n", " ")

                    print(article_text + "\n")
                    print(summ_text + "\n")

It gives both the summary from inshorts.com and the complete article from the respective news source.
