I have a function that is called when I click a button. It is shown below:

var min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1";
function loadMoreNews(){
  $("#load-more-btn").hide();
  $("#load-more-gif").show();
  $.post("/en/ajax/more_news",{'category':'','news_offset':min_news_id},function(data){
      data = JSON.parse(data);
      min_news_id = data.min_news_id||min_news_id;
      $(".card-stack").append(data.html);
  })
  .fail(function(){alert("Error : unable to load more news");})
  .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
}
jQuery.scrollDepth();

I don't have much experience with JavaScript, but I assume it's returning some JSON data from some sort of API at "/en/ajax/more_news".

Is there a way I could call this API directly and get the JSON data from my Python script? If yes, how?

If not, how do I scrape the content that is being generated?

  • Use urllib2 to retrieve the data from the API, and json.loads to parse the JSON into a Python dictionary. Commented Jul 9, 2016 at 7:10
  • @Barmar What exactly do I need to send? Are you suggesting something like this: r = requests.post('http://inshorts.com/en/ajax/more_news', json={'category':'','news_offset':min_news_id}) Commented Jul 9, 2016 at 7:16
  • Yeah, that's pretty much it. Then use json.loads(r.text) to parse the JSON response; the 'html' key of the parsed result will contain the HTML from the response. Commented Jul 9, 2016 at 7:17
  • @Barmar I tried, but it just redirected me to the home page: import json; import requests; min_news_id = "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1"; r = requests.post('http://inshorts.com/en/ajax/more_news', json={'category':'','news_offset':min_news_id}); print(r.url) Commented Jul 9, 2016 at 7:32
  • Actually, it should probably be data=, not json=. Commented Jul 9, 2016 at 7:36
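
The data= versus json= distinction in the comments above matters because jQuery's $.post sends a plain object as a form-encoded body, not as JSON. The following is a minimal sketch, using only the standard library, of roughly what each option puts on the wire; it does not call requests itself, and the payload simply reuses the values from the question.

```python
import json
from urllib.parse import urlencode

payload = {"category": "", "news_offset": "68feb985-1d08-4f5d-8855-cb35ae6c3e93-1"}

# Roughly what requests sends with data=payload
# (Content-Type: application/x-www-form-urlencoded), matching jQuery's $.post
form_body = urlencode(payload)

# Roughly what requests sends with json=payload
# (Content-Type: application/json)
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

A server that expects form fields will not find news_offset in a JSON body, which would be consistent with the redirect reported above.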

2 Answers

You need to POST the news ID that appears inside the script to https://www.inshorts.com/en/ajax/more_news. This is an example using requests:

from bs4 import BeautifulSoup
import requests
import re

# pattern to extract min_news_id
patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    # fetch the landing page and find the inline script that defines min_news_id
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content, "html.parser")
    new_id_scr = soup.find("script", text=re.compile(r"var\s+min_news_id"))
    print(new_id_scr.text)
    # group(1) is only the captured id; group() would be the whole 'var min_news_id = "..."' match
    news_id = patt.search(new_id_scr.text).group(1)
    js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
    print(js.json())

js.json() gives you a dictionary containing all the HTML; you just have to access js.json()["html"].

5 Comments

It is giving an empty result. Output: {'html': '\n\n'}
You need to change the code like this: news_id = news_id.split('"') followed by js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset":news_id[1]}). In your code, news_id contains var min_news_id = "vxy8k83f-1", so I extract just the news ID value from it. Now it works properly.
@SalmanMohammad, just use patt.search(new_id_scr.text).group(1)
Yes, patt.search(new_id_scr.text).group(1) works. It gives the plain news ID, like vxy8k83f-1.
How much data will this get from the load-more results? Will it get only the one page that comes after clicking the load-more button, or will it iteratively fetch results from the load-more option?
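
The group() versus group(1) difference raised in these comments can be checked offline. This small sketch applies the answer's pattern to the script text quoted in the comments; the trailing semicolon in the sample string is an assumption about how the page writes the statement.

```python
import re

patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')
script_text = 'var min_news_id = "vxy8k83f-1";'  # sample text, as quoted in the comments

match = patt.search(script_text)
whole = match.group()     # the entire matched text, including 'var min_news_id ='
news_id = match.group(1)  # only the first capture group, i.e. the id itself

print(whole)    # var min_news_id = "vxy8k83f-1"
print(news_id)  # vxy8k83f-1
```

Posting the whole match as news_offset would explain the empty {'html': '\n\n'} response, since the server receives a string that is not a valid offset.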

Here is a script that will automatically loop through all the pages on inshorts.com:

from bs4 import BeautifulSoup
from newspaper import Article
import requests
import re

# pattern to extract min_news_id from the inline script
patt = re.compile(r'var min_news_id\s+=\s+"(.*?)"')

with requests.Session() as s:
    # the first offset comes from the inline script on the landing page
    soup = BeautifulSoup(s.get("https://www.inshorts.com/en/read").content, "lxml")
    new_id_scr = soup.find("script", text=re.compile(r"var\s+min_news_id"))
    news_id = patt.search(new_id_scr.text).group(1)

    while True:
        js = s.post("https://www.inshorts.com/en/ajax/more_news", data={"news_offset": news_id})
        jsn = js.json()
        cards = BeautifulSoup(jsn["html"], "lxml").find_all("div", {"class": "news-card"})
        for tag in cards:
            main_text = tag.find("div", {"itemprop": "articleBody"})
            summ_text = main_text.text.replace("\n", " ")
            result = tag.find("a", {"class": "source"})
            art_url = result.get("href")
            if "www.youtube.com" in art_url:
                print("Nothing")
                continue
            # drop the trailing character of the href, as the original script did
            art_url = art_url[:-1]
            article = Article(art_url)
            try:
                article.download()
                article.parse()
            except Exception:
                # skip articles that fail to download or parse
                continue
            article_text = article.text.replace("\n", " ")
            print(article_text + "\n")
            print(summ_text + "\n")

        # each response carries the offset for the next page;
        # stop when nothing new comes back instead of looping forever
        next_id = jsn.get("min_news_id")
        if not cards or not next_id or next_id == news_id:
            break
        news_id = next_id

It gives both the summary from inshorts.com and the complete news article from the respective news source.
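
The paging logic can also be looked at separately from the scraping. Below is a minimal sketch of following min_news_id from one response to the next, assuming each response is a dictionary with "html" and, optionally, "min_news_id" keys as in the JavaScript from the question. The names iter_pages, fake_fetch, FAKE_PAGES and max_pages are hypothetical, and the fake data stands in for the real endpoint so the loop can be tried offline.

```python
def iter_pages(fetch_page, first_offset, max_pages=50):
    """Yield each page's HTML, following min_news_id until the results run out."""
    offset = first_offset
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop forever
        page = fetch_page(offset)
        html = page.get("html", "").strip()
        if not html:
            return  # an empty page such as '\n\n' means there is nothing more
        yield html
        next_offset = page.get("min_news_id")
        if not next_offset or next_offset == offset:
            return  # no new offset, so stop rather than refetch the same page
        offset = next_offset


# Stand-in for the real endpoint (hypothetical data).
FAKE_PAGES = {
    "a-1": {"html": "<div>page 1</div>", "min_news_id": "b-1"},
    "b-1": {"html": "<div>page 2</div>", "min_news_id": "c-1"},
    "c-1": {"html": "\n\n"},
}


def fake_fetch(offset):
    return FAKE_PAGES[offset]


pages = list(iter_pages(fake_fetch, "a-1"))
print(len(pages))  # 2
```

With requests, fetch_page could be a small function that posts data={"news_offset": offset} to the more_news URL inside a session and returns the response's .json().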
