I know this has been asked before, but I am new to scraping and Python. Any help would be valuable for my learning.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page I am scraping (only the script part):

<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>

From the above, I want to get the value of min_news_id in Python. I also want to get the value of the same variable after it is updated at line 2.

Here is how I am doing it:

    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)

But I never get the value of the variable. Is the problem with my regular expression, or something else? Please help. Thank you in advance.

  • Some dynamic content is not rendered when scraping with BeautifulSoup. What you see in the browser and what your scraper gets can be markedly different. (You can export page.content and compare.) You'll need a different module, such as selenium or requests-html, that can handle dynamic content. Commented Nov 13, 2018 at 14:58
  • @Idlehands Thank you very much for the information. If you have any example reference please add it. Commented Nov 13, 2018 at 15:00
  • Can you share the URL? Commented Nov 13, 2018 at 15:24
  • inshorts.com/en/read/politics Commented Nov 13, 2018 at 15:26
  • When you use requests, is the JavaScript data ALWAYS there? Also, is the value you're looking for the one in your example above, d7zlgjdu-1? Commented Nov 13, 2018 at 15:37

3 Answers

html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>'''

import re

finder = re.findall(r'min_news_id = .*;', html)
print(finder)

Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

#2 Or strip the surrounding text from the first match:

print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:
d7zlgjdu-1

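#3 Or match the ID format directly (this assumes every ID is eight lowercase letters or digits, a hyphen, and one digit):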

finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)   

Output:
['d7zlgjdu-1'] 
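
As an alternative to the replace chain in #2, a capture group can pull the value out in one step. This is a minimal sketch, assuming the inline script has the shape shown in the question; the `sample` string and the `extract_min_news_id` helper are hypothetical names introduced here. It uses `re.search` rather than `re.match`, because `match` only succeeds at the very start of the string, and the script text begins with whitespace, which is likely why the pattern in the question never matched.

```python
import re

# Hypothetical sample standing in for the inline <script> text from the question.
sample = '''
    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
          min_news_id = data.min_news_id||min_news_id; // line 2
    }
'''

# Capture only the quoted value that follows the `var` declaration.
PATTERN = re.compile(r'var\s+min_news_id\s*=\s*"([^"]+)"')


def extract_min_news_id(text):
    """Return the initial min_news_id, or None if the declaration is absent."""
    found = PATTERN.search(text)  # search scans the whole string, not just its start
    return found.group(1) if found else None


print(extract_min_news_id(sample))
```

Note that this only reads the value present in the downloaded HTML; the assignment on "line 2" runs in the browser and cannot be observed this way.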

4 Comments

It doesn't handle the value of the variable once it has been updated.
What do you mean by handling the value? What are you trying to accomplish?
First I get the articles by opening the URL. The page has a "load more" button, so I want to trigger that request myself and get more articles. The form data for the HTTP request is category: politics, news_offset: afk0bz0p-1, and the URL for the HTTP POST request is https://inshorts.com/en/ajax/more_news
So that's more than your original question implies. What role does the variable you're scraping play in submitting the form data?

You can't monitor a JavaScript variable change using BeautifulSoup. Here is how to get the next page of news using a while loop, re, and JSON:

from bs4 import BeautifulSoup
import requests, re

page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'

htmlPage = requests.get(page_url).text
# extract the article summaries with BeautifulSoup
# page = BeautifulSoup(htmlPage, "html.parser")
# ...

# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1) # result: d7zlgjdu-1

customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....

    # new min_news_id
    min_news_id = ajax_response["min_news_id"]

    # remove this break to loop over all pages (possibly thousands)
    break
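
The loop above can also be wrapped in a generator with a page limit, so the `break` is no longer needed. This is a sketch, not a tested client for the site: `iter_more_news`, `fetch_page`, and `max_pages` are hypothetical names, and `fetch_page` is assumed to return a dict with the same `html` and `min_news_id` keys that the Ajax response above contains.

```python
def iter_more_news(fetch_page, first_offset, max_pages=5):
    """Yield the HTML of each Ajax page, following min_news_id from page to page.

    fetch_page(offset) is assumed to return a dict with "html" and
    "min_news_id" keys, like the JSON parsed in the loop above.
    """
    offset = first_offset
    for _ in range(max_pages):
        if not offset:
            break  # no further offset: nothing left to load
        response = fetch_page(offset)
        yield response["html"]
        new_offset = response.get("min_news_id")
        if new_offset == offset:
            break  # offset did not advance, so stop instead of refetching the same page
        offset = new_offset
```

With requests, `fetch_page` could be `lambda offset: requests.post(ajax_url, headers=customHead, data={'category': 'politics', 'news_offset': offset}).json()`, and each yielded string can be passed to BeautifulSoup as before. Passing the fetch step in as a callable keeps the paging logic separate from the network call.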

4 Comments

That's not hard in selenium: driver.execute_script("return min_news_id")
That returns the current value; it doesn't monitor the value for changes. It's not hard, though, if you watch for an element change.
Just put it in a loop with a sleep
I'm glad you agree :) Involving a browser adds overhead but it often simplifies the problem.
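
The "loop with a sleep" suggestion could look like the sketch below. `wait_for_change` and its parameters are hypothetical names introduced here; `read_value` is any zero-argument callable, which with Selenium would be `lambda: driver.execute_script("return min_news_id")` as in the first comment.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, interval=0.5):
    """Poll read_value() until it differs from old_value and return the new value.

    Returns None if the value has not changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_value()
        if current != old_value:
            return current
        if time.monotonic() >= deadline:
            return None  # gave up: the value never changed
        time.sleep(interval)
```

Polling like this is simple but only notices a change at the next check, so `interval` trades responsiveness against load on the browser.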

Thank you for the responses. I finally solved it using the requests package after reading its documentation.

Here is my code:

        if InShortsScraper.firstLoad:
            self.pattern = re.compile('var min_news_id = (.+?);')
        else:
            self.pattern = re.compile('min_news_id = (.+?);')
        page = None
        # print("Pattern: " + str(self.pattern))
        if news_offset is None:
            htmlPage = urlopen(url)
            page = bs(htmlPage, "html.parser")
        else:
            self.loadMore['news_offset'] = InShortsScraper.newsOffset
            # print("payload : " + str(self.loadMore))
            try:
                r = myRequest.post(
                    url = url,
                    data = self.loadMore
                )
            except TypeError:
                print("Error in loading")

            InShortsScraper.newsOffset = r.json()["min_news_id"]
            page = bs(r.json()["html"], "html.parser")
        #print(page)
        if InShortsScraper.newsOffset is None:
            scripts = page.find_all("script")
            for script in scripts:
                for line in script:
                    scriptString = str(line)
                    if "min_news_id" in scriptString:
                        finder = re.findall(self.pattern, scriptString)
                        InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()
