
I need to scrape the entire HTML from journal_url, which for the purposes of this example will be http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues . I have followed the requests examples from several questions on this site, but I am not getting the correct HTML back from either the .text or the .json() method of the requests.get response. My goal is to display the whole HTML, including the ordered list underneath each year and volume pull-down.

import requests
import pandas as pd

# df is a pandas DataFrame, loaded elsewhere, with one row per journal
for i in range(len(df)):
    journal_name = df.loc[i, "Journal Full Title"]
    journal_url = df.loc[i, "URL"] + "/issues"
    access_start = df.loc[i, "Content Start Date"]
    access_end = df.loc[i, "Content End Date"]

    headers = {"X-Requested-With": "XMLHttpRequest",
               "User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}

    r = requests.get(journal_url, headers=headers)

    response = r.text
    print(response)
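
As a sanity check on the response itself (a minimal sketch against the same URL, not part of the original post): .json() only applies to JSON response bodies and raises a ValueError on an HTML page, so .text is the accessor to inspect here.

import requests

r = requests.get("http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues")
print(r.status_code)                   # confirm the request succeeded
print(r.headers.get("Content-Type"))   # this page serves HTML, not JSON

html = r.text                          # .text returns the decoded HTML body
try:
    r.json()                           # .json() parses the body as JSON...
except ValueError as exc:              # ...and fails on HTML (JSONDecodeError subclasses ValueError)
    print("Not a JSON body:", exc)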
  • Why do you have "X-Requested-With" in headers? Commented Aug 18, 2017 at 21:27
  • From my understanding, this header tells the server we want the content as if it came from an XMLHttpRequest (stackoverflow.com/questions/28610376/…); see the sketch below for a quick way to test whether it matters here. Commented Aug 25, 2017 at 16:32
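
A quick way to check whether that header actually changes the response for this page (a hedged sketch; the server may simply ignore it):

import requests

url = "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues"
plain = requests.get(url)
ajax = requests.get(url, headers={"X-Requested-With": "XMLHttpRequest"})

# Identical lengths suggest the server ignores the header for this endpoint.
print(len(plain.text), len(ajax.text))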

1 Answer


If your ultimate goal is to parse the content you mentioned above from that page, then here it is:

import requests
from bs4 import BeautifulSoup

base_link = "http://onlinelibrary.wiley.com"
main_link = "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues"

def abacus_scraper(main_link):
    # Each a.issuesInYear anchor is one year in the pull-down list
    soup = BeautifulSoup(requests.get(main_link).text, "html.parser")
    for titles in soup.select("a.issuesInYear"):
        title = titles.select("span")[0].text
        title_link = titles.get("href")
        main_content(title, title_link)

def main_content(item, link):
    # Follow the relative year link and collect the issue titles beneath it
    broth = BeautifulSoup(requests.get(base_link + link).text, "html.parser")
    elems = [issue.text for issue in broth.select("div.issue a")]
    print(item, elems)

abacus_scraper(main_link)
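
One small design note on the same approach (a sketch, not part of the original answer): building the follow-up URL with urllib.parse.urljoin instead of string concatenation works whether the scraped href is relative or absolute, and a shared requests.Session reuses the connection across requests:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

main_link = "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues"
session = requests.Session()

def abacus_scraper(main_link):
    soup = BeautifulSoup(session.get(main_link).text, "html.parser")
    for anchor in soup.select("a.issuesInYear"):
        year = anchor.select("span")[0].text
        # urljoin resolves relative hrefs against the page URL
        main_content(year, urljoin(main_link, anchor.get("href")))

def main_content(item, link):
    broth = BeautifulSoup(session.get(link).text, "html.parser")
    print(item, [issue.text for issue in broth.select("div.issue a")])

abacus_scraper(main_link)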

3 Comments

I apologize for the delayed response, but this doesn'
"This doesn't" doesn't always clarify things. you meant, the data you are not after, right?
Sorry, that was an accidental post. This should get me to what I need. Thanks!
