
So, I'm new to Python and am working through an exercise in which I scrape the page numbers from a list of published papers at this URL.

When I inspect the element I want to scrape on the page, I find this HTML:

<div class="src">
        Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
    </div>

The part that I want to pull out is the text between the opening and closing div tags. This is what I attempted to write in order to do the job:

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("div class='src'")
for link in links:
    print 

I know this code is unfinished, and that's because I don't know where to go from here :/ Can anyone help me?

3 Comments
  • You want the tag text? Like: "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63" ? Commented Sep 16, 2016 at 20:30
  • Tip: Check the site's terms under Prohibited Uses of the Content: "(d) undertake any activity such as the use of computer programs that automatically download or export Content, commonly known as web robots, spiders, crawlers, wanderers or accelerators that may interfere with, disrupt or otherwise burden the JSTOR server(s)". Commented Sep 16, 2016 at 20:36
  • Did you actually try to read the documentation? crummy.com/software/BeautifulSoup/bs4/doc Commented Sep 16, 2016 at 22:44

2 Answers


An alternative to Tales Pádua's answer is this:

from bs4 import BeautifulSoup

html = """<div class="src">
    Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
</div>
<div class="src">
    Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23
</div>"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("div", class_="src")
for link in links:
    print link.text.strip()

This outputs:

Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23

This answer uses the class_ parameter, which is recommended in the documentation.
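For reference, here is a standalone snippet (no network access needed) showing three equivalent ways to select the same divs; class_ exists because class is a reserved word in Python and cannot be used as a keyword argument:

from bs4 import BeautifulSoup

html = '<div class="src">Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63</div>'
soup = BeautifulSoup(html, "html.parser")

# All three lookups return the same list of div tags.
print(soup.find_all("div", class_="src"))            # keyword shortcut
print(soup.find_all("div", attrs={"class": "src"}))  # explicit attrs dict
print(soup.select("div.src"))                        # CSS selector syntax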


If you are looking to get the page numbers, and everything follows the comma-separated format above, you can change the for loop to grab the last element of the split string:

print link.text.split(",")[-1].strip()

This outputs:

pp. 53-63
pp. 1-23
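If you want to run this end-to-end against the search page rather than a hard-coded string, a minimal sketch would look like the following (this assumes the live page still renders each citation in a div class="src", and mind the JSTOR terms-of-service caveat raised in the comments above):

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

for div in soup.find_all("div", class_="src"):
    parts = div.text.strip().split(",")
    if parts:  # guard against an empty or unexpectedly formatted div
        print(parts[-1].strip())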

1 Comment

class_ is recommended for CSS classes, e.g. div class="foo bar".

If I understand you correctly, you want the page numbers inside all divs with class="src".

If so, then you need to do:

import requests
import re
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('div', {'class': 'src'})
for link in links:
    # matches e.g. "pp. 53-63"; the dot is escaped so it only matches a literal period
    pages = re.search(r'(pp\.\s*\d+-\d+)', link.text)
    if pages:
        print pages.group(1)

Note that I have used a regex to get the page numbers. This may look strange to people unfamiliar with regular expressions, but I think it's more elegant than using string operations like strip and split.
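As a quick sanity check, here is the same pattern run against the two sample strings from the other answer (standalone, no network access needed):

import re

samples = [
    "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63",
    "Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23",
]

for text in samples:
    match = re.search(r'(pp\.\s*\d+-\d+)', text)
    print(match.group(1) if match else "no page range found")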

5 Comments

Oh, this is perfect. Thank you very much!
If I only wanted to print a segment of that string (say "pp. 53-63"), how would I write that into the code?
You could do print link.text[-9:] if this information is always at the end of the string.
Edited to include a good way to get the pages with regex.
More elegant than text.rsplit(None, 1)[1]?
