
So, I'm new to Python and am working through an exercise in which I scrape the page numbers from a list of published papers at this URL.

When I inspect the element I want to scrape on the page, I find this HTML:

<div class="src">
        Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
    </div>

The part that I want to pull out is the text between the opening and closing div tags. This is what I attempted to write in order to do the job:

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("div class='src'")
for link in links:
    print 

I know this code is unfinished, and that's because I don't know where to go from here :/ Can anyone help me?

3 Comments
  • You want the tag text? Like: "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63" ? Commented Sep 16, 2016 at 20:30
  • Tip: Check the site's terms under Prohibited Uses of the Content: "(d) undertake any activity such as the use of computer programs that automatically download or export Content, commonly known as web robots, spiders, crawlers, wanderers or accelerators that may interfere with, disrupt or otherwise burden the JSTOR server(s)". Commented Sep 16, 2016 at 20:36
  • Did you actually try to read the documentation? crummy.com/software/BeautifulSoup/bs4/doc Commented Sep 16, 2016 at 22:44

2 Answers


An alternative to Tales Pádua's answer is this:

from bs4 import BeautifulSoup

html = """<div class="src">
    Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
</div>
<div class="src">
    Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23
</div>"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("div", class_="src")
for link in links:
    print link.text.strip()

This outputs:

Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63
Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23

This answer uses the class_ parameter, which is recommended in the documentation.
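For reference, here is a standalone snippet (no network access needed) showing three equivalent ways to select the same divs; class_ exists because class is a reserved word in Python and cannot be used as a keyword argument:

from bs4 import BeautifulSoup

html = '<div class="src">Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63</div>'
soup = BeautifulSoup(html, "html.parser")

# All three lookups return the same list of div tags.
print(soup.find_all("div", class_="src"))            # keyword shortcut
print(soup.find_all("div", attrs={"class": "src"}))  # explicit attrs dict
print(soup.select("div.src"))                        # CSS selector syntax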


If you are looking to get the page numbers, and everything follows the comma-separated format above, you can change the for loop to grab the last element of the split string:

print link.text.split(",")[-1].strip()

This outputs:

pp. 53-63
pp. 1-23
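If you want to run this end-to-end against the search page rather than a hard-coded string, a minimal sketch would look like the following (this assumes the live page still renders each citation in a div class="src", and mind the JSTOR terms-of-service caveat raised in the comments above):

import requests
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

for div in soup.find_all("div", class_="src"):
    parts = div.text.strip().split(",")
    if parts:  # guard against an empty or unexpectedly formatted div
        print(parts[-1].strip())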

1 Comment

class_ is recommended for CSS classes, e.g. div class="foo bar".

If I understand you correctly, you want the page numbers inside all divs with class="src".

If so, then you need to do:

import requests
import re
from bs4 import BeautifulSoup

url = "http://www.jstor.org/action/doAdvancedSearch?c4=AND&c5=AND&q2=&pt=&q1=nuclear&f3=all&f1=all&c3=AND&c6=AND&q6=&f4=all&q4=&f0=all&c2=AND&q3=&acc=off&c1=AND&isbn=&q0=china+&f6=all&la=&f2=all&ed=2001&q5=&f5=all&group=none&sd=2000"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('div', {'class': 'src'})
for link in links:
    # matches e.g. "pp. 53-63"; the dot is escaped so it only matches a literal period
    pages = re.search(r'(pp\.\s*\d+-\d+)', link.text)
    if pages:
        print pages.group(1)

Note that I have used a regex to get the page numbers. This may look strange to people unfamiliar with regular expressions, but I think it's more elegant than using string operations like strip and split.
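As a quick sanity check, here is the same pattern run against the two sample strings from the other answer (standalone, no network access needed):

import re

samples = [
    "Foreign Affairs, Vol. 79, No. 4 (Jul. - Aug., 2000), pp. 53-63",
    "Other Book, Vol. 1, No. 1 (Jul. - Aug., 2000), pp. 1-23",
]

for text in samples:
    match = re.search(r'(pp\.\s*\d+-\d+)', text)
    print(match.group(1) if match else "no page range found")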

5 Comments

Oh, this is perfect. Thank you very much!
If I only wanted to print a segment of that string (say "pp. 53-63"), how would I write that into the code?
You could do print link.text[-9:] if this information is always at the end of the string.
Edited to include a good way to get the pages with regex.
More elegant than text.rsplit(None, 1)[1]?
