I am using this script to scrape author information from ScienceDirect articles, but I get nothing back when I try to print the value.

import requests
from bs4 import BeautifulSoup
from urllib import urlopen
import csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urlopen(url)   
        soup = BeautifulSoup(site, "lxml")
        for item in soup.find_all("div", {"class": "AuthorGroups"}):
            final = item.text,url
            print final

In urls.txt I used these two URLs: https://www.sciencedirect.com/science/article/pii/009286749290520M and https://www.sciencedirect.com/science/article/pii/0092867495903682

2 Comments
  • Does it scrape other fields from ScienceDirect, or does it work with other links in the text file? It could be that ScienceDirect doesn't allow scraping. Commented Dec 7, 2018 at 8:16
  • I am not able to fetch anything from ScienceDirect, but the same program works for other journals. I get none when I try to print the value, even though the element can be found in 'Inspect Element'. Commented Dec 7, 2018 at 9:19

1 Answer

If BeautifulSoup does not return the expected value, look at the HTML response the server actually sent.
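One way to do that check is to print the status code and the first few hundred characters of the body, and to test whether the class name seen in 'Inspect Element' appears in the raw HTML at all. The following is only a sketch in Python 3 using the standard library. The helper names `fetch_raw` and `describe`, the 30-second timeout, and the 300-character preview are assumptions made for illustration, not part of the original script.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_raw(url, timeout=30):
    """Return (status_code, body_text) for url, including error responses."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # 4xx/5xx responses still carry a body, which often explains the refusal.
        return err.code, err.read().decode("utf-8", errors="replace")


def describe(status, body, marker="AuthorGroups", preview_chars=300):
    """Summarise a response: status, whether the marker is present, and a preview."""
    return {
        "status": status,
        "has_marker": marker in body,
        "preview": body[:preview_chars],
    }
```

Calling `describe(*fetch_raw(url))` for one of the URLs in urls.txt shows what the parser is really being given. If `has_marker` is false, the problem lies in the response rather than in the `find_all` call.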

Your request is blocked because it does not set a proper User-Agent header.

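# ...
# Send a browser-like User-Agent with every request.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}
for url in urls:
    print url
    # .text is the decoded HTML body; pass it to BeautifulSoup instead of urlopen(url).
    site = requests.get(url, headers=headers).text
    # ...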
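For reference, the fragment above can be combined with the loop from the question into one script. This is a sketch under stated assumptions, not the answerer's exact code: it is written for Python 3 (so `print` is a function and the `reload(sys)` lines are dropped), it uses the built-in "html.parser" instead of "lxml" to avoid an extra dependency, and the function names `extract_authors` and `scrape` and the 30-second timeout are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent string taken from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"
}


def extract_authors(html):
    """Return the text of every <div class="AuthorGroups"> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        div.get_text(" ", strip=True)
        for div in soup.find_all("div", class_="AuthorGroups")
    ]


def scrape(url_file="urls.txt"):
    """Yield (authors_text, url) for each URL listed in url_file."""
    with open(url_file) as inf:
        urls = [line.strip() for line in inf if line.strip()]
    for url in urls:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()  # fail loudly if the server still refuses
        for authors in extract_authors(resp.text):
            yield authors, url
```

A loop such as `for authors, url in scrape("urls.txt"): print(authors, url)` then prints one line per author block. Keeping the parsing in `extract_authors` separate from the network call makes it possible to test it on a saved HTML file.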
2 Comments

When I included the headers, it worked. Thank you.
You're welcome. If it solved your problem, please mark the answer as correct.
