
I have a problem with a Python script that should scrape the same content from several websites. I have a file with a lot of URLs, and I want Python to loop over them and pass each one to requests.get(url). After that, I write the output to a file named 'somefile.txt'.

I have the following Python script (version 2.7 - Windows 8):

from lxml import html
import requests

urls = ('URL1',
'URL2',
'URL3'
    )

for url in urls:
    page = requests.get(url)


tree = html.fromstring(page.text)

visitors = tree.xpath('//b["no-visitors"]/text()')

print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()

As you can see, I have not included the file with the URLs in the script. I tried many tutorials but failed. The filename would be 'urllist.txt'. The current script only gets the data from URL3; ideally I want to get the data for every URL in urllist.txt.

My attempt at reading the text file:

with open('urllist.txt', 'r') as f: #text file containing the URLS
     for url in f:
         page = requests.get(url)
  • Fair point. I added one of my attempts at the end. Commented May 13, 2015 at 11:17

2 Answers

You'll need to remove the newline from your lines:

with open('urllist.txt', 'r') as f: #text file containing the URLS
     for url in f:
         page = requests.get(url.strip())

The str.strip() call removes all leading and trailing whitespace (including tabs, newlines and carriage returns) from the line; whitespace inside the line is left untouched.
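
To illustrate what strip() does to a line read from a file, here is a small sketch. The sample strings are made up for illustration, and the snippet should run unchanged on Python 2.7 and Python 3:

```python
# Lines read from a text file keep their line ending.
raw_line = 'http://example.com/page\r\n'   # hypothetical line from urllist.txt

clean_url = raw_line.strip()   # drops leading and trailing whitespace only

# Whitespace in the middle of the string is not removed.
padded = '  a b \t\n'.strip()
```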

Do make sure you then process page inside the loop; if you run the code that extracts the data outside the loop, all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the same with statement, so Python closes it for you:

with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
     for url in urls:
         page = requests.get(url.strip())

         tree = html.fromstring(page.content)
         visitors = tree.xpath('//b["no-visitors"]/text()')
         print 'Visitors: ', visitors
         print >> output, 'Visitors:', visitors
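
If urllist.txt might contain blank lines (a trailing empty line, for example), strip() turns them into empty strings, which are not valid URLs. A minimal sketch of a filter for that case, assuming a hypothetical helper named clean_urls that is not part of the code above:

```python
def clean_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()
        if url:          # skip lines that were blank or whitespace-only
            yield url
```

With this helper, the loop above could iterate over clean_urls(urls) instead of urls, and the strip() call inside requests.get() would no longer be needed.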

You should either save each page in a separate variable, or perform all the computation inside the loop over the URL list.

Based on your code, by the time the page parsing happens, page will only contain the data for the last page fetched, since you overwrite the page variable on each iteration.
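
The effect can be reproduced without any network access. In this sketch, fake_get is a made-up stand-in for requests.get() and the URL names are placeholders:

```python
# Stand-in for requests.get(): returns a label instead of a real response.
def fake_get(url):
    return 'response for ' + url

urls = ('URL1', 'URL2', 'URL3')

for url in urls:
    page = fake_get(url)      # each iteration replaces the previous value

last_only = page              # after the loop, only the URL3 result remains

# Doing the work inside the loop keeps every result.
all_pages = []
for url in urls:
    page = fake_get(url)
    all_pages.append(page)
```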

Something like the following should append all the pages' info.

for url in urls:
    page = requests.get(url)


    tree = html.fromstring(page.text)

    visitors = tree.xpath('//b["no-visitors"]/text()')

    print 'Visitors: ', visitors

    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()
