Python URLs in file Requests

Question

I have a problem with my Python script, which should scrape the same content from every website in a list. I have a file with a lot of URLs, and I want Python to loop over them and pass each one to requests.get(url). After that, I write the output to a file named 'somefile.txt'.

I have the following Python script (version 2.7 - Windows 8):

from lxml import html
import requests

urls = ('URL1',
'URL2',
'URL3'
    )

for url in urls:
    page = requests.get(url)


tree = html.fromstring(page.text)

visitors = tree.xpath('//b["no-visitors"]/text()')

print 'Visitors: ', visitors

f = open('somefile.txt', 'a')
print >> f, 'Visitors:', visitors  # or f.write('...\n')
f.close()

As you can see, I have not included the file with the URLs in the script. I tried many tutorials but failed. The filename would be 'urllist.txt'. The current script only returns the data from URL3; ideally I want to get the data for every URL in urllist.txt.

My attempt at reading the text file:

with open('urllist.txt', 'r') as f: #text file containing the URLS
     for url in f:
         page = requests.get(url)

Fair point. I added one of my attempts at the end.

– Andre de Vries, commented May 13, 2015 at 11:17

Martijn Pieters · Accepted Answer · 2015-05-13 11:35:43Z

1

Each line you read from the file still ends with a newline, so you'll need to remove it before passing the URL to requests.get():

with open('urllist.txt', 'r') as f: #text file containing the URLS
     for url in f:
         page = requests.get(url.strip())

The str.strip() call removes all leading and trailing whitespace (including tabs, newlines, and carriage returns) from the line.
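To see what stripping does to lines as they come out of a file, here is a minimal sketch. It uses an in-memory io.StringIO object as a stand-in for urllist.txt, and the helper name read_urls and the example.com URLs are made up for illustration. Skipping blank lines is an extra assumption, not something the original script does.

```python
import io

def read_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()  # drops leading/trailing spaces, tabs, \r and \n
        if url:             # assumption: blank lines should be ignored
            yield url

# Stand-in for open('urllist.txt'); note the \r\n, the blank line and the tab.
sample = io.StringIO(u"http://example.com/a\r\n\n  http://example.com/b\t\n")
urls = list(read_urls(sample))
```

Whitespace inside a line is left alone; only the ends are trimmed.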

Do make sure you then process page inside the loop; if you run the code that extracts the data outside the loop, all you'll get is the data from the last response you loaded. You may as well open the output file just once, in the same with statement, so Python closes it for you:

with open('urllist.txt', 'r') as urls, open('somefile.txt', 'a') as output:
     for url in urls:
         page = requests.get(url.strip())

         tree = html.fromstring(page.content)
         visitors = tree.xpath('//b["no-visitors"]/text()')
         print 'Visitors: ', visitors
         print >> output, 'Visitors:', visitors
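The shape of that loop can be tried without network access. The sketch below is only an illustration: scrape_all, fake_pages and the fetch/parse parameters are hypothetical names, and the "pages" are plain strings rather than real HTTP responses. It shows each URL being fetched, parsed and written within the same iteration, with the output opened once.

```python
import io

def scrape_all(url_lines, fetch, parse, output):
    """For every URL line: fetch, parse, and write the result immediately."""
    for line in url_lines:
        url = line.strip()
        if not url:
            continue  # assumption: skip blank lines
        visitors = parse(fetch(url))
        output.write(u"Visitors: {0}\n".format(visitors))

# Stand-ins so the sketch runs offline (hypothetical data, not real pages).
fake_pages = {u"URL1": u"10", u"URL2": u"20", u"URL3": u"30"}
url_file = io.StringIO(u"URL1\nURL2\nURL3\n")   # plays the role of urllist.txt
output = io.StringIO()                          # plays the role of somefile.txt

scrape_all(url_file, fake_pages.get, lambda body: body, output)
result = output.getvalue()
```

With the real libraries, fetch could be something like lambda url: requests.get(url).content and parse could wrap html.fromstring(...) and the xpath() call from the answer above.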

edited May 13, 2015 at 11:35

answered May 13, 2015 at 11:19


Werner Smit · Answer · 2015-05-13 11:23:03Z

0

You should either save each page in a separate variable, or perform all the computation inside the loop over the URL list.

Based on your code, by the time the page parsing happens, page will only contain the data for the last page fetched, since you are overwriting the page variable on each iteration.

Something like the following should append every page's info to the file:

for url in urls:
    page = requests.get(url)

    tree = html.fromstring(page.text)

    visitors = tree.xpath('//b["no-visitors"]/text()')

    print 'Visitors: ', visitors

    f = open('somefile.txt', 'a')
    print >> f, 'Visitors:', visitors  # or f.write('...\n')
    f.close()
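The first option mentioned above, keeping every page instead of overwriting one variable, might look roughly like this. It is a sketch under stated assumptions: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for requests.get(url).text so the example runs offline.

```python
def fetch_all(urls, fetch):
    """Return a list holding one response per URL, in order."""
    pages = []
    for url in urls:
        pages.append(fetch(url))  # keep every response, not just the last
    return pages

def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text
    return u"<b>page for {0}</b>".format(url)

urls = [u"URL1", u"URL2", u"URL3"]
pages = fetch_all(urls, fake_fetch)

# For contrast: reassigning one variable leaves only the final response.
for url in urls:
    page = fake_fetch(url)
```

After the second loop, page equals pages[-1], which is why the original script only reported URL3. Collecting into a list holds every response in memory, so for a long URL file, processing each page inside the loop is likely the lighter choice.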

answered May 13, 2015 at 11:23
