
This is my code to access a webpage, but I need to add two parameters to the URL: (1) the first is read from each line of a file, and (2) the second is a counter, so that I can keep requesting successive pages.

import urllib2
import json,os

f = open('codes','r')
for line in f.readlines():
    id = line.strip('\n')
    url = 'http://api.opencorporates.com/v0.2/companies/search?q=&jurisdiction_code=%s&per_page=26&current_status=Active&page=%d' 
    i = 0
    directory = id
    os.makedirs(directory)
    while True:
       i += 5
       req = urllib2.Request('%s%s%d' % (url,id, i))
       print req
       try:
          response = urllib2.urlopen('%s%s%d' % (url, id, i))
       except urllib2.HTTPError, e:
          break
       content = response.read()
       fo = str(i) + '.json'    
       OUTFILE = os.path.join(directory, fo)
       with open(OUTFILE, 'w') as f:
           f.write(content)

This keeps creating empty directories. I know something is wrong with how the URL parameters are filled in. How can I fix this?

  • I think your problem is in your call to Request. Off the top of my head, the string format looks wrong. Put the url you're requesting into a variable and print that and see what it looks like. Commented Dec 16, 2013 at 17:41
  • This is what it is printing: api.opencorporates.com/v0.2/companies/… It is appending the parameters at the end. Commented Dec 16, 2013 at 17:45
  • I'll add an answer then, I see exactly what the problem is. Commented Dec 16, 2013 at 17:47

2 Answers

It looks like what you want to do is to insert id and i into url, but the string formatting you're using here concatenates url, id, and i. Try changing this:

req = urllib2.Request('%s%s%d' % (url,id, i))

Into this:

req = urllib2.Request(url % (id, i))

Does that give you the result you want?
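To see the difference without making any network request, here is a minimal sketch. The shortened URL and the values `us_ca` and `5` are made-up stand-ins, not taken from the question's data:

```python
# Hypothetical, shortened stand-in for the URL template in the question.
url = 'http://example.com/search?jurisdiction_code=%s&page=%d'
jurisdiction = 'us_ca'  # assumed example value for one line of the file
page = 5                # assumed example value for the counter

# Original approach: '%s%s%d' glues the three values end to end, so the
# placeholders inside url are never filled in.
concatenated = '%s%s%d' % (url, jurisdiction, page)

# Suggested approach: url itself is the format string.
substituted = url % (jurisdiction, page)

print(concatenated)
print(substituted)
```

The first line printed still contains the literal `%s` and `%d` with the values tacked on at the end, which matches what the comment under the question reported.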

Also, the %-style formatting you are using still works and is not formally deprecated, but newer code generally prefers the str.format syntax detailed in PEP 3101 -- Advanced String Formatting. So an alternative would be:

url = 'http://api.opencorporates.com/v0.2/companies/search?q=&jurisdiction_code={0}&per_page=26&current_status=Active&page={1}'
...
req = urllib2.Request(url.format(id, i))

Instead of %s and %d you use curly braces ({}) as placeholders for your parameters. Inside the curly braces, you can put the positional index of the argument, and the same index can be used more than once:

>>> 'I like to {0}, {0}, {0}, {1} and {2}'.format('eat', 'apples', 'bananas')
'I like to eat, eat, eat, apples and bananas'

If you just use bare curly braces (supported from Python 2.7 onwards), each placeholder consumes the next argument in order. Extra arguments are ignored, but supplying too few raises an error; e.g.:

>>> '{} and {} and {}'.format(1, 2, 3)
'1 and 2 and 3'
>>> '{} and {} and {}'.format(1, 2, 3, 4)
'1 and 2 and 3'
>>> '{} and {} and {}'.format(1, 2)

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    '{} and {} and {}'.format(1, 2)
IndexError: tuple index out of range

You can also use keyword arguments, and therefore dictionary unpacking:

>>> d = {'adj':'funky', 'noun':'cheese', 'pronoun':'him'}
>>> 'The {adj} {noun} intrigued {pronoun}.'.format(**d)
'The funky cheese intrigued him.'
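Applied to the URL from the question, named placeholders might look like the sketch below. The placeholder names `code` and `page`, and the value `us_ca`, are my own illustrative choices rather than anything the API requires:

```python
# The URL from the question, rewritten with named placeholders.
# The names 'code' and 'page' are arbitrary labels chosen for this example.
url_template = ('http://api.opencorporates.com/v0.2/companies/search'
                '?q=&jurisdiction_code={code}&per_page=26'
                '&current_status=Active&page={page}')

# Assumed example values: one line from the file and one counter value.
params = {'code': 'us_ca', 'page': 2}

full_url = url_template.format(**params)
print(full_url)
```

Named fields make it harder to swap the two arguments by accident, which is easy to do with a long query string.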

There are more features, detailed in the PEP, if you're interested.


You need to change these bits:

    '%s%s%d' % (url,id, i)

To this:

    url % (id, i)

What you're doing now is creating a string like '<url><id><i>' instead of substituting id and i into the placeholders inside url.
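As an alternative to filling in a template by hand, the query string can be built with the standard library's `urlencode`, which also escapes any special characters in the values. This is only a sketch: the helper name `build_url` and the example value `us_ca` are hypothetical, and the import fallback is there because the question uses Python 2:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2, as used in the question

BASE = 'http://api.opencorporates.com/v0.2/companies/search'

def build_url(jurisdiction_code, page):
    """Hypothetical helper: return the search URL for one jurisdiction and page."""
    # A list of pairs keeps the parameters in the same order as the question's URL.
    query = urlencode([
        ('q', ''),
        ('jurisdiction_code', jurisdiction_code),
        ('per_page', 26),
        ('current_status', 'Active'),
        ('page', page),
    ])
    return BASE + '?' + query

print(build_url('us_ca', 5))
```

The resulting string can be passed to `urllib2.urlopen` in place of the formatted one.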
