Request multiple urls in Python scraping script

Question

I'm creating a web-scraper and am trying to request multiple urls that share the same url path except for a numbered id.

My code to scrape one url is as follows:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)

The url shares the same structure except for the company numbers. I've tried the following code to try and get it to scrape multiple pages, but without success:

import requests
from bs4 import BeautifulSoup as bs

pages = []

for i in range(11003058, 11003059, 00930291):
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')

names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]

print(names)

This is only giving me the first page (/11003058/officers), why is it not looping through them? Can anyone help?

bharatk · Accepted Answer · 2019-05-13 11:30:42Z

1

The following should resolve your problem.

The range() function returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops before the specified stop value. It does not take a list of values to visit.

Syntax:

 range(start, stop, step)

https://docs.python.org/3/library/functions.html#func-range
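To make the point concrete, here is a small sketch of what the loop in the question actually iterates over. The step is written as 930291 because 00930291 is not a valid integer literal in Python 3; the variable name ids_from_range is only illustrative.

```python
# The range() call from the question: start=11003058, stop=11003059, step=930291.
# The stop value is excluded, so only the start value is produced.
ids_from_range = list(range(11003058, 11003059, 930291))
print(ids_from_range)  # [11003058]

# With two arguments, range(start, stop) still stops before `stop`.
consecutive = list(range(11003058, 11003060))
print(consecutive)  # [11003058, 11003059]

# The third argument is the increment, not another value to visit.
stepped = list(range(0, 10, 3))
print(stepped)  # [0, 3, 6, 9]
```

This is why only /11003058/officers was requested: range() describes an arithmetic progression, so an arbitrary set of IDs has to come from a list or tuple instead.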

Replace your range() loop with a loop over a list of company IDs:

company_id = ["11003058","11003059","00930291"]

for i in company_id:
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)

Initialize soup as a list before iterating over the pages:

soup = [ ]

Then append each parsed page to the soup list:

for item in pages:
  page = requests.get(item)
  soup.append(bs(page.text, 'lxml'))

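Build and print the names list:

names = []
for items in soup:
    # Select the <h2> headings of appointments whose role contains "Director"
    h2Obj = items.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')
    for i in h2Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            # Keep only the text of the <a> link inside each heading
            if isinstance(tag, Tag) and tag.name == 'a':
                names.append(tag.text)

print(names)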

Output:

['MASRAT, Suheel', 'MARSHALL, Jack', 'SUTTON, Tim', 'COOMBES, John Frederick', 'BROWN, Alistair Stuart', 'COOMBES, Kenneth', 'LAFONT, Jean-Jacques Mathieu', 'THOMAS-KEEPING, Lindsay Charles', 'WILLIAMS, Janet Elizabeth', 'WILLIAMS, Roderick', 'WRAGG, Barry']

Add this import at the top of the script:

from bs4.element import Tag
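For reference, the steps above can also be folded into a single loop, so that each page is parsed and its names collected before the next request is made. This is only a sketch: the names officers_url, scrape_director_names and DIRECTOR_SELECTOR are hypothetical, the timeout value is an arbitrary choice, and the selector is copied unchanged from the question on the assumption that the page markup still matches it.

```python
# Selector taken verbatim from the question (assumed to still match the page markup).
DIRECTOR_SELECTOR = (
    '[class^=appointment]:not(.appointments-list)'
    ':has([id^="officer-role-"]:contains(Director)) h2'
)


def officers_url(company_id):
    """Build the officers page URL for one company ID (kept as a string)."""
    return f"https://beta.companieshouse.gov.uk/company/{company_id}/officers"


def scrape_director_names(company_ids):
    """Fetch each officers page and return all matching names in one list."""
    # Imported here so that officers_url() can be used on its own
    # without the third-party packages installed.
    import requests
    from bs4 import BeautifulSoup

    names = []
    for company_id in company_ids:
        response = requests.get(officers_url(company_id), timeout=10)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, "lxml")
        # Extend the shared list so names from earlier pages are not overwritten.
        names.extend(h2.get_text(strip=True) for h2 in soup.select(DIRECTOR_SELECTOR))
    return names
```

Calling scrape_director_names(["11003058", "11003059", "00930291"]) would then return one combined list. Because select() is called on each soup object inside the loop, there is no list of soups to iterate over afterwards.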

edited May 13, 2019 at 11:30

answered May 13, 2019 at 10:44

bharatk


5 Comments

MPN Over a year ago

Thanks for helping. This works, but only shows the urls that it's sourced. How can I incorporate the names= section, so that it prints the names rather than the pages url?

MPN Over a year ago

names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]

MPN Over a year ago

I've added the names=... just after the soup.append... on a new line and added print(names), but it comes up with this error: AttributeError: 'list' object has no attribute 'select'. What am I doing wrong :-(

bharatk Over a year ago

Because soup is now a list, you need to iterate over it one more time and call select() on each item.

MPN Over a year ago

You are an absolute star!! That works perfectly, thank you so much :-)

Ashargin · Accepted Answer · 2019-05-13 10:57:04Z

0

The syntax for range is range(start, stop, step). It loops from start to stop - 1 and increases by step each time. In your case stop equals start + 1, so it is only going to loop once, with the start value.

I suppose you just want to loop over those 3 company IDs (written as strings here so that the leading zeros are kept):

for i in ('11003058', '11003059', '00930291'):
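A short sketch of why the company IDs are safer as strings than as integers. It assumes the IDs are always eight characters wide, as in the URLs from the question; the variable names are illustrative only.

```python
company_ids = ["11003058", "11003059", "00930291"]

# An integer cannot carry leading zeros, so converting the ID loses them.
as_int = int("00930291")
print(as_int)  # 930291

# If an ID does arrive as a number, it has to be padded back to full width
# (assumed here to be 8 characters).
padded = str(as_int).zfill(8)
print(padded)  # 00930291

# Building the URLs directly from strings avoids the problem altogether.
urls = [
    f"https://beta.companieshouse.gov.uk/company/{company_id}/officers"
    for company_id in company_ids
]
print(urls[2])  # https://beta.companieshouse.gov.uk/company/00930291/officers
```

Python 3 also rejects 00930291 as an integer literal outright, so keeping the IDs as strings sidesteps both issues.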

edited May 13, 2019 at 10:57

answered May 13, 2019 at 10:50

Ashargin


ashishmishra · Accepted Answer · 2019-05-13 11:39:01Z

0

Range in loops: the loop always includes the start value and excludes the end value during iteration.

Try this:

import requests
from bs4 import BeautifulSoup as bs

company_ids = ['11003058', '11003059', '00930291']
pages = []
i = 0
while i < len(company_ids):
    # Index the list with [] and build one URL per company ID
    url = 'https://beta.companieshouse.gov.uk/company/' + company_ids[i] + '/officers'
    pages.append(url)
    i += 1

names = []
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
    # Collect names inside the loop so every page contributes, not just the last one
    names += [el.text.strip() for el in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]

print(names)

edited May 13, 2019 at 11:39

answered May 13, 2019 at 10:55

ashishmishra


2 Comments

MPN Over a year ago

Thank you for helping. I've tried your code, but it comes up with the error: File "444.py", line 6, in <module> while i<len(pages): NameError: name 'i' is not defined

ashishmishra Over a year ago

Oh, sorry, I forgot to define 'i'. Now you can run it.
