3

I'm creating a web-scraper and am trying to request multiple urls that share the same url path except for a numbered id.

My code to scrape one url is as follows:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)

The url shares the same structure except for the company numbers. I've tried the following code to try and get it to scrape multiple pages, but without success:

import requests
from bs4 import BeautifulSoup as bs

pages = []

for i in range(11003058, 11003059, 00930291):
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')

names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]

print(names)

This is only giving me the first page (/11003058/officers), why is it not looping through them? Can anyone help?

3 Answers 3

1

That should resolve your problems:

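The range() function returns a sequence of numbers that starts from 0 by default, increments by 1 by default, and stops before a specified number (the stop value itself is not included). The third argument is the step, not another value to visit.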

Syntax:

 range(start, stop, step)

https://docs.python.org/3/library/functions.html#func-range
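To see why the original loop ran only once, here is a small sketch of what that range() call produces. The step is written as 930291 as an assumption, because a literal with leading zeros such as 00930291 is not valid in Python 3.

```python
# The asker's call: start=11003058, stop=11003059, step=930291.
# Only values below stop are produced, so just the start value appears.
ids = list(range(11003058, 11003059, 930291))
print(ids)  # [11003058]

# A contiguous block of ids needs stop = last + 1 and the default step of 1.
block = list(range(11003058, 11003061))
print(block)  # [11003058, 11003059, 11003060]
```

range() only helps when the ids are evenly spaced; for three unrelated company numbers, a plain list is the simpler choice.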

Replace your code with:

company_id = ["11003058","11003059","00930291"]

for i in company_id:
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)
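Keeping the ids as strings matters because a company number such as 00930291 has leading zeros that an integer cannot carry. A minimal sketch of the URL-building step follows; build_officer_urls is a hypothetical helper name used only for illustration.

```python
BASE_URL = 'https://beta.companieshouse.gov.uk/company/'


def build_officer_urls(company_ids):
    """Return the officers-page URL for each company number."""
    return [BASE_URL + str(company_id) + '/officers' for company_id in company_ids]


pages = build_officer_urls(["11003058", "11003059", "00930291"])
for url in pages:
    print(url)

# Converting the id to an integer drops the leading zeros:
print(str(int("00930291")))  # 930291
```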

Initialize soup as a list before iterating over the pages:

soup = [ ]

And append in soup list:

for item in pages:
  page = requests.get(item)
  soup.append(bs(page.text, 'lxml'))

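Build the names list and print it:

names = []
for items in soup:
    # One parsed page per item; select the headings of director appointments
    h2Obj = items.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')
    for i in h2Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            # Keep only the <a> tags, which hold the officer's name
            if isinstance(tag, Tag) and tag.name == 'a':
                names.append(tag.text)

print(names)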

Output:

['MASRAT, Suheel', 'MARSHALL, Jack', 'SUTTON, Tim', 'COOMBES, John Frederick', 'BROWN, Alistair Stuart', 'COOMBES, Kenneth', 'LAFONT, Jean-Jacques Mathieu', 'THOMAS-KEEPING, Lindsay Charles', 'WILLIAMS, Janet Elizabeth', 'WILLIAMS, Roderick', 'WRAGG, Barry']

Add this at the top of the script:

from bs4.element import Tag

5 Comments

Thanks for helping. This works, but only shows the urls that it's sourced. How can I incorporate the names= section, so that it prints the names rather than the pages url?
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
I've added the names=... just after the soup.append... on a new line and added print(names), but it comes up with this error: AttributeError: 'list' object has no attribute 'select'. What am I doing wrong :-(
Because soup is now a list, you need to iterate over it one more time and call select on each item.
You are an absolute star!! That works perfectly, thank you so much :-)
0

The syntax for range is range(start, stop, step). It loops from start to stop - 1 and increases by step each time. You're doing something weird here because in your case stop equals start + 1 so it is only going to loop once, with the start value.

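I suppose you just want to get those 3 urls. Iterate over them directly, written as strings so that the leading zeros of 00930291 are kept (a literal like 00930291 is also not valid in Python 3):

for i in ('11003058', '11003059', '00930291'):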

0

Range in loops: the loop always includes the start value and excludes the end value during iteration.

Try this:

import requests
from bs4 import BeautifulSoup as bs

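company_ids = ['11003058', '11003059', '00930291']
pages = []
i = 0
while i < len(company_ids):
    # Index the list with [] and keep the urls in a separate list
    url = 'https://beta.companieshouse.gov.uk/company/' + company_ids[i] + '/officers'
    pages.append(url)
    i += 1  # advance the counter, otherwise the loop never ends

names = []
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
    # Collect the names inside the loop so every page contributes, not just the last
    names.extend(tag.text.strip() for tag in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2'))

print(names)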

2 Comments

Thank you for helping. I've tried your code, but it comes up with the error: File "444.py", line 6, in <module> while i<len(pages): NameError: name 'i' is not defined
Oh, sorry, I forgot to define 'i'. Now you can run it.
