list index out of range while web scraping multiple pages

Question

I am trying to scrape over 600 listings from a real estate website. The name, price, area and valueperm2 fields are mandatory and appear on every page, so they were easy to scrape. Other features, such as the number of rooms, suites, garage spaces and tax prices, are optional, so the elements returned by soup.findAll('h6',class_ ='mb-0 text-normal') vary in number and order.

I tried creating keys and values to store in the data dictionary, but with k2 and v2 I got an index-out-of-range error, probably because some listings have only one optional feature. I thought about using len(soup.findAll('h6',class_ ='mb-0 text-normal')) to add those optional features conditionally.

import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1,40):
  r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
  soup = BeautifulSoup(r.content, 'lxml')
  productlist = soup.find_all('li', class_ = 'property-list__item')
  for item in productlist:
    for link in item.find_all('meta',itemprop = 'url'):
        productlinks.append(baseurl + link['content'])
for link in productlinks:
  r = requests.get(link)
  soup = BeautifulSoup(r.content, 'lxml')
  name = soup.find_all('h1', class_ = 'mb-0 font-weight-600 fs-1-5')[0].text.strip()
  price = soup.find_all('small', class_ = 'display-5 text-warning')[2].text.strip()
  area = soup.find_all('small', class_ = 'display-5 text-warning')[0].text.replace("m²","").strip()
  valueperm2 = soup.find_all('small', class_ = 'display-5 text-warning')[1].text.strip()
  k1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n                                            ','').strip().split(':')[0]
  v1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n                                            ','').strip().split(':')[1].strip()
  k2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n                                            ','').strip().split(':')[0]
  v2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n                                            ','').strip().split(':')[1].strip()
  data = {'name':name,
    'value':value,
    'area':area,
    'valueperm2':valueperm2,
     k1:v1,
     k2:v2
    }

and then I get this output:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-74-6ee7d6edeb81> in <module>
      9   v1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n                                            ','').strip().split(':')[1].strip()
     10   k2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n                                            ','').strip().split(':')[0]
---> 11   v2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n                                            ','').strip().split(':')[1].strip()
     12 
     13   ap = {'name':name,

IndexError: list index out of range

DrCorgi · Accepted Answer · 2022-07-31 03:55:07Z

I tried to run your code but could not reproduce the problem because I do not have 'baseUrl'.

However, you should be able to check the length of soup.findAll('h6',class_ ='mb-0 text-normal') before assigning the individual items of the list to the v1, k2, v2 (etc.) variables.

For example,

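results = soup.findAll('h6',class_ ='mb-0 text-normal')
# Only index the second element when the list really has two items
if len(results) >= 2:
  k1 = results[0].text.replace('\r\n                                            ','').strip().split(':')[0]
  v1 = results[0].text.replace('\r\n                                            ','').strip().split(':')[1].strip()
  k2 = results[1].text.replace('\r\n                                            ','').strip().split(':')[0]
  v2 = results[1].text.replace('\r\n                                            ','').strip().split(':')[1].strip()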

You will likely need to reorder or amend this to match the specific logic you are implementing, but code along these lines should work.
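
The length check can also be folded into a small helper so that each lookup is guarded on its own. This is only a sketch: `pick` is a hypothetical helper name, and the sample strings are invented stand-ins for the `.text` of each matched `h6` element.

```python
def pick(items, index, default=None):
    """Return items[index] when that position exists, otherwise default."""
    return items[index] if 0 <= index < len(items) else default


# Invented stand-ins for the text of each matched <h6> element.
one_feature = ["Quartos: 2"]
two_features = ["Quartos: 2", "Garagens: 1"]

second_of_one = pick(one_feature, 1)   # no second element, so the default is returned
second_of_two = pick(two_features, 1)  # second element exists

print(second_of_one)
print(second_of_two)
```

With a helper like this, a listing that has only one optional feature yields `None` for the second lookup instead of raising IndexError, and the caller can decide whether to skip it.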


1 Comment

bkraffa Over a year ago

Got it, thanks! Now I'm creating different dictionaries based on the number of optional fields each listing has, by checking the length of soup.findAll('h6',class_ ='mb-0 text-normal').

Mazhar · Accepted Answer · 2022-07-31 05:00:51Z

This error happens for the following reason:

You split the text on ':'
and expect the resulting list to have length 2 (indices 0 and 1).
'name:roy' -> ['name','roy'] works fine
'name' -> ['name'] has no index 1, which raises IndexError
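
One way to make the split itself safe is `str.partition`, which always returns three parts and so cannot raise IndexError. A minimal sketch, where `split_field` is a hypothetical helper name and the sample strings are made up:

```python
def split_field(text):
    """Split 'key: value' on the first ':'; return None when there is no ':'."""
    key, sep, value = text.partition(":")
    if not sep:  # no ':' found, so there is no value part
        return None
    return key.strip(), value.strip()


print(split_field("name:roy"))
print(split_field("name"))
print(split_field("time: 10:30"))
```

Because `partition` splits only on the first ':', a value that itself contains a colon is kept whole rather than discarded.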

A separate function for extracting the dynamic fields from the page is a better option, since it avoids repetitive code and unexpected errors.

def dynamic_portion(soup):
    temp_data = {}
    for item in soup.findAll('h6',class_ ='mb-0 text-normal'):
        item = item.text.split(':')
        if len(item)==2:
            key,val = map(str.strip,item)
            temp_data[key]=val
    return temp_data
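
To see how a loop like this copes with listings that have different numbers of optional fields, here is a sketch that works on plain strings rather than a live page. The sample texts are invented stand-ins for `item.text`, `parse_fields` is a hypothetical name, and `" ".join(text.split())` is swapped in as a general whitespace cleanup in place of the exact `.replace(...)` string used in the question.

```python
def parse_fields(texts):
    """Build a dict from 'key: value' strings, skipping anything malformed."""
    fields = {}
    for text in texts:
        cleaned = " ".join(text.split())  # collapse newlines and runs of spaces
        parts = cleaned.split(":")
        if len(parts) == 2:  # same guard as in dynamic_portion
            key, val = map(str.strip, parts)
            fields[key] = val
    return fields


# Invented examples: one listing with a single optional field, one with three.
listing_a = ["\r\n        Quartos:\r\n        2"]
listing_b = ["Quartos: 3", "Suítes: 1", "\r\n    Garagens: 2   "]

print(parse_fields(listing_a))
print(parse_fields(listing_b))
print(parse_fields(["no separator here"]))
```

Each listing simply ends up with as many keys as it has well-formed fields, so no fixed index is ever needed.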

You can integrate it into your code like this:

import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'
for x in range(1,40):
  r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
  soup = BeautifulSoup(r.content, 'lxml')
  productlist = soup.find_all('li', class_ = 'property-list__item')
  for item in productlist:
    for link in item.find_all('meta',itemprop = 'url'):
        productlinks.append(baseurl + link['content'])

for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    name = soup.find_all('h1', class_ = 'mb-0 font-weight-600 fs-1-5')[0].text.strip()
    value = 1  # placeholder; the question's code never defines 'value'
    price = soup.find_all('small', class_ = 'display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_ = 'display-5 text-warning')[0].text.replace("m²","").strip()
    valueperm2 = soup.find_all('small', class_ = 'display-5 text-warning')[1].text.strip()
    data = {'name':name,
            'value':value,
            'area':area,
            'valueperm2':valueperm2
            }
    temp_data = dynamic_portion(soup)
    data.update(temp_data)
    break  # stops after the first listing; remove this to process every link
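
Because each `data` dict can end up with a different set of keys, saving the results needs a step that aligns them. Here is one possible sketch using the standard library's `csv.DictWriter`. The `rows` list and its values are invented for illustration; in the loop above you would append each `data` dict to such a list.

```python
import csv
import io

# Invented records with differing optional keys, standing in for the data dicts.
rows = [
    {"name": "Apt A", "area": "50", "Quartos": "2"},
    {"name": "Apt B", "area": "70", "Quartos": "3", "Garagens": "1"},
]

# Collect every key seen across all rows, preserving first-seen order.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# In-memory file for the demo; a real run could use open("out.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)  # keys missing from a row are filled with restval

csv_text = buffer.getvalue()
print(csv_text)
```

The `restval` argument supplies the value for any column a row lacks, so listings without a given optional feature get an empty cell.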

Great code to deal with the dynamic portion of the data, cheers
