I am trying to scrape over 600 listings from a real estate website. The name, price, area and valueperm2 fields are mandatory and every page has them, so they were easy to scrape. Other features, such as the number of rooms, suites, garage spaces and tax amounts, are optional, so the elements returned by soup.findAll('h6',class_ ='mb-0 text-normal') vary in length and order.
I tried to build keys and values to store in the data dictionary, but when I added k2 and v2 I got an index out of range error, probably because some listings have only one optional feature. I thought about using len(soup.findAll('h6',class_ ='mb-0 text-normal')) to add those optional features conditionally.
import requests
from bs4 import BeautifulSoup

productlinks = []
baseurl = 'https://www.dfimoveis.com.br/'

# Collect the listing URLs from each results page
for x in range(1, 40):
    r = requests.get(f'https://www.dfimoveis.com.br/aluguel/df/todos/asa-norte/apartamento?pagina={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('li', class_='property-list__item')
    for item in productlist:
        for link in item.find_all('meta', itemprop='url'):
            productlinks.append(baseurl + link['content'])

# Visit each listing and extract its fields
for link in productlinks:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'lxml')
    # Mandatory fields, present on every listing
    name = soup.find_all('h1', class_='mb-0 font-weight-600 fs-1-5')[0].text.strip()
    price = soup.find_all('small', class_='display-5 text-warning')[2].text.strip()
    area = soup.find_all('small', class_='display-5 text-warning')[0].text.replace("m²", "").strip()
    valueperm2 = soup.find_all('small', class_='display-5 text-warning')[1].text.strip()
    # Optional fields, formatted as "label: value"
    k1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[0]
    v1 = soup.findAll('h6', class_='mb-0 text-normal')[0].text.replace('\r\n ', '').strip().split(':')[1].strip()
    k2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[0]
    v2 = soup.findAll('h6', class_='mb-0 text-normal')[1].text.replace('\r\n ', '').strip().split(':')[1].strip()
    data = {'name': name,
            'value': price,
            'area': area,
            'valueperm2': valueperm2,
            k1: v1,
            k2: v2
            }
When I run it, I get this output:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-74-6ee7d6edeb81> in <module>
9 v1 = soup.findAll('h6',class_ ='mb-0 text-normal')[0].text.replace('\r\n ','').strip().split(':')[1].strip()
10 k2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[0]
---> 11 v2 = soup.findAll('h6',class_ ='mb-0 text-normal')[1].text.replace('\r\n ','').strip().split(':')[1].strip()
12
13 ap = {'name':name,
IndexError: list index out of range
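
One thing that stands out in the traceback: the arrow points at the v2 line, not the k2 line, which suggests the second h6 was found but its text may not contain a ':' at all, so split(':')[1] is what fails. Either way, a rough sketch of what I think could work is to loop over however many h6 elements there are instead of indexing [0] and [1], and to use str.partition so a missing colon is skipped rather than raising. The helper name parse_features and the sample strings below are made up for illustration; in the real loop the input would be the .text of each element from soup.findAll('h6',class_ ='mb-0 text-normal').

```python
def parse_features(texts):
    """Turn 'label: value' strings into a dict, skipping any without a colon."""
    features = {}
    for text in texts:
        # Collapse the line breaks and padding that come with the raw HTML text.
        cleaned = " ".join(text.split())
        # partition never raises: sep is '' when there is no colon.
        key, sep, value = cleaned.partition(":")
        if sep:
            features[key.strip()] = value.strip()
    return features


# Hypothetical sample texts standing in for
# [h.text for h in soup.findAll('h6', class_='mb-0 text-normal')]
sample_texts = ["Quartos:\r\n 3", "Vagas: 1", "text without a colon"]

data = {"name": "example listing"}  # the mandatory fields would go here
data.update(parse_features(sample_texts))
print(data)
```

Because the keys come from the page itself, listings with different optional features simply end up with different keys, so the order and count of the h6 elements no longer matter.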