I'm trying to scrape two webpages with the following links:
https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074
https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482
I want to extract information about the house in each link. I load the pages with Selenium instead of fetching them directly, because the pages are dynamic and the static HTML does not contain all of the content. I then parse the rendered page source with BeautifulSoup, using the code below.
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

page_links = ['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074',
              'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']


def render_page(url):
    # Open the page in Firefox, wait for it to render, and return its source
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    driver.quit()
    return r


def remove_html_tags(text):
    # Strip anything that looks like an HTML tag
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)


houses_html_code = []
housing_data = []
address = []

# Loop through main pages, render them and extract code
for i in page_links:
    html = render_page(str(i))
    soup = BeautifulSoup(html, "html.parser")
    houses_html_code.append(soup)

for i in houses_html_code:
    for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):
        housing_data.append(remove_html_tags(str(span_1)))
So, in summary, I render the pages, get the page source of each, append each parsed page source to a list, and then search for a span class in the page sources of the two rendered pages.
However, my code returns the data from the first link TWICE, practically ignoring the second link, even though it renders each page (Firefox opens each page in turn). See the output below.
Why is this not working? Sorry if the answer is obvious. I'm rather new to Python and this is my first time using Selenium.
['Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136',
'Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136']
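
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the second loop iterates with `i`, but the search is run on `soup`, which still refers to whichever page was parsed last. Searching that one object on every pass would repeat a single page's values once per stored page. The example below reproduces the pattern without a browser. The two HTML strings, their values (`A1`, `B1`, ...), and the helper names `FeatureValueParser` and `extract_values` are hypothetical stand-ins for the rendered pages and for `soup.findAll`, and only the standard library is used.

```python
from html.parser import HTMLParser


class FeatureValueParser(HTMLParser):
    """Collect the text of every <span class="AdFeatures__item-value">."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._depth = 0  # greater than 0 while inside a matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "AdFeatures__item-value" in classes:
            self._depth += 1
            if self._depth == 1:
                self.values.append("")  # start a new value

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.values[-1] += data


def extract_values(html):
    parser = FeatureValueParser()
    parser.feed(html)
    return [value.strip() for value in parser.values]


# Hypothetical stand-ins for the two rendered page sources.
pages = [
    '<div><span class="AdFeatures__item-value">A1</span>'
    '<span class="AdFeatures__item-value">A2</span></div>',
    '<div><span class="AdFeatures__item-value">B1</span>'
    '<span class="AdFeatures__item-value">B2</span></div>',
]

# Pattern from the question: `last` plays the role of `soup`,
# which stays bound to the final page after the first loop ends.
for page in pages:
    last = page

buggy = []
for page in pages:
    buggy.extend(extract_values(last))  # ignores the loop variable

# Intended pattern: search the page the loop is currently visiting.
fixed = []
for page in pages:
    fixed.extend(extract_values(page))

print(buggy)
print(fixed)
```

If this is the cause, the corresponding change in the code above would be to call `i.findAll(...)` instead of `soup.findAll(...)` inside the second loop, or to extract the spans inside the first loop right after each page is parsed.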