
I'm trying to scrape two webpages with the following links:

https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074
https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482

I want to extract information about each house from these links. I use Selenium rather than BeautifulSoup alone because the pages are dynamic, and without rendering them BeautifulSoup does not see all of the HTML. I use the code below to try to achieve this.

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074',
'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']

def render_page(url):
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    driver.quit()
    return(r)

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return(re.sub(clean, '', text))
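
For illustration, the tag-stripping helper can be exercised on a static snippet (a minimal, self-contained sketch using the same non-greedy regex as above; the sample HTML is made up):

```python
import re

# Same approach as the question's helper: non-greedy match on
# anything between a '<' and the next '>'.
TAG_RE = re.compile(r'<.*?>')

def remove_html_tags(text):
    return TAG_RE.sub('', text)

# Stripping the tags from a span element leaves only its text content.
sample = '<span class="AdFeatures__item-value">82 m²</span>'
print(remove_html_tags(sample))  # → 82 m²
```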

houses_html_code = []
housing_data = []
address = []

# Loop through main pages, render them and extract code
for i in page_links: 
    html = render_page(str(i))
    soup = BeautifulSoup(html, "html.parser")
    houses_html_code.append(soup)

for i in houses_html_code:
    for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):
        housing_data.append(remove_html_tags(str(span_1)))

So in summary: I render the pages, get the page source, append each page source to a list, and search for a span class in the page sources of the two rendered pages.

However, my code returns the page source of the first link TWICE, practically ignoring the second link even though it renders each page (Firefox pops up for each page). See the output below.

Why is this not working? Sorry if the answer is obvious; I'm rather new to Python and this is my first time using Selenium.

['Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136',
'Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136']

1 Answer


You have a typo. Change:

for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):

to

for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):

But why do you create a new webdriver for each page? Why not do something like this:

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074', 'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']
driver = webdriver.Firefox()

def render_page(url):
    driver.get(url)
    ...

...
for i in houses_html_code:
    for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):
         housing_data.append(remove_html_tags(str(span_1)))

driver.quit()

Outputs:

['Lejlighed', '78 m²', '2', '2. sal', 'Nej', 'Nej', 'Nej', '-', 'Ubegrænset', 'Snarest', '5.300,-', '800,-', '15.900,-', '0,-', '22.000,-', '27/10-2018', '3864958', 'Lejlighed', '82 m²', '2', '5. sal', 'Nej', 'Ja', 'Nej', '-', 'Ubegrænset', 'Snarest', '8.542,-', '-', '25.626,-', '-', '34.168,-', '24/08-2018', '3775136']

2 Comments

Thank you a lot! Pretty obvious mistakes, now that you point them out, but very useful anyhow. I agree with initiating just one webdriver for the whole session.
You're welcome. Also, you could use housing_data.append(span_1.get_text()) instead of housing_data.append(remove_html_tags(str(span_1))) and do away with your remove_html_tags(text) function entirely.
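
The get_text() suggestion can be seen in isolation on a static snippet (a minimal sketch, assuming bs4 is installed; the sample HTML mimics the page's span markup):

```python
from bs4 import BeautifulSoup

# Two of the "AdFeatures__item-value" spans, as static HTML.
html = ('<span class="AdFeatures__item-value">82 m²</span>'
        '<span class="AdFeatures__item-value">2</span>')
soup = BeautifulSoup(html, "html.parser")

# get_text() returns each element's text content directly,
# so no regex-based tag stripping is needed.
values = [span.get_text()
          for span in soup.find_all('span', {"class": "AdFeatures__item-value"})]
print(values)  # → ['82 m²', '2']
```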
