
I'm trying to scrape two webpages with the following links:

https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074
https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482

I want to extract information about each house from these links. I use Selenium rather than BeautifulSoup alone because the pages are dynamic, and without rendering them BeautifulSoup does not see all of the HTML. I use the code below to try to achieve this.

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074',
'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']

def render_page(url):
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)
    r = driver.page_source
    driver.quit()
    return(r)

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return(re.sub(clean, '', text))
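
For illustration, the tag-stripping helper can be exercised on a static snippet (a minimal, self-contained sketch using the same non-greedy regex as above; the sample HTML is made up):

```python
import re

# Same approach as the question's helper: non-greedy match on
# anything between a '<' and the next '>'.
TAG_RE = re.compile(r'<.*?>')

def remove_html_tags(text):
    return TAG_RE.sub('', text)

# Stripping the tags from a span element leaves only its text content.
sample = '<span class="AdFeatures__item-value">82 m²</span>'
print(remove_html_tags(sample))  # → 82 m²
```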

houses_html_code = []
housing_data = []
address = []

# Loop through main pages, render them and extract code
for i in page_links: 
    html = render_page(str(i))
    soup = BeautifulSoup(html, "html.parser")
    houses_html_code.append(soup)

for i in houses_html_code:
    for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):
        housing_data.append(remove_html_tags(str(span_1)))

So in summary: I render the pages, get the page source, append each page source to a list, and search for a span class in the page sources of the two rendered pages.

However, my code returns the page source of the first link TWICE, practically ignoring the second link even though it renders each page (Firefox pops up for each page). See the output below.

Why is this not working? Sorry if the answer is obvious; I'm rather new to Python and this is my first time using Selenium.

['Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136',
'Lejlighed',
'82 m²',
'2',
'5. sal',
'Nej',
'Ja',
'Nej',
'-',
'Ubegrænset',
'Snarest',
'8.542,-',
'-',
'25.626,-',
'-',
'34.168,-',
'24/08-2018',
'3775136']

1 Answer


You have a typo. Change:

for span_1 in soup.findAll('span', {"class": "AdFeatures__item-value"}):

to

for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):

But why do you create a new webdriver for each page? Why not do something like this:

page_links=['https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-holstebro/id-5792074', 'https://www.boligportal.dk/lejebolig/dp/2-vaerelses-lejlighed-odense-m/id-5769482']
driver = webdriver.Firefox()

def render_page(url):
    driver.get(url)
    ...

...
for i in houses_html_code:
    for span_1 in i.findAll('span', {"class": "AdFeatures__item-value"}):
         housing_data.append(remove_html_tags(str(span_1)))

driver.quit()

Outputs:

['Lejlighed', '78 m²', '2', '2. sal', 'Nej', 'Nej', 'Nej', '-', 'Ubegrænset', 'Snarest', '5.300,-', '800,-', '15.900,-', '0,-', '22.000,-', '27/10-2018', '3864958', 'Lejlighed', '82 m²', '2', '5. sal', 'Nej', 'Ja', 'Nej', '-', 'Ubegrænset', 'Snarest', '8.542,-', '-', '25.626,-', '-', '34.168,-', '24/08-2018', '3775136']

2 Comments

Thank you a lot! Pretty obvious mistakes, now that you point them out, but very useful anyhow. I agree with initiating just one webdriver for the whole session.
You're welcome. Also, you could use housing_data.append(span_1.get_text()) instead of housing_data.append(remove_html_tags(str(span_1))) and do away with your remove_html_tags(text) function entirely.
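
The get_text() suggestion can be seen in isolation on a static snippet (a minimal sketch, assuming bs4 is installed; the sample HTML mimics the page's span markup):

```python
from bs4 import BeautifulSoup

# Two of the "AdFeatures__item-value" spans, as static HTML.
html = ('<span class="AdFeatures__item-value">82 m²</span>'
        '<span class="AdFeatures__item-value">2</span>')
soup = BeautifulSoup(html, "html.parser")

# get_text() returns each element's text content directly,
# so no regex-based tag stripping is needed.
values = [span.get_text()
          for span in soup.find_all('span', {"class": "AdFeatures__item-value"})]
print(values)  # → ['82 m²', '2']
```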
