I'm using BeautifulSoup to scrape a webpage. I want to save each URL in a list, but the + operator is not working properly. This is the code:

for a in soup.find_all('a', class_="hotel_name_link url"):
    hotel_url = "https://www.booking.com" + a['href']
    hotels_url_list.append(hotel_url)

I have to do it this way because the a['href'] attribute only contains the file location on the server, not the whole URL (for example:

/hotel/es/aqua-aquamarina.es.html?label=gen173nr-1BCAEoggJCAlhYSDNYBGigAYgBAZgBCrgBB8gBDNgBAegBAZICAXmoAgM;sid=aa0d6c563b3d74f5432fb5d5b250eee4;ucfs=1;srpvid=2d5d1564170400e8;srepoch=1514343753;room1=A%2CA;hpos=15;hapos=15;dest_type=country;dest_id=197;srfid=198499756e07f93263596e1640823813c2ee4fe1X15;from=searchresults
;highlight_room=#hotelTmpl)

But when I print the results it displays the following:

[screenshot of the printed list: the concatenated URLs do not come out as expected]

What can I do to concatenate the URLs so that they come out correctly?

  • What is wrong with those URLs? They look OK. Did you try loading pages with them? Maybe it is only your IDE displaying them differently. Commented Dec 27, 2017 at 3:21
  • If I use strip() - "https://www.booking.com" + a['href'].strip() - then I get a URL which requests can read with status 200 (a sketch of this follows the comments). Commented Dec 27, 2017 at 3:44
  • Almost worked, but the last part of the URL is still out of place @furas Commented Dec 27, 2017 at 4:15
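
For reference, a minimal sketch of the strip() approach suggested in the comments, assuming soup and hotels_url_list are already defined as in the question:

for a in soup.find_all('a', class_="hotel_name_link url"):
    # strip() removes the stray whitespace/newlines embedded in the scraped href
    hotel_url = "https://www.booking.com" + a['href'].strip()
    hotels_url_list.append(hotel_url)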

1 Answer


You can use urljoin:

from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

hotel_url = urljoin("https://www.booking.com", a['href'])
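
As a minimal sketch, here is how this could fit into the loop from the question (assuming soup is built as in the question); the added .strip() follows the suggestion in the comments, in case the scraped href contains stray whitespace:

from urllib.parse import urljoin

hotels_url_list = []
for a in soup.find_all('a', class_="hotel_name_link url"):
    # resolve the relative href against the site root; strip() guards
    # against stray whitespace/newlines in the scraped attribute value
    hotel_url = urljoin("https://www.booking.com", a['href'].strip())
    hotels_url_list.append(hotel_url)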