
I'm trying to scrape the src of an img, but the code I found returns many img src values, just not the one I want. I can't figure out what I'm doing wrong. I am scraping TripAdvisor at "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"

So this is the HTML snippet I'm trying to extract from:

 <div class="restaurants-detail-overview-cards-LocationOverviewCard__cardColumn--2ALwF"><h6>Placering og kontaktoplysninger</h6><span><div><span data-test-target="staticMapSnapshot" class=""><img class="restaurants-detail-overview-cards-LocationOverviewCard__mapImage--22-Al" src="https://trip-raster.citymaps.io/staticmap?scale=1&amp;zoom=15&amp;size=347x137&amp;language=da&amp;center=55.687988,12.596316&amp;markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"></span></div></span>

I want the code to return this sub-string from the src:

55.68799,12.596316

I have tried:

    import pandas as pd
    pd.options.display.max_colwidth = 200
    from urllib.request import urlopen
    from bs4 import BeautifulSoup as bs
    import re

    web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
    url = urlopen(web_url)
    url_html = url.read()

    soup = bs(url_html, 'lxml')
    soup.find_all('img')

    for link in soup.find_all('img'):
        print(link.get('src'))

The output is along the lines of this, but NOT the src that I need:

https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_primary.svg 
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
  • If you really want help, you should at least tell us which library you are using... I used PyQuery for web scraping and it always worked like a charm. Commented Aug 23, 2019 at 11:08
  • Thanks for your comment! I just updated the text. I'm using BeautifulSoup and urllib at the moment. Commented Aug 23, 2019 at 11:14
  • That's because the value you are looking for is not in the HTML that the URL returns; the map image is added client-side (see the quick check after these comments). Commented Aug 23, 2019 at 11:20
  • Thank you Kostas! Do you have an idea of how to get around it then, if it's not returned in the url? Commented Aug 23, 2019 at 11:26
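
A quick way to see what the comment means (a minimal sketch, assuming the map <img> is built in the browser rather than present in the server-rendered HTML; the substring checks are illustrative, not an official marker):

    from urllib.request import urlopen

    web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
    html = urlopen(web_url).read().decode("utf-8")

    # The static map <img> is injected by JavaScript, so its src should not
    # appear in the downloaded HTML ...
    print("trip-raster.citymaps.io" in html)   # expected: False
    # ... but the raw coordinates are embedded in the page's JSON payload,
    # which is what the answers below make use of.
    print('"coords"' in html)                  # expected: True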

2 Answers


You can do this with just requests and re. Only the coordinates part of the src varies with the location; the rest of the URL is a fixed template.

    import requests, re

    # The page embeds the coordinates in its JSON payload as "coords":"<lat>,<lng>"
    p = re.compile(r'"coords":"(.*?)"')
    r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')
    coords = p.findall(r.text)[1]  # take the second "coords" match on the page
    src = f'https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center={coords}&markers=icon:http://c1.tacdn.com//img2/maps/icons/component_map_pins_v1/R_Pin_Small.png|{coords}'
    print(src)
    print(coords)
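
If you want to guard against the page layout changing, a small defensive variant of the same idea (the regex and the choice of the second match are carried over from the answer above, not a documented TripAdvisor interface):

    import requests, re

    p = re.compile(r'"coords":"(.*?)"')
    r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')

    matches = p.findall(r.text)
    if not matches:
        raise ValueError('no "coords" entries found in the page source')
    # Prefer the second match (as in the answer above), fall back to the first.
    coords = matches[1] if len(matches) > 1 else matches[0]
    print(coords)  # e.g. 55.68799,12.596316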

1 Comment

This works perfectly for me! Exactly what I need. Thank you for your time, QHarr.

Selenium is a workaround. I tested it and it works like a charm. Here you are:

    from selenium import webdriver

    driver = webdriver.Chrome('chromedriver.exe')
    driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")
    links = driver.find_elements_by_xpath("//*[@src]")  # every element that has a src attribute
    urls = []
    for link in links:
        url = link.get_attribute('src')
        if '|' in url:
            urls.append(url.split('|')[1])  # keeps only the numbers you want, i.e. 55.68799,12.596316
        print(url)
    print(urls)

The result of the above: ['55.68799,12.596316']
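
Note that find_elements_by_xpath was removed in newer Selenium releases (4.x). A minimal sketch of the same loop using the By API (assumes Selenium 4 and that chromedriver is available on your PATH):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")

    urls = []
    for link in driver.find_elements(By.XPATH, "//*[@src]"):
        url = link.get_attribute('src')
        if '|' in url:
            urls.append(url.split('|')[1])  # keep only the coordinates, e.g. 55.68799,12.596316

    driver.quit()
    print(urls)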

If you haven't used Selenium before, you can find a webdriver here: https://chromedriver.storage.googleapis.com/index.html?path=2.46/

or here:

https://sites.google.com/a/chromium.org/chromedriver/downloads

1 Comment

Thank you so much Kostas! It works nicely. I haven't used selenium before, so I'll check it out. Have a nice weekend!
