How can I extract the following links from HTML source code in Python?

Question

Here is some of my HTML source code:

<div class="s">
   <div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&amp;h=912&amp;w=1140&amp;tbnid=10DzCgmImE0jM&amp;tbnh=201&amp;tbnw=251&amp;usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&amp;docid=0vImrzSjsr5zQM"
         data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
         ping="/urlsa=t&amp;source=web&amp;rct=j&amp;url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&amp;ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
      </a>
   </div>
</div>

What I want to extract is the link that follows imgurl= in: <a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&

so that the output looks like this:

https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg

What I tried in Python is:

sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&amp;imgrefurl="',sourceCode)
for i in links:
    print(i)

Andrej Kesely · Accepted Answer · 2019-07-02 07:16:33Z

2

A better way than parsing the query string with a regex is to use the parse_qs function from urllib.parse. It is safer, and you get exactly the value you want without regex fiddling:

data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&amp;h=912&amp;w=1140&amp;tbnid=10DzCgmImE0jM&amp;tbnh=201&amp;tbnw=251&amp;usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&amp;docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&amp;source=web&amp;rct=j&amp;url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&amp;ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''

from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

soup = BeautifulSoup(data, 'lxml')

d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)

print(q['imgurl'])

Prints:

['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
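Note that parse_qs returns a dict whose values are lists, which is why the output above is wrapped in brackets. A minimal sketch of getting the plain string instead, assuming a made-up example.com href in place of the real one:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href, already HTML-unescaped (plain "&" instead of "&amp;").
href = '/imgres?imgurl=https://example.com/a.jpg&imgrefurl=https://example.com/a/&h=912&w=1140'

query = parse_qs(urlparse(href).query)

# Each value is a list because a parameter may repeat; take the first entry.
first = query['imgurl'][0]
print(first)

# .get() with a default avoids a KeyError when the parameter is absent.
missing = query.get('nope', [None])[0]
print(missing)
```

Indexing with [0] is enough here because each href carries a single imgurl parameter.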

edited Jul 2, 2019 at 7:16

answered Jul 2, 2019 at 6:48

Andrej Kesely


4 Comments

sodmzs Over a year ago

Hi, your code works fine with the snippet above, but when I run it on the whole source code it gives me the following error: print(q['imgurl']) KeyError: 'imgurl'

Andrej Kesely Over a year ago

@sodmzs You need to select the right element from the soup. I updated my code.

sodmzs Over a year ago

Thanks, it's working now, but it only shows me one link even though the source code contains multiple links inside the same kind of tags as the snippet in my question.

Andrej Kesely Over a year ago

@sodmzs Then you need to use the select() method instead of select_one(), and use a for-loop to go through all the links it returns.
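Following the last comment, a loop over select() might look like the sketch below. The example.com anchors are made-up stand-ins for the real page, and 'html.parser' is used here only so the snippet does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

# Shortened, hypothetical stand-in for the page: two result anchors and one unrelated link.
data = '''
<div class="s"><a href="/imgres?imgurl=https://example.com/a.jpg&amp;imgrefurl=https://example.com/a/&amp;h=912"></a></div>
<div class="s"><a href="/imgres?imgurl=https://example.com/b.png&amp;imgrefurl=https://example.com/b/&amp;h=600"></a></div>
<a href="/about">About</a>
'''

soup = BeautifulSoup(data, 'html.parser')

image_urls = []
# select() returns every matching anchor; select_one() returns only the first.
for a in soup.select('a[href*="imgurl"]'):
    query = parse_qs(urlparse(a['href']).query)
    # .get() skips anchors without an imgurl parameter instead of raising KeyError.
    image_urls.extend(query.get('imgurl', []))

print(image_urls)
```

Because the value comes from parse_qs rather than a pattern ending in jpg, this also picks up images with other extensions.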

YusufUMS · Accepted Answer · 2019-07-02 06:45:32Z

0

If the problem is your regex, then I think you can try this one:

import re

# Raw string avoids invalid-escape warnings; ^ only matches at the very start of sourceCode
link = re.search(r'^https?://.*[\r\n]*[^.,:;]', sourceCode)
if link:
    print(link.group())

answered Jul 2, 2019 at 6:45

YusufUMS


gnalog · Accepted Answer · 2019-07-05 06:11:46Z

0

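Perhaps you need to escape the '?', which is otherwise a regex quantifier. The pattern also should not expect a closing quote right after imgrefurl=, because the URL continues there. Try this:

import re

# \? matches the literal "?"; the group stops at the first &amp;imgrefurl=
links = re.findall(r'href="/imgres\?imgurl=(.*?)&amp;imgrefurl=', sourceCode)
for i in links:
    print(i)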
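To see why the escape matters, here is a small check on a made-up anchor. The example.com URL is an assumption, and both patterns stop at &amp;imgrefurl= rather than at a closing quote, so the only difference between them is the backslash.

```python
import re

# Hypothetical raw HTML, with "&amp;" still escaped as it appears in page source.
source_code = '<a href="/imgres?imgurl=https://example.com/a.jpg&amp;imgrefurl=https://example.com/a/&amp;h=912">'

# Unescaped: "s?" means "an optional s", so the literal "?" in the URL is never matched.
unescaped = re.findall(r'href="/imgres?imgurl=(.*?)&amp;imgrefurl=', source_code)

# Escaped: "\?" matches the literal question mark.
escaped = re.findall(r'href="/imgres\?imgurl=(.*?)&amp;imgrefurl=', source_code)

print(unescaped)  # []
print(escaped)    # ['https://example.com/a.jpg']
```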

edited Jul 5, 2019 at 6:11

answered Jul 5, 2019 at 2:55

gnalog
