Here is part of my HTML source code:

<div class="s">
   <div class="th N3nEGc" style="height:48px;width:61px">
<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&amp;h=912&amp;w=1140&amp;tbnid=10DzCgmImE0jM&amp;tbnh=201&amp;tbnw=251&amp;usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&amp;docid=0vImrzSjsr5zQM"
         data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ"
         ping="/urlsa=t&amp;source=web&amp;rct=j&amp;url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&amp;ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">
      </a>
   </div>
</div>

What I want to extract is the image URL inside the link: <a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;

so that the output looks like this:

https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg

What I tried in Python:

sourceCode = opener.open(googlePath).read().decode('utf-8')
links = re.findall('href="/imgres?imgurl=(.*?)jpg&amp;imgrefurl="', sourceCode)
for i in links:
    print(i)

3 Answers

A better way than parsing the query string with a regex is to use the parse_qs function from urllib.parse. It is safer, and you get exactly the value you want without regex fiddling (see the urllib.parse documentation):

data = '''<div class="s"><div class="th N3nEGc" style="height:48px;width:61px"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&amp;h=912&amp;w=1140&amp;tbnid=10DzCgmImE0jM&amp;tbnh=201&amp;tbnw=251&amp;usg=K_YJsquLr4rorhW2ks8UdceQ8uKjg=&amp;docid=0vImrzSjsr5zQM" data-ved="2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ" ping="/urlsa=t&amp;source=web&amp;rct=j&amp;url=/imgres%3Fimgurl%3Dhttps://linuxhint.com/wpcontent/uploads/2018/12/11.jpg%26imgrefurl%3Dhttps://linuxhint.com/setup_screensaver_manjaro_linux/%26h%3D912%26w%3D1140%26tbnid%3D10DzCgmImE0jM%26tbnh%3D201%26tbnw%3D251%26usg%3DK_YJsquLr4rorhW2ks8UdceQ8uKjg%3D%26docid%3D0vImrzSjsr5zQM&amp;ved=2ahUKEwj3062g3pDjAhWZQN4KHS-_BL8Q8g0wC3oECAUQBQ">'''

from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

soup = BeautifulSoup(data, 'lxml')

d = urlparse(soup.select_one('a[href*="imgurl"]')['href'])
q = parse_qs(d.query)

print(q['imgurl'])

Prints:

['https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg']
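
Note that parse_qs returns a list for every key, because a query string may repeat a parameter. As a minimal sketch of the same step using only the standard library, assume the href has already been pulled out of the page as a raw string. The shortened `raw_href` value and the variable names below are illustrative, not taken from the answer:

```python
from html import unescape
from urllib.parse import urlparse, parse_qs

# Illustrative, shortened href as it appears in raw HTML (entities still encoded)
raw_href = (
    "/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg"
    "&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/"
    "&amp;h=912&amp;w=1140"
)

# Turn &amp; back into & so parse_qs splits the parameters correctly
href = unescape(raw_href)

query = parse_qs(urlparse(href).query)

# parse_qs maps each key to a list of values; take the first one
image_url = query["imgurl"][0]
print(image_url)
```

BeautifulSoup already decodes `&amp;` in attribute values, which is why the answer above can pass the href straight to urlparse.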

4 Comments

Hi, your code works fine with the snippet above, but when I run it on the whole source code it shows the following error: print(q['imgurl']) KeyError: 'imgurl'
@sodmzs You need to select the right element from the soup. I updated my code.
Thanks, it's working now, but it shows only one link even though the source code contains multiple links inside the same tags as the snippet in my question.
@sodmzs Then you need to use the select() method instead of select_one(), and loop over all the links it returns with a for loop.
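
The select() suggestion in the last comment could look roughly like the sketch below. The markup is a shortened, partly made-up sample (the second URL is invented for illustration), `extract_image_urls` is a hypothetical helper name, and the built-in `html.parser` backend is used here instead of `lxml` only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

# Illustrative markup: two image-result links plus one unrelated link
html = '''
<div class="s"><a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/">a</a></div>
<div class="s"><a href="/imgres?imgurl=https://example.com/second.jpg&amp;imgrefurl=https://example.com/page">b</a></div>
<a href="/search?q=unrelated">not an image result</a>
'''

def extract_image_urls(markup):
    """Return the imgurl value of every link whose href contains 'imgurl'."""
    soup = BeautifulSoup(markup, "html.parser")
    urls = []
    # select() returns every matching element, unlike select_one()
    for a in soup.select('a[href*="imgurl"]'):
        query = parse_qs(urlparse(a["href"]).query)
        # Skip links without the parameter instead of raising KeyError
        urls.extend(query.get("imgurl", []))
    return urls

for url in extract_image_urls(html):
    print(url)
```

Using `query.get("imgurl", [])` also sidesteps the KeyError mentioned in the first comment when a selected link lacks the parameter.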

If the problem is your regex, then I think you can try this one:

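# Raw string so the backslashes reach the regex engine unchanged
link = re.search(r'^https?://.*[\r\n]*[^.\\,:;]', sourceCode)
# re.search returns None when nothing matches, so check before calling group()
if link:
    print(link.group())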

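The '?' is a regex quantifier, so it has to be escaped to match a literal question mark. The pattern also should not end with imgrefurl=" because in the source the attribute value continues after imgrefurl=. Try this:

links = re.findall(r'href="/imgres\?imgurl=(.*?)&amp;imgrefurl=', sourceCode)
for i in links:
    print(i)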
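
A quick way to check a pattern of this kind is to run it against the anchor tag from the question. The sketch below uses one possible variant rather than the exact pattern above: it escapes the '?' and, as an assumption about what the asker wants, captures everything up to the first '&' or closing quote. The `sample` string is a shortened copy of the question's markup:

```python
import re
from html import unescape

# Shortened copy of the anchor tag from the question
sample = (
    '<a href="/imgres?imgurl=https://linuxhint.com/wpcontent/uploads/2018/12/11.jpg'
    '&amp;imgrefurl=https://linuxhint.com/setup_screensaver_manjaro_linux/&amp;h=912">'
)

# \? matches a literal question mark; [^"&]+ stops at the first & or closing quote
pattern = re.compile(r'href="/imgres\?imgurl=([^"&]+)')

# unescape() is a no-op here, but guards against entities inside the captured URL
links = [unescape(match) for match in pattern.findall(sample)]
print(links)
```

Because the capture group stops at the first '&', the pattern does not depend on the image being a .jpg or on imgrefurl being the next parameter.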
