
I'm fetching web pages with curl and storing the result in a Python variable.

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'

I just want the links from the string, for example:

"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg"

I tried matching with a regular expression, defining the start of the pattern as "(https|http) and the end as ":

x = re.findall(r'"(https|http)*"$', var)
print(x)

But I only get an empty list instead of the URL. Please help me with this, thanks in advance.

>>> []
  • stackoverflow.com/questions/1080411/… Commented Jun 16, 2018 at 11:25
  • Note: there may be more than one URL in the string; the page fetched by curl can contain several. Commented Jun 16, 2018 at 11:25
  • Maybe the requests and/or BeautifulSoup modules are something for you. They can do what you want quite easily. Commented Jun 16, 2018 at 11:29
  • stackoverflow.com/a/1732454/5710637 Commented Jun 16, 2018 at 12:04
  • Do you need ALL of the links? Even the ones that refer to stylesheets and JavaScript files, external links (starting with // or http(s)://), and internal links (absolute like /path/to and relative like path/to)? Commented Jun 16, 2018 at 12:04

3 Answers

Using re.search

import re
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
m = re.search("src=\"(?P<url>.*?)\"", var)
if m:
    print(m.group('url'))  # print() as a function so this also runs on Python 3

Output:

https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
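
Since re.search stops at the first match and the page may contain several URLs, re.findall may be a closer fit. The pattern from the question, "(https|http)*"$, returns [] because the * repeats only the scheme group, so nothing after "https" can match, and the $ pins the match to the end of the string. Below is a minimal sketch, assuming every link is a double-quoted value that begins with http:// or https://. The name find_quoted_urls and the extra <a> tag in the sample are made up for illustration.

```python
import re

# Sample modeled on the question; the <a> tag is an added, hypothetical example.
html = (
    '<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/'
    'Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/>'
    '<a href="http://example.com/page">link</a><div class="wrapper">'
)

# Opening quote, an http:// or https:// scheme, then everything up to the
# next double quote. The single capture group holds the URL without quotes.
URL_IN_QUOTES = re.compile(r'"(https?://[^"]*)"')


def find_quoted_urls(text):
    """Return every double-quoted http(s) URL found in text, in order."""
    return URL_IN_QUOTES.findall(text)


urls = find_quoted_urls(html)
print(urls)
```

With one capture group, re.findall returns the captured URLs rather than the whole match, so the surrounding quotes are dropped. Values such as "display: none;" are skipped because they do not start with a scheme.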

2 Comments

Thank you, but the beginning may not be "src=\"; it may be something else, like "href=\". The pattern won't work in such cases, right?
In that case use m = re.search("(src|href)=\"(?P<url>.*?)\"", var)
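
To collect every src or href value rather than only the first, the same named-group pattern could be run through re.finditer. This is a rough sketch under the assumption that attribute values are always double-quoted; the sample markup and the names ATTR_URL and attr_urls are hypothetical.

```python
import re

# Hypothetical sample containing both attribute kinds.
html = '<a href="https://example.com/a.html"><img src="https://example.com/b.jpg"/></a>'

# (?:...) is a non-capturing group, so only the named group "url" is captured.
ATTR_URL = re.compile(r'(?:src|href)="(?P<url>.*?)"')

# finditer yields one match object per occurrence, in document order.
attr_urls = [m.group('url') for m in ATTR_URL.finditer(html)]
print(attr_urls)
```

The non-capturing group matters if the pattern is later passed to re.findall: with two capturing groups, findall would return tuples instead of plain strings.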

@Manoj, you can also retrieve the value of src attribute using the split() method multiple times as follows.

» Using lambda function (1 line statement)

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'

get_url = lambda html: html.split('=')[1].split('\"')[1]
print(get_url(var))

» Output

https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg

Let's expand the above approach into multiple statements to see what each step does.

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
print(var, "\n")
# <body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">

parts1 = var.split("=")
print(parts1, "\n")
# ['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']

parts2 = parts1[1].split('\"')
print(parts2, "\n")
# ['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']

print(parts2[1])
# https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg

» Output

E:\Users\Rishikesh\Python3\Practice\Temp>python GetUrls.py
<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">

['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']

['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']



https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
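
Note that the one-liner above returns only the value of the first attribute in the string, so it depends on the URL being that attribute and on no "=" appearing earlier in the markup. A possible variation for several URLs, assuming every URL is wrapped in double quotes, is to split on the quote character and keep the pieces that start with a scheme. The function name urls_by_split is made up for this sketch.

```python
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'


def urls_by_split(html):
    """Split on double quotes and keep the pieces that look like http(s) URLs."""
    pieces = html.split('"')
    # str.startswith accepts a tuple of prefixes to test against.
    return [p for p in pieces if p.startswith(('http://', 'https://'))]


found = urls_by_split(var)
print(found)
```

This still treats the HTML as plain text, so single-quoted or unquoted attribute values would be missed.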


Using BeautifulSoup, you could search for a and img tags and check their attributes:

For example:

from bs4 import BeautifulSoup as soup

var = '<body><a href=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\"><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/></a><div class=\"wrapper\">'
page_soup = soup(var, "html.parser")

links = []

for elm in page_soup.find_all(['a', 'img']):  # find_all is the current name for findAll
    if elm.has_attr('href'):
        links.append(elm.get('href'))
    if elm.has_attr('src'):
        links.append(elm.get('src'))

print(links)
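
If installing BeautifulSoup is not an option, a similar result can probably be had with the standard library's html.parser module. This is a different technique from the bs4 code above, shown only as a sketch: it collects href and src from every start tag, not just a and img. The class name LinkCollector, the helper collect_links, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from every start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ('href', 'src') and value:
                self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample with the same shape as the markup above.
sample = (
    '<body><a href="https://example.com/a">'
    '<img src="https://example.com/b.jpg" style="display: none;"/></a>'
    '<div class="wrapper">'
)
collected = collect_links(sample)
print(collected)
```

Self-closing tags such as <img .../> are covered as well, because the default handle_startendtag in HTMLParser delegates to handle_starttag.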
