How to get a URL from a string in Python

Question

I'm fetching webpages with curl and storing the result in a variable in Python.

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'

I just want the links from the string, for example:

"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg"

I tried matching with a regular expression, defining the start of the pattern as "(https|http) and the end as ":

x = re.findall(r'"(https|http)*"$', var)
print(x)

But I'm not getting the expected output; the result is an empty list:

>>> []

Please help me with this, thanks in advance.

Note: The string will not contain only one URL; curl may fetch a page with multiple URLs in it.
– Manoj Jahgirdar, Commented Jun 16, 2018 at 11:25
Maybe the modules requests and/or BeautifulSoup are something for you. They can do what you want quite easily.
– The Fool, Commented Jun 16, 2018 at 11:29
Do you need ALL of the links? Even the ones which refer to stylesheets, JavaScript files, external links (starting with // or http(s)://) and internal links (absolute like /path/to and relative like path/to) alike?
– wiesion, Commented Jun 16, 2018 at 12:04

Rakesh · Accepted Answer · 2018-06-16 11:25:48Z

2

Using re.search

import re
var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
m = re.search("src=\"(?P<url>.*?)\"", var)
if m:
    print(m.group('url'))

Output:

https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg

2 Comments

Manoj Jahgirdar Over a year ago

Thank you, but the beginning may not be "src=\"; it may be something like "href=\". The pattern won't work in such cases, right?

Rakesh Over a year ago

In that case use m = re.search("(src|href)=\"(?P<url>.*?)\"", var)
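
Since the question notes that the string may hold several URLs, the same pattern can be applied with re.findall, which returns every match rather than only the first. A minimal sketch, assuming the URLs always sit in double-quoted src or href attributes; the sample string is shortened and made up for illustration:

```python
import re

# Hypothetical sample with two links; not the asker's real page.
var = '<body><a href="https://example.com/page"><img src="https://example.com/a b.jpg"/></a>'

# (?:...) is a non-capturing group, so findall returns only the URL group.
urls = re.findall(r'(?:src|href)="(.*?)"', var)
print(urls)
# ['https://example.com/page', 'https://example.com/a b.jpg']
```

For comparison, the pattern in the question returns [] because (https|http)* repeats only the literal scheme, and $ requires the closing quote to be the last character of the string.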

hygull · Accepted Answer · 2018-06-16 12:05:02Z

@Manoj, you can also retrieve the value of the src attribute by calling the split() method multiple times, as follows.

» Using lambda function (1 line statement)

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'

get_url = lambda html: html.split('=')[1].split('\"')[1]
print(get_url(var))

» Output

https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg

Let's expand the above approach into multiple statements to see what each step does.

var = '<body><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/><div class=\"wrapper\">'
print(var, "\n")
# <body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">

parts1 = var.split("=")
print(parts1, "\n")
# ['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']

parts2 = parts1[1].split('\"')
print(parts2, "\n")
# ['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']

print(parts2[1])
# https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg

» Output

E:\Users\Rishikesh\Python3\Practice\Temp>python GetUrls.py
<body><img src="https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style="display: none;"/><div class="wrapper">

['<body><img src', '"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg" style', '"display: none;"/><div class', '"wrapper">']

['', 'https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg', ' style']



https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg
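
Note that split('=')[1] picks up only the first attribute and would cut off a URL that itself contains '=' (for example in a query string). A possible variation on the same idea is to split on the double quote alone and keep the pieces that look like URLs. This is only a sketch; the helper name get_urls and the sample string are made up for illustration:

```python
def get_urls(html):
    """Return every double-quoted piece of html that starts with http:// or https://."""
    # Splitting on '"' leaves each quoted attribute value as its own piece.
    pieces = html.split('"')
    return [p for p in pieces if p.startswith(('http://', 'https://'))]


# Hypothetical sample: two links, one with '=' in its query string.
sample = '<a href="https://example.com/?id=1"><img src="http://example.com/x.jpg" style="display: none;"/></a>'
print(get_urls(sample))
# ['https://example.com/?id=1', 'http://example.com/x.jpg']
```

This still assumes every URL is wrapped in double quotes, so it is no more robust than the input is regular.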

The fourth bird · Accepted Answer · 2018-06-16 11:59:18Z

0

Using BeautifulSoup, you could search for a and img elements and check their href and src attributes:

For example:

from bs4 import BeautifulSoup as soup

var = '<body><a href=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\"><img src=\"https://cdn.neow.in/news/images/uploaded/2018/06/Gaming sloop, preview group, hardware _mediump.jpg\" style=\"display: none;\"/></a><div class=\"wrapper\">'
page_soup = soup(var, "html.parser")

links = []

for elm in page_soup.find_all(['a', 'img']):
    if elm.has_attr('href'):
        links.append(elm.get('href'))
    if elm.has_attr('src'):
        links.append(elm.get('src'))

print(links)
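
If installing BeautifulSoup is not an option, the standard library's html.parser module can do a similar job. A rough sketch, assuming only href and src attributes matter; the class name LinkCollector and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from every start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ('href', 'src') and value:
                self.links.append(value)


# Hypothetical sample markup, not the asker's real page.
html = '<body><a href="https://example.com/page"><img src="https://example.com/a b.jpg"/></a></body>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['https://example.com/page', 'https://example.com/a b.jpg']
```

Unlike the BeautifulSoup loop above, this version does not filter by tag name, so it would also pick up link and script elements.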
