I have a string:

test_string="lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"

How can I extract both complete URLs from the string using a Python regex?

I tried:

pattern = 'https://news.sky.net/upload_files/image'
result = re.findall(pattern, test_string)

I can get a list:

['https://news.sky.net/upload_files/image','https://news.sky.net/upload_files/image']

but not the whole URLs, so I tried:

pattern = 'https://news.sky.net/upload_files/image...$png'
result = re.findall(pattern, test_string)

But I received an empty list.

3 Answers

You could lazily match as few characters as possible after image, up to a literal . followed by either png or jpg:

import re

test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
pattern = r'https://news\.sky\.net/upload_files/image.*?\.(?:png|jpg)'
print(re.findall(pattern, test_string))

Output:

[
 'https://news.sky.net/upload_files/image/2022/202209_166293.png',
 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]
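
As a variation on the same idea, here is a minimal sketch that builds the pattern with re.escape so the dots in the fixed prefix only match literal dots, and that keeps each match inside a single quoted URL. The names prefix and image_urls, and the choice of png and jpg as the only extensions, are assumptions for illustration.

```python
import re

test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"

# re.escape turns the dots in the fixed prefix into literal dots.
prefix = re.escape("https://news.sky.net/upload_files/image")

# [^'\s]*? lazily consumes characters that are neither a quote nor whitespace,
# so a match cannot run from one quoted URL into the next.
# The extension list (png, jpg) is an assumption; extend it as needed.
pattern = re.compile(prefix + r"[^'\s]*?\.(?:png|jpg)")

image_urls = pattern.findall(test_string)
print(image_urls)
```

Compiling the pattern once is mostly a readability choice here; it also helps if the same pattern is applied to many strings.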

Assuming you would always expect the URLs to appear inside single quotes, we can use re.findall as follows:

import re

test_string = "lots of other html tags ,'https://news.sky.net/upload_files/image/2022/202209_166293.png',and still 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"
urls = re.findall(r"'(https?:\S+?)'", test_string)
print(urls)

This prints:

['https://news.sky.net/upload_files/image/2022/202209_166293.png',
 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg']
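
If the URLs might also appear inside double quotes (as they often do in HTML attributes), one possible sketch is to capture the opening quote and require the same character to close the match with a backreference. The sample string and the names sample and quoted_url are hypothetical and not taken from the question.

```python
import re

# Hypothetical input mixing both quote styles (not from the question).
sample = "<img src=\"https://news.sky.net/upload_files/image/2022/202209_166293.png\"> and 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'"

# Group 1 captures the opening quote; \1 requires the same quote to close.
# Group 2 is the URL itself, matched lazily so it stops at the first closing quote.
quoted_url = re.compile(r"""(["'])(https?://\S+?)\1""")

# finditer is used instead of findall because the pattern has two groups
# and only the second one (the URL) is wanted.
urls = [m.group(2) for m in quoted_url.finditer(sample)]
print(urls)
```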

You could match any URL inside the string by using the regex https?://[^\s']+

Applied to your code, it would look something like this:

import re

string = "Some string here'https://news.sky.net/upload_files/image/2022/202209_166293.png' And here as well 'https://news.sky.net/upload_files/image/2022/202209_166293.jpg' that's it tho"

res = re.findall(r"https?://[^\s']+", string)

print(res)

This returns a list of the URLs collected from the string:

[
    'https://news.sky.net/upload_files/image/2022/202209_166293.png', 
    'https://news.sky.net/upload_files/image/2022/202209_166293.jpg'
]

Regex explanation:

https?://[^\s']+

  • https? - matches either http or https
  • [^\s']+ - one or more characters that are neither whitespace nor a single quote

So this matches http or https, then ://, then every following character up to the next whitespace or single quote. A plain \S+ would also capture the closing ' after each URL, and the extra group in (http(s)?://\S+) would make re.findall return tuples instead of plain strings.
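
Two details of re.findall are easy to trip over with patterns of this kind, so here is a small sketch comparing three variants on a shortened, hypothetical input. The variable names greedy, grouped and clean are illustrative only.

```python
import re

# Shortened, hypothetical input with one quoted URL.
text = "here'https://news.sky.net/upload_files/image/2022/202209_166293.png' and more"

# \S+ is greedy and a quote is not whitespace, so the closing ' is included.
greedy = re.findall(r"https?://\S+", text)

# With two capturing groups, findall returns one tuple per match
# (outer group, inner group) instead of a plain string.
grouped = re.findall(r"(http(s)?://\S+)", text)

# Excluding the quote from the character class and using no groups
# yields plain URL strings.
clean = re.findall(r"https?://[^\s']+", text)

print(greedy)
print(grouped)
print(clean)
```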

Hope you find this helpful.
