I have a bytes object that contains urls:

> body.decode("utf-8") 
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'

I need to split it into a list with each url as a separate element:

import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'

urls = re.compile(pattern).split(body.decode("utf-8"))

What I get is a one-element list with all the URLs still joined together:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']

How do I split the string so that each URL is a separate element?

4
  • 1
    Why don't you split on \s+? That should give you the required results. Commented Nov 11, 2018 at 17:48
  • @PushpeshKumarRajwanshi can you give an example? Commented Nov 11, 2018 at 17:49
  • 1
    It's probably because your pattern doesn't match anything, so it doesn't split anything. Commented Nov 11, 2018 at 19:07
  • You'd be better off using findall() with a modified version of your pattern: (?m)^(?:https?:\/\/(?:www\.)?)?[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/.*)? Commented Nov 11, 2018 at 19:14
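
The diagnosis in the comments can be checked directly. This is a minimal sketch, assuming the decoded string from the question (the name `text` is introduced here for illustration). Without the multiline flag, `^` matches only at the start of the string and `$` only at its end (or just before a final newline), so the pattern never matches and `re.split` has nothing to split on.

```python
import re

# Stand-in for body.decode("utf-8") from the question.
text = (
    'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
    'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
)

# The question's pattern, written as a raw string so the backslashes reach the regex engine as-is.
pattern = r'^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'

# The anchored pattern has to cover the whole string, which it cannot do
# because '.' does not match the '\n' characters between the URLs.
match = re.search(pattern, text)

# With no match, re.split returns the input as a single element.
parts = re.split(pattern, text)

print(match)
print(len(parts))
```

A pattern passed to `re.split` should describe the separator between the items, not the items themselves, which is why splitting on whitespace works here.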

1 Answer

Try splitting on \s+ (one or more whitespace characters) instead.

Here is a sample in Python:

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
# Split on runs of whitespace; the raw string keeps \s from being treated as a string escape.
urls = re.split(r'\s+', s)
print(urls)

This outputs:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']

Does this result look OK? If not, we can adjust it to what you need.

If you don't want the empty string ('') at the end of the list (it appears because the input ends with \r\n), you can use re.findall to extract the URLs instead. Sample Python code:

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
# Lazily match from "http" up to the next whitespace; the lookahead leaves the whitespace out of the match.
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)

This gives the following output:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
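
If a regex is not otherwise needed, a simpler option may be `str.split()` called with no arguments, which splits on runs of whitespace and does not produce empty strings for leading or trailing whitespace. This is a minimal sketch, assuming `body` is the bytes object from the question (the literal below is a stand-in for it).

```python
# Stand-in for the bytes object from the question.
body = (
    b'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
    b'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
)

# Decode first, then split on any whitespace (\r, \n, spaces, tabs).
# With no separator argument, runs of whitespace count as one separator
# and the trailing \r\n does not yield an empty element.
urls = body.decode("utf-8").split()

print(urls)
```

This relies on URLs never containing raw whitespace, which holds for the sample input; input with other text mixed in would still need a pattern-based approach such as `re.findall`.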

1 Comment

You have all the URLs in the urls list, so you can use them however you want. I am not sure what you mean by "put each url into a separate element of a list"?
