Splitting a string in Python based on a regex pattern

Question

I have a bytes object that contains urls:

> body.decode("utf-8") 
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'

I need to split it into a list with each url as a separate element:

import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'

urls = re.compile(pattern).split(body.decode("utf-8"))

What I get is a list of one element with all urls pasted together:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']

How do I split each url into a separate element?

Why don't you split with \s+? That should give you the required results.
– Pushpesh Kumar Rajwanshi, Commented Nov 11, 2018 at 17:48
It's probably because your pattern doesn't match anything, so it doesn't split anything.
– user557597, Commented Nov 11, 2018 at 19:07
You'd be better off using findall() with a modified version of your pattern: (?m)^(?:https?:\/\/(?:www\.)?)?[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/.*)?
– user557597, Commented Nov 11, 2018 at 19:14
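
The two comments above can be checked directly. The sketch below is not from the thread. It shows that the question's pattern, anchored with ^ and $ and used without re.MULTILINE, finds no match in the input, so split() returns the input unchanged. It then tries a findall() variant of the commenter's pattern. Ending that variant with \S* rather than .* is an assumption made here so that the trailing \r is not captured (in Python, . matches \r), and the names text, question_pattern and line_pattern are illustrative only.

```python
import re

text = (
    'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
    'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-'
    'on-a-tesla-model-3-maybe-its-complicated/\r\n'
)

# The pattern from the question, written as a raw string.
question_pattern = (
    r'^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?'
    r'[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'
)

# Without re.MULTILINE, ^ and $ anchor to the whole string, and nothing in the
# pattern can cross the '\n' between the URLs, so there is no match at all.
no_match = re.search(question_pattern, text) is None
unsplit = re.split(question_pattern, text)

# Variant of the commenter's pattern: (?m) makes ^ match at each line start.
# Ending with \S* instead of .* is an assumption made for this sketch.
line_pattern = (
    r'(?m)^(?:https?://(?:www\.)?)?'
    r'[a-z0-9]+(?:[-.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:/\S*)?'
)
urls = re.findall(line_pattern, text)

print(no_match)      # whether the question's pattern failed to match
print(len(unsplit))  # number of pieces split() produced
print(urls)
```

Note that even a matching pattern would not help with split(): split() treats matches as separators and discards them, so a pattern that describes the URLs themselves belongs in findall().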

Pushpesh Kumar Rajwanshi · Accepted Answer · 2018-11-11 20:00:49Z

1

Try splitting on \s+ (one or more whitespace characters), as in this sample Python code:

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.split(r'\s+', s)  # raw string avoids an invalid-escape warning for \s
print(urls)

This outputs:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']

Does this result look OK? If not, we can adjust it to produce what you need.
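
One alternative that is not mentioned in the thread is to skip regular expressions entirely: str.split() called with no arguments splits on runs of whitespace and discards empty strings at both ends. A minimal sketch, assuming body is the bytes object from the question (reconstructed here from the decoded text shown there):

```python
import re

# Assumed reconstruction of the question's bytes object.
body = (
    b'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
    b'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-'
    b'on-a-tesla-model-3-maybe-its-complicated/\r\n'
)
text = body.decode("utf-8")

# No-argument split(): whitespace runs are separators, no empty strings result.
urls = text.split()

# Equivalent result with re.split, filtering out the trailing empty string.
urls_filtered = [u for u in re.split(r'\s+', text) if u]

print(urls)
print(urls == urls_filtered)
```

Both approaches give a two-element list here; the no-argument split() is usually the simpler choice when the separators are plain whitespace.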

If you don't want the empty string ('') in the result list (it is caused by the trailing \r\n), you can use re.findall to extract the URLs from the string instead of splitting. Sample Python code:

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
urls = re.findall(r'http.*?(?=\s+)', s)  # lazily match from 'http' up to the next whitespace
print(urls)

This gives the following output:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
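
One caveat with the lookahead version: (?=\s+) requires whitespace after every URL, so a final URL with no trailing newline is skipped. A hedged alternative, not part of the original answer, matches the scheme followed by a run of non-whitespace characters. The example.com and example.org URLs below are placeholders chosen for illustration:

```python
import re

# Hypothetical input whose last URL is NOT followed by whitespace.
s = 'https://example.com/a\r\nhttps://example.org/b'

# The answer's pattern: the last URL has no whitespace after it, so it is missed.
with_lookahead = re.findall(r'http.*?(?=\s+)', s)

# Alternative: scheme plus one or more non-whitespace characters.
with_nonspace = re.findall(r'https?://\S+', s)

print(with_lookahead)
print(with_nonspace)
```

For the input in the question, which ends with \r\n, both patterns return the same two URLs.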

edited Nov 11, 2018 at 20:00

answered Nov 11, 2018 at 17:50

1 Comment

Pushpesh Kumar Rajwanshi Over a year ago

You have all the URLs in the urls list, so you can use them however you want. I am not sure what you mean by "put each url into a separate element of a list".
