Strip everything but URL from a string in python

Question

I'm grabbing a series of links from a website with Python and BS4, but I need to clean them up so that I only get the URL from each string.

The links I get look like this:

javascript:changeChannel('http://some-server.com/with1234init.also', 20);

and I need them to look like this:

http://some-server.com/with1234init.also

Are all strings of the exact same format, or are there corner cases in the HTML that may cause simple rules to fail?
– jozxyqk, Commented Feb 20, 2014 at 11:48
I forgot to mention that all the links I grab are different. They all start with the javascript:changeChannel(' part, but the URLs differ, and so does everything after the closing ' in each link.
– user3332151, Commented Feb 20, 2014 at 13:45

Paulo Bu · Accepted Answer · 2014-02-20 10:32:39Z

1

Well, if all the links are like that one you can do it with a very simple approach:

s.split("'")[1]

For example:

>>> s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
>>> s.split("'")
['javascript:changeChannel(',
 'http://some-server.com/with1234init.also',
 ', 20);']
>>> s.split("'")[1]
'http://some-server.com/with1234init.also'
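If you want to run this over a whole list of scraped links, here is a minimal sketch of the same split idea wrapped in a function. The name url_from_link, the guard for strings without quotes, and the second sample link are my own illustrative assumptions, not something from the question.

```python
def url_from_link(link):
    """Return the text between the first pair of single quotes, or None."""
    parts = link.split("'")
    # A link in the expected format splits into at least three pieces:
    # the prefix, the quoted URL, and the trailing arguments.
    if len(parts) < 3:
        return None
    return parts[1]


# The second link is a made-up sample in the same format as the question's.
links = [
    "javascript:changeChannel('http://some-server.com/with1234init.also', 20);",
    "javascript:changeChannel('http://some-server.com/another', 7);",
]
urls = [url_from_link(link) for link in links]
print(urls)
```

Returning None for an unexpected string is just one possible choice; raising an error would work equally well if a malformed link should stop the scrape.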



9 Comments

Nafiul Islam Over a year ago

True, and I was about to post this. However, it does not give you something exact. Perhaps you can do this and then search with a regex to determine the index value.

Paulo Bu Over a year ago

Well, if all the strings are formatted the same this will probably work well for everyone. What is the case you say is not exact?

Nafiul Islam Over a year ago

For example, there could be more than just 2 single quotes in the line. In essence, this solution will only work for this problem but does not solve the issue at large.

Paulo Bu Over a year ago

@GamesBrainiac you're right. The solution is very domain specific. I explained in the answer that all the strings need to have the same format. But if they do, I think it is worth doing because it is very simple.

Nafiul Islam Over a year ago

Indeed, but I was hoping you knew some way to capture a URL (heh) using regex. I've been trying to make one myself, but I fail most of the time.
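One possible middle ground between the plain split and a general URL regex is to anchor a small regex on the prefix the asker says every link shares. This is only a sketch under that assumption; the names CHANNEL_LINK and extract_channel_url, and the second and third sample links, are hypothetical.

```python
import re

# Assumption: every link starts with "javascript:changeChannel('",
# as the asker describes in the comments on the question.
CHANNEL_LINK = re.compile(r"^javascript:changeChannel\('([^']*)'")


def extract_channel_url(link):
    """Return the first quoted argument of changeChannel(...), or None."""
    match = CHANNEL_LINK.match(link.strip())
    return match.group(1) if match else None


print(extract_channel_url(
    "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
))
# A link with extra quoted arguments still yields only the first one.
print(extract_channel_url(
    "javascript:changeChannel('http://some-server.com/a', 'label', 3);"
))
# A link in some other format is rejected instead of being mis-parsed.
print(extract_channel_url("javascript:openPopup('http://some-server.com/b');"))
```

This does not validate that the captured text is a URL; it only checks that the string has the expected shape and returns whatever sits inside the first pair of quotes.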


MONTYHS · Accepted Answer · 2014-02-20 10:32:47Z

0

 # Quote the input, and avoid shadowing the built-in name str.
 s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
 formattedtext = "http://" + s.split("http://")[1].split(',')[0].strip("'")


Community · Accepted Answer · 2017-05-23 12:28:43Z

A reasonably robust way is to take your chunk of text and search it with a URL-matching regex pattern.

See also:

Python regular expression again - match url
which links to here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Extracting URL link using regular expression re - string matching - Python

Using regex...

import re
re.search(pattern, text)
... or
re.findall(pattern, text)
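
To make the difference between the two calls concrete, here is a small sketch. The pattern below is a deliberately simple stand-in that I am assuming for illustration (it is not the full URL pattern from the linked article), and the sample text is made up in the question's format.

```python
import re

# Hypothetical, deliberately simple pattern: "http://" or "https://" followed
# by anything up to the next whitespace or single quote. Not a full URL grammar.
pattern = r"https?://[^\s']+"

text = (
    "javascript:changeChannel('http://some-server.com/one', 1); "
    "javascript:changeChannel('http://some-server.com/two', 2);"
)

first = re.search(pattern, text)      # match object for the first hit, or None
first_url = first.group() if first else None

all_urls = re.findall(pattern, text)  # list of every non-overlapping match

print(first_url)
print(all_urls)
```

re.search is the natural fit when each string holds exactly one link; re.findall is handy when a single chunk of text may contain several.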

A full example...

>>> p = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')
or
>>> p = re.compile('(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\\\'".,<>?\xc2\xab\xc2\xbb\xe2\x80\x9c\xe2\x80\x9d\xe2\x80\x98\xe2\x80\x99]))')

>>> m = p.search("javascript:changeChannel('http://some-server.com/with1234init.also', 20);")
>>> m.group()
'http://some-server.com/with1234init.also'

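The pattern used is the web URL version from the Daring Fireball link above.

Note the use of the r prefix and the escaped ' quote towards the end of the first pattern. The second pattern spells the same non-ASCII quote characters as UTF-8 byte escapes, which is how a Python 2 byte string represents them; on Python 3, prefer the first form.
re.compile returns a compiled pattern object, so the pattern is parsed once and can be reused for many searches.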
