1

I'm grabbing a series of links from a website with Python and BS4, but I need to clean them up so that I get only the URL from each string.

The links I get look like this:

javascript:changeChannel('http://some-server.com/with1234init.also', 20);

and I need them to look like this:

http://some-server.com/with1234init.also

3 Comments

  • What is your attempt? Commented Feb 20, 2014 at 10:31
  • Are all strings of the exact same format, or are there corner cases in the HTML that may cause simple rules to fail? Commented Feb 20, 2014 at 11:48
  • I forgot to mention that all the links I grab are different. They all start with the javascript:changeChannel(' part, but the URLs are different, and the part after the last ' is also different in every link. Commented Feb 20, 2014 at 13:45

3 Answers

1

Well, if all the links are like that one you can do it with a very simple approach:

s.split("'")[1]

For example:

>>> s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
>>> s.split("'")
['javascript:changeChannel(',
 'http://some-server.com/with1234init.also',
 ', 20);']
>>> s.split("'")[1]
'http://some-server.com/with1234init.also'
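
If some of the scraped strings might not contain a quoted part at all, the split can be wrapped in a small guard. This is only a minimal sketch; the helper name extract_quoted is hypothetical and not part of the original answer, and it still assumes the URL is the first single-quoted item in the string.

```python
def extract_quoted(s):
    """Return the text between the first pair of single quotes, or None."""
    parts = s.split("'")
    # A well-formed link splits into at least three pieces:
    # the prefix, the quoted URL, and the suffix.
    if len(parts) < 3:
        return None
    return parts[1]


s = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
print(extract_quoted(s))
```

Returning None instead of raising IndexError makes it easy to skip links that do not follow the expected format.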

9 Comments

True, and I was about to post this, however, it does not give you something exact. Perhaps, you can do this and then do a search with a regex to determine the index value.
Well, if all the strings are formatted the same this will probably work well for everyone. What is the case you say is not exact?
For example, there could be more than just 2 single quotes in the line. In essence, this solution will only work for this problem but does not solve the issue at large.
@GamesBrainiac you're right. The solution is very domain specific. I explained in the answer that all strings need to have the same format. But if they do, I think it is worth doing because it is very simple.
Indeed, but I was hoping you knew some way to capture a URL (heh) using regex. I've been trying to make one myself, but I fail most of the time.
0
 # the link must be a quoted string; "link" avoids shadowing the built-in str
 link = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
 formattedtext = "http://" + link.split("http://")[1].split(',')[0].strip("'")
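
A variation on the same idea, sketched with str.find so that https:// links are handled too and a string without any URL does not raise an error. The function name extract_url is hypothetical, and the sketch assumes the URL ends at the next single quote.

```python
def extract_url(link):
    """Return the substring from the first 'http' up to the next single quote."""
    start = link.find("http")
    if start == -1:
        # No URL-like text in this string.
        return None
    end = link.find("'", start)
    # If there is no closing quote, take everything to the end of the string.
    return link[start:end] if end != -1 else link[start:]


link = "javascript:changeChannel('http://some-server.com/with1234init.also', 20);"
print(extract_url(link))
```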


0

A reasonably robust way is to take your chunk of text and search it with a URL-matching regex pattern.

See also:

Using regex...

import re
re.search(pattern, text)
... or
re.findall(pattern, text)

A full example...

>>> p = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))')
or
>>> p = re.compile('(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:\\\'".,<>?\xc2\xab\xc2\xbb\xe2\x80\x9c\xe2\x80\x9d\xe2\x80\x98\xe2\x80\x99]))')

>>> m = p.search("javascript:changeChannel('http://some-server.com/with1234init.also', 20);")
>>> m.group()
'http://some-server.com/with1234init.also'
  1. The pattern used is the web URL version from the link above.

    Note the use of the r prefix and the escaped ' quote towards the end in the first pattern.

  2. re.compile returns a compiled pattern object, so the pattern is parsed once and can be reused for many searches.
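
If every link is known to start with javascript:changeChannel(' as the asker describes, a much shorter pattern anchored on that call may be enough. The following is a sketch under that assumption, not the general URL pattern above; the names CHANNEL_RE and extract_channel_url are hypothetical, and the second and third sample strings are made up for illustration.

```python
import re

# Assumed link shape: javascript:changeChannel('<url>', <anything>);
# The group captures everything between the first pair of single quotes
# that follows the opening parenthesis.
CHANNEL_RE = re.compile(r"changeChannel\(\s*'([^']*)'")


def extract_channel_url(href):
    """Return the first argument of changeChannel(...), or None if absent."""
    m = CHANNEL_RE.search(href)
    return m.group(1) if m else None


hrefs = [
    "javascript:changeChannel('http://some-server.com/with1234init.also', 20);",
    "javascript:changeChannel('http://other-server.com/stream', 7, 'hd');",
    "http://plain.example/link",
]
urls = [extract_channel_url(h) for h in hrefs]
print(urls)
```

Unlike splitting on every quote, this keeps working when later arguments also contain single quotes, and it returns None for links that are not changeChannel calls.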
