Get only URL from string - Python

Question

I am scraping a page with Python and the BeautifulSoup library.

I need to extract only the URL from the string below, which is the value of the href attribute of an a tag. I have scraped the attribute but cannot find a way to extract the URL from it:

javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');

Get the substring inside (), then explode it on , and take the value at the first index.
– user4275254, Commented Nov 21, 2014 at 18:20
Oh sorry, I didn't notice there is more than one string inside the brackets.
– Tim, Commented Nov 21, 2014 at 18:21

senshin · Accepted Answer · 2014-11-21 18:27:09Z

2

You can write a straightforward regex to extract the URL.

>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'

The regex in question here is

'(.*?)'

This reads: "find a single quote, followed by anything (captured), followed by another single quote, matching non-greedily because of the ? operator". That extracts each single-quoted argument of window.open; then just pick the first one to get the URL.

You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
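As a slightly stricter variant of the pattern above, here is a minimal sketch that anchors the match to window.open( so only its first quoted argument is captured, and returns None instead of raising when the href has a different shape. The function name extract_popup_url is made up for illustration. Python 3 is assumed (for urllib.parse), and the unquote step is only an assumption for hrefs whose path contains percent-escapes such as %27.

```python
import re
from urllib.parse import unquote

# Match the first single-quoted argument of window.open(...).
# [^']* cannot run past the closing quote, so no non-greedy operator is needed.
_OPEN_ARG = re.compile(r"window\.open\(\s*'([^']*)'")


def extract_popup_url(href):
    """Return the first argument of window.open in href, or None if absent."""
    match = _OPEN_ARG.search(href)
    if match is None:
        return None
    # Decode percent-escapes (e.g. %27 -> ') in the captured path.
    return unquote(match.group(1))


href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(extract_popup_url(href))           # the path from the question
print(extract_popup_url("/plain/link"))  # None: not a window.open href
```

Returning None makes it easy to skip ordinary links when looping over every a tag on the page.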

user4275254 Over a year ago

I will give it a try. By the way, I could also first get the string inside (), then explode it on , and take the first element.

user4275254 · Accepted Answer · 2014-11-21 18:38:18Z

1

I did it this way.

terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"

terms.split("('")[1].split("','")[0]

outputs

/Sheraton-Tucson-Hotel-177/tnc/150/24795/en

Jon Clements · Accepted Answer · 2014-11-21 18:36:37Z

0

Instead of a regex, you could just partition the string twice on a delimiter (e.g. '):

s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en

user4275254 Over a year ago

Please see my answer too. Which one would you suggest is better?

Jon Clements Over a year ago

@Mani by using partition, you only split "once", and you're guaranteed to get empty strings where the delimiter is not present. So 1) It's more efficient, and 2) it's safer, as your .split could raise an IndexError where it's not present... Your choice of delimiters is possibly more sensible though
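To make the difference concrete, here is a small sketch comparing the two approaches on an href that has no quoted argument at all. The function names and the "/plain/link" input are made up for illustration, and the third window.open argument is shortened.

```python
def first_quoted_split(s):
    # Indexing [1] raises IndexError when s contains no "('" at all.
    return s.split("('")[1].split("','")[0]


def first_quoted_partition(s):
    # partition always returns a 3-tuple; missing parts are empty strings.
    return s.partition("'")[2].partition("'")[0]


good = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490');"
bad = "/plain/link"  # hypothetical href without any quoted argument

print(first_quoted_split(good))
print(first_quoted_partition(good))
print(repr(first_quoted_partition(bad)))  # empty string, no exception

try:
    first_quoted_split(bad)
except IndexError:
    print("split version raised IndexError")
```

With partition you still have to check for the empty string, but you can do that with a plain if instead of a try/except.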

Mithun · Accepted Answer · 2014-11-21 18:52:38Z

-1

Here's a quick and ugly answer

href.split("'")[1]

Mithun Over a year ago

Algorithm Rule #1 : why use 2 splits, when you can use 1

user4275254 Over a year ago

Your line produces an error ... that's why I said I could downvote you

Mithun Over a year ago

href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(href.split("'")[1])

This runs just fine. Don't know what you're talking about.
