I am scraping a page with Python and the BeautifulSoup library.

I need to extract only the URL from the string below. It is the value of the href attribute of an a tag. I have scraped it, but I cannot find a way to extract the URL from it:

javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
  • Would it be too easy to take the substring from ( to )? Commented Nov 21, 2014 at 18:14
  • Get the substring inside (), explode (split) it on "," and then take the first element. Commented Nov 21, 2014 at 18:20
  • Oh sorry, I didn't notice there is more than one string inside the brackets. Commented Nov 21, 2014 at 18:21

4 Answers

2

You can write a straightforward regex to extract the URL.

>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'

The regex in question here is

'(.*?)'

This reads: "find a single quote, followed by anything (and capture that anything), followed by another single quote", and the ? after .* makes the match non-greedy, so it stops at the nearest closing quote. re.findall therefore returns the arguments of window.open; pick the first one to get the URL.

You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
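If other quoted text could appear in the href, one option is to anchor the pattern on the window.open( call itself rather than on any pair of quotes. The following is only a minimal sketch: the helper name extract_popup_url and the pattern POPUP_RE are hypothetical, and it assumes the URL is always the first single-quoted argument and contains no literal single quote.

```python
import re

# Capture the first single-quoted argument that follows "window.open(".
POPUP_RE = re.compile(r"window\.open\(\s*'([^']*)'")

def extract_popup_url(href):
    """Return the first argument of window.open(...), or None if it is absent."""
    match = POPUP_RE.search(href)
    return match.group(1) if match else None

href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(extract_popup_url(href))
```

Returning None for hrefs that do not contain a window.open call lets the caller skip ordinary links instead of handling an exception.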


1 Comment

I will give it a try. By the way, I could also first get the string inside (), explode (split) it, and then take the first element.
1

I did it this way.

terms = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"

terms.split("('")[1].split("','")[0]

outputs

/Sheraton-Tucson-Hotel-177/tnc/150/24795/en


0

Instead of a regex, you could just partition the string (s below) twice on a delimiter, e.g. a single quote:

s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
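One property of the double partition is how it behaves when the delimiter is missing: str.partition always returns a 3-tuple, so the expression yields an empty string instead of raising. A small sketch comparing it with an indexed split follows; the helper name first_quoted and the plain URL are made up for illustration.

```python
def first_quoted(s):
    """Text after the first single quote, up to the next one ('' if s has no quote)."""
    return s.partition("'")[2].partition("'")[0]

href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
plain = "http://example.com/page"  # hypothetical href without any quotes

print(first_quoted(href))         # the popup path
print(repr(first_quoted(plain)))  # empty string, no exception

# An indexed split on the same input fails instead of returning a default.
try:
    plain.split("('")[1]
except IndexError:
    print("split(...)[1] raised IndexError")
```

Whether an empty string or an exception is preferable depends on how the caller wants to treat links that are not popups.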

2 Comments

Please see my answer too. Which one would you suggest is the better one?
@Mani With partition you only split once, and you are guaranteed to get empty strings when the delimiter is not present. So 1) it is more efficient, and 2) it is safer, because your .split(...)[1] could raise an IndexError when the delimiter is missing. Your choice of delimiters is possibly more sensible, though.
-1

Here's a quick and ugly answer

href.split("'")[1]

3 Comments

Algorithm rule #1: why use two splits when you can use one?
Your line produces an error. That's why I said I could downvote you.
href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
print(href.split("'")[1])
This runs fine. I don't know what you are talking about.
