0

I'm having a really annoying problem. The answer is probably very simple, yet I can't put 2 and 2 together...

I have an example of a string that'll look something like this:

<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>

The number 38903 will be different every time I load the page, so I need a way to extract it on each load. I've gotten far enough to grab and store the piece of HTML above, but I can't isolate just the number.

Again, probably a really easy thing to do, just can't figure it out. Thanks in advance!

3
  • "grab and contain the piece of HTML code" With what? Commented May 8, 2012 at 5:32
  • Anything in Python. Currently using BeautifulSoup though. Commented May 8, 2012 at 5:36
  • Added BeautifulSoup to tag list. Commented May 8, 2012 at 6:11

3 Answers

1

If you're using BeautifulSoup it is dead simple to get just the onclick string, which will make this easier. But here's a really crude way to do it:

import re
result = re.sub(r"\D", "", html_string)[1:]

\D matches all non-digits, so this will remove everything in the string that isn't a number. Then take a slice to get rid of the "0" from javascript:void(0).

Other options: use re.findall to grab every run of digits and take the second match (re.search only returns the first match, which here is the "0"). Or use re.search to match a run of digits after a fixed substring, where the substring is <a href="javascript:void(0);" onclick="viewsite(.
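
A minimal sketch of the two alternatives above, assuming the grabbed HTML is held in a plain string. html_string here is just the sample from the question, and the shorter anchor onclick="viewsite( is an assumption that works for this sample:

```python
import re

# Sample from the question; in practice this is whatever HTML you grabbed.
html_string = '<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'

# Option 1: collect every run of digits and take the second one
# (the first run is the "0" in javascript:void(0)).
all_numbers = re.findall(r"\d+", html_string)
second_number = all_numbers[1]

# Option 2: capture only the digits that directly follow 'onclick="viewsite('.
match = re.search(r'onclick="viewsite\((\d+)', html_string)
site_id = match.group(1) if match else None

print(second_number, site_id)
```

Both values come back as strings, so wrap them in int() if you need a number. Option 2 is less fragile, since it does not depend on how many other digits appear in the tag.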

Edit: It sounds like you are using BeautifulSoup. In that case, presumably you have an object which represents the a tag. Let's assume that object is named a:

import re
result = re.sub(r"\D", "", a['onclick'])

1 Comment

(The original version of this answer treated re.sub as if it modified html_string itself, which of course it does not because Python strings are immutable. It's been edited to fix that.)
1
import re
r = re.compile(r'viewsite\((\d+)\)')
r.findall(s)

This will specifically look for the all-digit argument to viewsite(). You may prefer this to Andrew's answer, since stripping out every non-digit would start giving incorrect results if other digits were to show up in the HTML string.
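
To illustrate the difference, here is a small sketch using a hypothetical variation of the question's HTML. The id="link7" attribute is made up for the example and is not in the original:

```python
import re

# Hypothetical variation of the question's HTML with one extra digit in it.
s = '<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton" id="link7">Visit</a>'

# Digit-stripping approach: keeps every digit in the string, then drops the leading "0".
stripped = re.sub(r"\D", "", s)[1:]

# Targeted approach: only captures the digits inside viewsite(...).
targeted = re.findall(r"viewsite\((\d+)\)", s)

print(stripped)  # 389037 (the stray 7 leaks in)
print(targeted)  # ['38903']
```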

1 Comment

Yes, this is better -- though if the OP is using BeautifulSoup as mentioned in question comments, it's even better to just nab the onclick string and work on that instead of parsing the whole thing.
0
>>> import re
>>> grabbed_html = '''<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'''
>>> re.findall(r'viewsite\((\d+)\);', grabbed_html)[0]
'38903'
