Python - Parse String

Question

I'm having a really annoying problem. The answer is probably very simple, yet I can't put two and two together...

I have an example of a string that'll look something like this:

<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>

The number 38903 will be different every time I load the page, so I need a way to extract it on each load. I've gotten far enough to grab and store the piece of HTML above, but I can't isolate just the number.

Again, probably a really easy thing to do, just can't figure it out. Thanks in advance!

"grab and contain the piece of HTML code" With what?

– Ignacio Vazquez-Abrams, Commented May 8, 2012 at 5:32
Anything in Python. Currently using BeautifulSoup though.

– Dustin, Commented May 8, 2012 at 5:36
Added BeautifulSoup to tag list.

– Andrew Gorcester, Commented May 8, 2012 at 6:11

Andrew Gorcester · Accepted Answer · 2012-05-08 05:36:29Z

1

If you're using BeautifulSoup it is dead simple to get just the onclick string, which will make this easier. But here's a really crude way to do it:

import re
# Raw string so the backslash in \D reaches the regex engine untouched.
result = re.sub(r"\D", "", html_string)[1:]

\D matches all non-digits, so this will remove everything in the string that isn't a number. Then take a slice to get rid of the "0" from javascript:void(0).

Other options: use re.findall to grab every series of digits and take the second match (re.search alone returns only the first). Or use re.search to match a series of digits after a substring, where the substring is <a href="javascript:void(0);" onclick="viewsite(.
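A minimal sketch of those two alternatives, using the tag from the question as html_string. The variable names are only illustrative, and the first option uses re.findall because re.search stops at the first run of digits:

```python
import re

html_string = ('<a href="javascript:void(0);" onclick="viewsite(38903);" '
               'class="followbutton">Visit</a>')

# Option 1: collect every run of digits and take the second one
# (the first run is the "0" from javascript:void(0)).
second_run = re.findall(r"\d+", html_string)[1]

# Option 2: capture digits only when they directly follow the known prefix.
# re.escape keeps the parentheses and other punctuation literal.
prefix = '<a href="javascript:void(0);" onclick="viewsite('
match = re.search(re.escape(prefix) + r"(\d+)", html_string)
after_prefix = match.group(1) if match else None

print(second_run, after_prefix)
```

Both options depend on the tag keeping exactly this shape, so they are about as fragile as the re.sub version.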

Edit: It sounds like you are using BeautifulSoup. In that case, presumably you have an object which represents the a tag. Let's assume that object is named a:

import re
# Only the onclick value is searched, so no slice is needed to drop a leading "0".
result = re.sub(r"\D", "", a['onclick'])
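If BeautifulSoup isn't at hand, the same "pull out the onclick value first" idea can be sketched with the standard library's html.parser instead. This is a stand-in for the BeautifulSoup lookup, not the approach above, and OnclickCollector is a made-up helper name:

```python
import re
from html.parser import HTMLParser


class OnclickCollector(HTMLParser):
    """Hypothetical helper: collect the onclick value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.onclicks = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            onclick = dict(attrs).get("onclick")
            if onclick is not None:
                self.onclicks.append(onclick)


html_string = ('<a href="javascript:void(0);" onclick="viewsite(38903);" '
               'class="followbutton">Visit</a>')

collector = OnclickCollector()
collector.feed(html_string)

# Strip non-digits from each onclick value, as in the snippet above.
site_ids = [re.sub(r"\D", "", value) for value in collector.onclicks]
print(site_ids)
```

Because only the onclick value is searched, the "0" in the href never gets in the way.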


1 Comment

Andrew Gorcester Over a year ago

(The original version of this answer treated re.sub as if it modified html_string itself, which of course it does not because Python strings are immutable. It's been edited to fix that.)

BrainCore · Accepted Answer · 2012-05-08 05:42:07Z

1

import re
# Raw string keeps \( and \d intact; s is the HTML string being searched.
r = re.compile(r'viewsite\((\d+)\)')
r.findall(s)

This looks specifically for the all-digit argument to viewsite(). You may prefer it to Andrew's answer: with that approach, any other digits that show up in the HTML string would give you incorrect results.
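A quick way to see the difference, assuming a hypothetical variation of the question's tag that carries one more digit (the id="link7" attribute is invented for illustration):

```python
import re

# Hypothetical tag: same as the question's, plus an id containing a digit.
s = ('<a href="javascript:void(0);" onclick="viewsite(38903);" '
     'class="followbutton" id="link7">Visit</a>')

# Strip-everything approach: the stray "7" is glued onto the result.
strip_all = re.sub(r"\D", "", s)[1:]

# Targeted approach: only the argument to viewsite() is captured.
targeted = re.findall(r"viewsite\((\d+)\)", s)

print(strip_all)  # 389037
print(targeted)   # ['38903']
```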


1 Comment

Andrew Gorcester Over a year ago

Yes, this is better -- though if the OP is using BeautifulSoup as mentioned in question comments, it's even better to just nab the onclick string and work on that instead of parsing the whole thing.

mshsayem · Accepted Answer · 2012-05-08 05:41:44Z

0

>>> import re
>>> grabbed_html = '''<a href="javascript:void(0);" onclick="viewsite(38903);" class="followbutton">Visit</a>'''
>>> re.findall(r'viewsite\((\d+)\);', grabbed_html)[0]
'38903'

