52

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:

Your number is <b>123</b>

Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

5
  • Is the text "Your number is" actually inside any tags? Commented Jun 23, 2012 at 16:41
  • 4
    Relevant: stackoverflow.com/questions/1732348/… Commented Jun 23, 2012 at 16:41
  • 1
    @Endophage: meta-relevant Commented Jun 23, 2012 at 17:19
  • @thg435 Assuming most if not all problems on SO are small test examples for larger problems, very relevant. The op wants to parse html with regexes... Note I didn't link the rant, just the question. Commented Jun 23, 2012 at 17:23
  • 2
    I suggest to use lxml to parse HTML Commented Jun 25, 2012 at 12:18

10 Answers

66
import re
m = re.search(r"Your number is <b>(\d+)</b>",
      "xxx Your number is <b>123</b>  fdjsk")
if m:
    print(m.group(1))  # group 1 is the text captured by (\d+)

1 Comment

Sorry for not being clear enough. However, I used a slightly modified version that works for me: re.search("Your number is <b>([a-zA-Z_][a-zA-Z_0-9]*)</b>",loginData)
26

Given s = "Your number is <b>123</b>" then:

import re 
m = re.search(r"\d+", s)

will work and give you

m.group()
'123'

The regular expression looks for 1 or more consecutive digits in your string.

Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is not None, because calling m.group() on None raises an AttributeError.
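
A minimal sketch of that check, assuming a hypothetical helper named first_number (it is not part of the original answer). It returns None instead of raising when the string contains no digits:

```python
import re

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    # re.search returns None when nothing matches, so test before calling group()
    return m.group() if m else None

print(first_number("Your number is <b>123</b>"))  # 123
print(first_number("Your number is <b>n/a</b>"))  # None
```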

Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
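
As a rough illustration of parsing instead of pattern matching, here is a sketch that uses the standard library's html.parser rather than BeautifulSoup, so that it runs without installing anything. The class name FirstBoldAfter and its attributes are made up for this example, and it assumes the marker text arrives in a single text chunk:

```python
from html.parser import HTMLParser

class FirstBoldAfter(HTMLParser):
    """Collect the text of the first <b> element that follows a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False  # True once the marker text has been seen
        self.in_bold = False      # True while inside the <b> we want
        self.result = None        # text of that <b>, or None if not found

    def handle_starttag(self, tag, attrs):
        # Only the first <b> after the marker is collected
        if tag == "b" and self.seen_marker and self.result is None:
            self.in_bold = True
            self.result = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.result += data
        elif self.marker in data:
            self.seen_marker = True

parser = FirstBoldAfter("Your number is")
parser.feed("xxx Your number is <b>123</b> fdjsk")
parser.close()
print(parser.result)  # 123
```

Unlike a bare \d+ search, this returns whatever the bold element contains, whether or not it is numeric.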

6 Comments

Why the downvote? This is functional and meets OP's requirements as far as I can tell. I am happy to correct any errors or improve my answer if given constructive feedback. However, downvotes without explanation don't help OP, SO or me.
Heh, we've all done it. As for the downvote, maybe someone wanted something more robust? Currently this would fail if there were any digits before the 123.
@DSM :-) .. yes, I agree, this is a narrow solution which is really pretty much just aimed at the specific problem posted. In this case testing the return value of re.search() wasn't necessary either, but that should also happen.
I don't think the OP wants numbers. Their requirements are quite clear: contents of first bold text after string "Your number is"
@thg435 .. it says "how can I extract 123," .. and "..extracts a number from HTML" .. that's what I did. Am I missing something?
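
To illustrate the point raised in these comments, here is a small sketch with a made-up input that has a digit before the wanted number. The variable names are only for the example:

```python
import re

s = "Order 7: Your number is <b>123</b>"  # hypothetical input with an earlier digit

# The bare digit pattern returns the first digits anywhere in the string.
loose = re.search(r"\d+", s).group()
print(loose)  # 7

# Anchoring on the surrounding text picks the intended number.
m = re.search(r"Your number is <b>(\d+)</b>", s)
anchored = m.group(1) if m else None
print(anchored)  # 123
```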
12
import re
x = 'Your number is <b>123</b>'
re.search('(?<=Your number is )<b>(\d+)</b>',x).group(0)

This searches for the number that follows the 'Your number is' string.

1 Comment

If you only want the 123, don't you want .group(1)?
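
A short sketch of the difference the comment points at, using the same pattern and input as the answer above:

```python
import re

x = 'Your number is <b>123</b>'
m = re.search(r'(?<=Your number is )<b>(\d+)</b>', x)

whole = m.group(0)  # the whole match; the lookbehind text is not part of it
inner = m.group(1)  # only the first capturing group
print(whole)  # <b>123</b>
print(inner)  # 123
```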
6
import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))

Comments

4

The simplest way is to just extract the digits (the number):

re.search(r"\d+",text)

Comments

2
import re
val = "Your number is <b>123</b>"

Option 1:

m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)

m.group(2)

Option 2:

re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)

Comments

2
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")

if found:
    print(found.group(1))

Here (\d+) is the capturing group. Since there is only one group, group(1) is used (group(0) is the whole match). When there are several groups, pass the index of the group you want.
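
A small sketch of indexing several groups, using a made-up input with two bold values:

```python
import re

# Hypothetical input with two bold values, used only to show group indexing.
text = "Your number is <b>123</b> and your code is <b>456</b>"
found = re.search(r"Your number is <b>(\d+)</b> and your code is <b>(\d+)</b>", text)

if found:
    print(found.group(1))  # 123
    print(found.group(2))  # 456
    print(found.groups())  # ('123', '456')
```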

Comments

1

To extract the matches as a Python list, you can use findall:

>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern,string)
['123']
>>>
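
If the page holds several bold numbers, findall can also be combined with a capturing group. A sketch on a made-up fragment:

```python
import re

# Hypothetical page fragment containing several bold numbers.
html = "Your number is <b>123</b>, then <b>45</b> and <b>6</b>"

# With one capturing group, findall returns the group contents, not the whole match.
numbers = re.findall(r"<b>(\d+)</b>", html)
print(numbers)  # ['123', '45', '6']
```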

Comments

0

You can use the following example to solve your problem:

import re

text = "Your number is <b>123</b>"  # sample input from the question
match = re.search(r"\d+", text)  # match object for the first run of digits

print("Matched number:", match.group(0))
print("Starting Index Of Digit:", match.start())
print("Ending Index Of Digit:", match.end())

Comments

0
import re
x = 'Your number is <b>123</b>'
output = re.search('(?<=Your number is )<b>(\d+)</b>',x).group(1)
print(output)

2 Comments

Welcome to StackOverflow. Although this may answer the question, it would be useful to explain your code a bit.
This is a correction to @muffel’s answer, and should acknowledge that source.
