How to use regex to parse a number from HTML?

Question

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:

Your number is <b>123</b>

Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

@thg435 Assuming most if not all problems on SO are small test examples for larger problems, very relevant. The op wants to parse html with regexes... Note I didn't link the rant, just the question.
– Endophage, Commented Jun 23, 2012 at 17:23

Yevgen Yampolskiy · Accepted Answer · 2012-06-23 16:56:45Z

66

import re
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b>  fdjsk")
if m:
    print(m.group(1))  # first capture group: '123'

edited Jun 23, 2012 at 16:56

answered Jun 23, 2012 at 16:18

Yevgen Yampolskiy

1 Comment

Saqib Over a year ago

Sorry for not being clear enough. I used a slightly modified version that is working for me: re.search("Your number is <b>([a-zA-Z_][a-zA-Z_0-9]*)</b>", loginData)

Georg Plaz · Accepted Answer · 2022-07-28 14:22:30Z

26

Given s = "Your number is <b>123</b>" then:

import re 
m = re.search(r"\d+", s)

will work and give you

m.group()
'123'

The regular expression looks for 1 or more consecutive digits in your string.

Note that in this specific case we knew that there would be a numeric sequence. Otherwise, you would have to test the return value of re.search() to make sure that m is a match object rather than None, because calling m.group() on None raises an AttributeError.
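A minimal sketch of that check, assuming a hypothetical helper named extract_number (not part of the original answer). It returns the digits when re.search() finds a match and None otherwise, so m.group() is never called on None.

```python
import re

def extract_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    if m is None:  # re.search() returns None when nothing matches
        return None
    return m.group()

print(extract_number("Your number is <b>123</b>"))  # 123
print(extract_number("no digits here"))             # None
```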

Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
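As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's html.parser module as a stand-in, so it runs without installing anything. It is not BeautifulSoup's API; the names BoldAfterMarker and first_bold_after are hypothetical and chosen only for this example. It collects the text of the first <b> element that appears after the marker string.

```python
from html.parser import HTMLParser

class BoldAfterMarker(HTMLParser):
    """Collect the text of the first <b> element seen after a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False  # True once the marker text has been seen
        self.in_bold = False      # True while inside the <b> we want
        self.result = None        # text of that <b>, once found

    def handle_starttag(self, tag, attrs):
        # Only the first <b> after the marker is of interest
        if tag == "b" and self.seen_marker and self.result is None:
            self.in_bold = True
            self.result = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.result += data
        elif self.marker in data:
            self.seen_marker = True

def first_bold_after(html, marker="Your number is"):
    parser = BoldAfterMarker(marker)
    parser.feed(html)
    parser.close()
    return parser.result

print(first_bold_after("xxx <b>9</b> Your number is <b>123</b> tail"))  # 123
```

Unlike a bare \d+ search, this skips the earlier <b>9</b> because it appears before the marker text.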

edited Jul 28, 2022 at 14:22

Georg Plaz

answered Jun 23, 2012 at 16:15

Levon

6 Comments

Levon Over a year ago

Why the downvote? This is functional and meets OP's requirements as far as I can tell. I am happy to correct any errors or improve my answer if given constructive feedback. However, downvotes without explanation don't help OP, SO or me.

DSM Over a year ago

Heh, we've all done it. As for the downvote, maybe someone wanted something more robust? Currently this would fail if there were any digits before the 123.
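A small sketch of the failure mode described here, using a made-up input string with an earlier digit. The bare \d+ pattern returns the first digits anywhere in the string, while a pattern anchored on the surrounding text returns the intended number.

```python
import re

# Hypothetical input with digits before the number we want
s = "Order 42: Your number is <b>123</b>"

loose = re.search(r"\d+", s).group()
anchored = re.search(r"Your number is <b>(\d+)</b>", s).group(1)

print(loose)     # 42
print(anchored)  # 123
```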

Levon Over a year ago

@DSM :-) .. yes, I agree, this is a narrow solution which is really pretty much just aimed at the specific problem posted. In this case testing the return value of re.search() wasn't necessary either, but that should also happen.

georg Over a year ago

I don't think the OP wants numbers. Their requirements are quite clear: contents of first bold text after string "Your number is"

Levon Over a year ago

@thg435 .. it says "how can I extract 123," .. and "..extracts a number from HTML" .. that's what I did. Am I missing something?

muffel · Accepted Answer · 2012-06-23 16:20:55Z

12

import re
x = 'Your number is <b>123</b>'
# group(0) is the whole match, '<b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(0)

This searches for the number that follows the 'Your number is' string.

answered Jun 23, 2012 at 16:20

muffel

1 Comment

DSM Over a year ago

If you only want the 123, don't you want .group(1)?
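To make the difference concrete, here is a short sketch using the same pattern. group(0) is the entire match, which here includes the <b> tags, while group(1) is the text captured by the first parenthesized group.

```python
import re

x = 'Your number is <b>123</b>'
m = re.search(r'(?<=Your number is )<b>(\d+)</b>', x)

print(m.group(0))  # <b>123</b>  (whole match; the lookbehind text is not part of it)
print(m.group(1))  # 123         (only the first capture group)
```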

the Tin Man · Accepted Answer · 2014-04-15 20:56:15Z

6

import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))

edited Apr 15, 2014 at 20:56

the Tin Man

answered Feb 17, 2014 at 19:20

Jacob Abraham

Avinash Kumar · Accepted Answer · 2016-06-22 10:45:26Z

4

The simplest way is to just extract the digits (the number):

re.search(r"\d+",text)

answered Jun 22, 2016 at 10:45

Avinash Kumar

Nikolay Kostov · Accepted Answer · 2015-07-07 12:16:11Z

2

import re
val = "Your number is <b>123</b>"

Option 1:

m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)
m.group(2)

Option 2:

re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)

edited Jul 7, 2015 at 12:16

Nikolay Kostov

answered Jul 7, 2015 at 11:55

user4613285

Stypox · Accepted Answer · 2018-07-11 20:37:52Z

2

import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")

if found:
    print(found.group(1))

Here (\d+) is the capturing group. Since there is only one group, group(1) returns its contents. When there are several groups, pass the index of the group you want.
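A sketch of how group indices work when the pattern has more than one group, using a made-up input with two bold values. group(0) is the whole match, group(1) and group(2) are the first and second groups, and groups() returns a tuple of all captured groups.

```python
import re

# Hypothetical input with two bold values
text = "Your number is <b>123</b> and your code is <b>456</b>"
found = re.search(
    r"Your number is <b>(\d+)</b> and your code is <b>(\d+)</b>", text
)

if found:
    print(found.group(1))  # 123
    print(found.group(2))  # 456
    print(found.groups())  # ('123', '456')
```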

edited Jul 11, 2018 at 20:37

Stypox

answered Jun 14, 2018 at 12:24

Sykam Sreekar Reddy

Arun · Accepted Answer · 2019-11-25 12:31:44Z

1

To extract the matches as a Python list, you can use findall:

>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern, string)
['123']
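findall also helps when the text contains several numbers. Here is a brief sketch with a made-up input. When the pattern has exactly one capture group, re.findall returns a list of that group's contents for every non-overlapping match.

```python
import re

# Hypothetical input containing several bold values
html = "First <b>123</b>, second <b>456</b>, third <b>x7</b>"

# With exactly one capture group, findall returns the group contents
numbers = re.findall(r"<b>(\d+)</b>", html)
print(numbers)  # ['123', '456']
```

The last element is skipped because its bold text does not consist only of digits.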

answered Nov 25, 2019 at 12:31

Arun

Grant Miller · Accepted Answer · 2018-10-04 01:58:52Z

0

You can use the following example to solve your problem:

import re

text = "Your number is <b>123</b>"
match = re.search(r"\d+", text)  # match object for the first run of digits
print("Matched number:", match.group(0))
print("Starting index of digits:", match.start())
print("Ending index of digits:", match.end())

edited Oct 4, 2018 at 1:58

Grant Miller

answered Oct 3, 2018 at 21:03

sadiq shah

dbc · Accepted Answer · 2021-05-17 13:38:00Z

0

import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)

edited May 17, 2021 at 13:38

dbc

answered May 17, 2021 at 13:20

Anand K

2 Comments

Dominik Over a year ago

Welcome to StackOverflow. Although this may answer the question, it would be useful to explain your code a bit.

Jeremy Caney Over a year ago

This is a correction to @muffel’s answer, and should acknowledge that source.
