2

When I read a text, some of its lines contain a string like <h3 class="heading">General Purpose</h3>. I want to extract only the value, General Purpose, from such a string.

d = re.search(re.escape('<h3 class="heading">')+"(.*?)"+re.escape('</h3>'), str(data2))
if d:
    print(d.group(0))
4
  • Can you make your question more clear? Include data2 in your question and also mention what are you trying to extract from data2. Commented Nov 15, 2016 at 5:38
  • Is this an example string, or do you actually have HTML? stackoverflow.com/questions/1732348/… Commented Nov 15, 2016 at 6:19
  • I think you want d.group(1). 0 is the whole matched string, 1 is the first parenthesized group. Commented Nov 15, 2016 at 6:19
  • data2 is the output of BeautifulSoup. Commented Nov 15, 2016 at 6:33

3 Answers

4
import re

text="""<h3 class="heading">General Purpose</h3>"""
pattern="(<.*?>)(.*)(<.*?>)"

g=re.search(pattern,text)
g.group(2)

Output:

'General Purpose'

Demo on Regex101
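
One caveat with the pattern above: the middle group uses a greedy .* and will over-capture if a line holds more than one tag pair. The following is a minimal sketch of the difference, assuming a made-up line with two headings (the sample line and variable names are hypothetical, not taken from the question's data):

```python
import re

# Hypothetical sample: two headings on a single line (not from the question).
line = ('<h3 class="heading">General Purpose</h3>'
        '<h3 class="heading">Special Purpose</h3>')

# Greedy middle group: runs to the last closing tag on the line.
greedy = re.search(r"(<.*?>)(.*)(<.*?>)", line)
greedy_text = greedy.group(2)

# Lazy middle group: stops at the first tag that follows the text.
lazy = re.search(r"(<.*?>)(.*?)(<.*?>)", line)
lazy_text = lazy.group(2)

# findall with a single capture group returns every heading's text.
all_headings = re.findall(r'<h3 class="heading">(.*?)</h3>', line)

print(greedy_text)
print(lazy_text)
print(all_headings)
```

With a single heading per line, as in the question, both variants give the same result.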

If it's a BeautifulSoup object, then it's even simpler to get the value. You won't need the regex.

from bs4 import BeautifulSoup

text="""<h3 class="heading">General Purpose</h3>"""
# Name the parser explicitly to avoid a warning and parser-dependent results.
a=BeautifulSoup(text, "html.parser")
print(a.select('h3.heading')[0].text)

Output:

General Purpose
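
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the answer uses, and the HeadingExtractor class and the sample markup are hypothetical names made up for illustration:

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every <h3 class="heading"> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "h3" and ("class", "heading") in attrs:
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_heading = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching heading.
        if self.in_heading:
            self.headings[-1] += data


parser = HeadingExtractor()
parser.feed('<p>intro</p><h3 class="heading">General Purpose</h3>')
parser.close()
headings = parser.headings
print(headings)
```

The exact class match here is deliberately simple; an element with several classes would need a looser check.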

3 Comments

If it's already a BeautifulSoup object, then you don't have to use an additional regex to extract the data. You can use BeautifulSoup methods to extract the HTML data.
@kattaprasanth: I wrote my answer before your comment that you're using BeautifulSoup. In that case, please remove the "accepted" checkmark from my answer and give it to this one because it's clearly the better one.
@TimPietzcker Actually, BeautifulSoup was returning None for that. It's working now: I am using tbody to get the required output. Thanks once again.
1

Group 0 contains the entire match; you want the contents of group 1:

print(d.group(1))

But generally, using regexes to parse HTML is not such a good idea (although practically speaking, nested h3 tags should be rather uncommon).
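
To make the group numbering concrete, here is a small sketch that reuses the question's pattern. The data2 value is a stand-in string, assumed for illustration, since the real data2 was not shown:

```python
import re

# Stand-in for the question's data2 (assumed, not the real data).
data2 = '<h3 class="heading">General Purpose</h3>'

d = re.search(
    re.escape('<h3 class="heading">') + "(.*?)" + re.escape('</h3>'),
    data2,
)

# group(0) is the whole match, tags included; group(1) is the first
# parenthesized group, i.e. only the text between the tags.
whole = d.group(0) if d else None
inner = d.group(1) if d else None

print(whole)
print(inner)
```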

1

Warning: this relies on lookbehind, which Python's re module supports but JavaScript engines older than ES2018 do not.

(?<=<h3 class="heading">).*?(?=</h3>)
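
A minimal sketch of how the lookaround approach could be used from Python. The pattern is written here without backslash escapes, because recent Python 3 versions reject unknown escapes such as \h in a pattern; the sample text is the string from the question:

```python
import re

text = '<h3 class="heading">General Purpose</h3>'

# The lookbehind and lookahead assert the surrounding tags without
# consuming them, so the match itself is only the inner text.
pattern = r'(?<=<h3 class="heading">).*?(?=</h3>)'

m = re.search(pattern, text)
value = m.group(0) if m else None
print(value)
```

Because the tags are not part of the match, group(0) already holds the wanted text and no capture group is needed. Python's lookbehind must be fixed-width, which holds here since the opening tag is a literal string.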
