extract text from html tags using regex

Question

My HTML text looks like this. I want to extract only the plain text from it using a regex in Python (not using HTML parsers).

&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;

What is the exact regex to get the plain text?

Is there any good reason not to use a parser?

– Jan, Commented Nov 24, 2017 at 6:52

Jan · Accepted Answer · 2017-11-24 06:57:38Z

You might be better off using a parser here:

import html, xml.etree.ElementTree as ET

# entity-encoded input (decoded below with html.unescape)
string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

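# construct the DOM (the decoded fragment must be well-formed XML)
root = ET.fromstring(html.unescape(string))

# iterate over the direct children of the root <p> (here, the <span>)
for child in root.findall("*"):
    print(child.text)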

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

You will probably need to adjust the XPath expression for your real markup, so have a look at the XPath subset that ElementTree supports.
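If the text is spread over several nested elements, reading `.text` from the direct children will miss parts of it. A minimal sketch of an alternative, assuming the decoded fragment is well-formed XML (the helper name `element_text` and the shortened sample are made up for illustration): `itertext()` walks the whole subtree and yields every text node.

```python
import html
import xml.etree.ElementTree as ET

# Shortened, entity-encoded sample in the same shape as the question's input.
encoded = (
    '&lt;p style=&quot;text-align: justify;&quot;&gt;'
    '&lt;span style=&quot;font-size: small;&quot;&gt;\n'
    'Irrespective of the kind of small business you own.\n'
    '&lt;/span&gt;&lt;/p&gt;'
)


def element_text(markup: str) -> str:
    """Decode entities, parse as XML, and return all text with whitespace collapsed."""
    root = ET.fromstring(html.unescape(markup))
    # itertext() yields the text and tail of every element in document order.
    return " ".join("".join(root.itertext()).split())


plain = element_text(encoded)
print(plain)
```

This still raises `xml.etree.ElementTree.ParseError` on markup that is not well-formed (unclosed tags, unquoted attributes), so it only fits input you know to be clean.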

Addendum:

It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

import re

string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace and commas up to the next dot. See a demo on regex101.com.
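The pattern above depends on the sentence starting with a capital letter and containing only word characters, whitespace and commas. A different regex approach, shown here only as a rough sketch and not as a robust solution, is to decode the entities first and then delete anything that looks like a tag. The names `TAG_RE` and `strip_tags` are hypothetical, and the sketch assumes no `>` appears inside attribute values and that there are no comments, `<script>` or `<style>` blocks.

```python
import html
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Decode HTML entities, drop tag-like spans, and collapse whitespace."""
    decoded = html.unescape(markup)
    # Replace each tag with a space so adjacent blocks do not run together.
    without_tags = TAG_RE.sub(" ", decoded)
    return " ".join(without_tags.split())


sample = "&lt;p&gt;&lt;span&gt;\nSome text.\n&lt;/span&gt;&lt;/p&gt;"
print(strip_tags(sample))
```

Because every tag becomes a space, inline markup in the middle of a word or directly before punctuation leaves a stray space behind. That is one more reason to prefer a parser for anything beyond simple fragments.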

cchoe1 · Accepted Answer · 2017-11-24 06:02:58Z


You can do this in JavaScript by selecting the element with a simple selector method and then reading its .innerHTML property.

// select the elements with the class whose HTML you want to pull
let div = document.getElementsByClassName('text-div');
// take the first element of the HTMLCollection returned by the selector method and get its inner HTML
let text = div[0].innerHTML;

This selects the element whose HTML you want to retrieve and then pulls its inner HTML, assuming you only want what is between the element's tags and not the tags themselves. Note that innerHTML still includes the markup of any nested elements; the textContent property returns only the text.

Regex is not necessary for this. You would otherwise have to implement the regex in JS or on some back end, whereas as long as you can insert a JS script into your project, you can get the inner HTML directly.

If you're scraping data, your library in whatever language will most likely have selector methods and ways to easily retrieve the HTML text without the need for Regex.
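Since the question is about Python, here is a minimal sketch of the same idea using only the standard library's `html.parser.HTMLParser`, which tolerates unclosed tags. The class name `TextCollector`, the function `visible_text` and the sample markup are made up for illustration. It collects every text node, so it would also pick up the contents of `<script>` and `<style>` elements if any were present.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes the parser encounters, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def visible_text(markup: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    # Join with spaces so text from adjacent elements does not run together.
    return " ".join(" ".join(collector.parts).split())


sample = '<div class="text-div"><p>First <em>second</em></p><p>third</div>'
result = visible_text(sample)
print(result)
```

For entity-encoded input like the one in the question, pass the string through `html.unescape` before feeding it to the parser.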


2 Comments

s.s Over a year ago

I am doing this in Python, but I am new to regex, so I want to know the regex for doing this.

cchoe1 Over a year ago

Oh, sorry, I didn't see this was Python.
