My HTML looks like the snippet below. I want to extract only the plain text from it using a regex in Python (not using an HTML parser).

<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>

What regex would extract just the plain text?

  • Is there any good reason not to use a parser? Commented Nov 24, 2017 at 6:52

2 Answers

You might be better off using a parser here:

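import html
import xml.etree.ElementTree as ET

# the HTML snippet (ElementTree needs it to be well-formed XML)
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

# decode HTML entities, then construct the DOM
root = ET.fromstring(html.unescape(string))

# search it: "*" selects the direct children of <p>, here the <span>
for child in root.findall("*"):
    # strip the newlines that surround the sentence inside the <span>
    print(child.text.strip())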

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

Obviously, you might want to change the XPath expression for other documents, so have a look at the XPath subset that ElementTree supports.
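If the text is spread over several nested elements, one option is to skip the path expression and collect all text below the root instead. The following is a minimal sketch under the assumption that the snippet is already unescaped and is well-formed XML; the names `snippet` and `plain_text` are only illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup is well-formed XML, otherwise ET.fromstring raises ParseError.
snippet = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

root = ET.fromstring(snippet)

# itertext() yields the text of the element and of all its descendants, in document order.
plain_text = "".join(root.itertext()).strip()
print(plain_text)
```

This works regardless of how deeply the text is nested, but it still fails on HTML that is not well-formed XML (for example an unclosed `<br>`).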


Addendum:

It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

import re

string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

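The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace and commas up to a dot. The match stops short as soon as the sentence contains any other character, such as an apostrophe or a hyphen, which is one reason the approach is fragile. See a demo on regex101.com.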
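A different regex approach, sketched here as an alternative and not as the method used above, is to delete everything that looks like a tag and keep the rest. The names `TAG_RE` and `strip_tags` are made up for this example, and the pattern assumes that no attribute value contains a `>` and that the input has no comments, `<script>` or `<style>` blocks.

```python
import html
import re

# Anything from "<" to the next ">" is treated as a tag (an assumption that
# breaks on ">" inside attribute values, on comments, and on script/style content).
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like substrings, decode entities, and trim surrounding whitespace."""
    without_tags = TAG_RE.sub("", markup)
    return html.unescape(without_tags).strip()


snippet = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

print(strip_tags(snippet))
```

Unlike the sentence-matching pattern, this does not depend on which characters the text contains, but it remains a heuristic and not a substitute for a parser.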


1 Comment

@s.s: See the lower part.

You can do this in JavaScript by using a simple selector method and then reading the .innerHTML property.

//select the class for which you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
//select the first element of NodeList returned from selector method and get the inner HTML 
let text = div[0].innerHTML; 

This selects the element and reads its inner HTML, that is, everything between the element's opening and closing tags. Note that .innerHTML keeps any nested tags (such as the <span> in the question); if you want only the text, read .textContent instead.

Regex is not necessary for this. You would have to run the regex in JS or on some back end anyway, so as long as you can insert a JS script into your project, you can get the inner HTML directly.

If you're scraping data, your library in whatever language will most likely have selector methods and ways to easily retrieve the HTML text without the need for Regex.
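Since the question is about Python, here is a rough standard-library sketch of the same idea. It uses `html.parser.HTMLParser`, which has no selector methods but tolerates markup that is not well-formed; the class `TextCollector` and the function `extract_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every piece of character data found between tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup: str) -> str:
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks).strip()


snippet = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

print(extract_text(snippet))
```

Note that this collects all character data, including the contents of `<script>` and `<style>` elements if the input has any.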

2 Comments

I am doing this in Python, but I'm new to regex, so I want to know the regex for doing this.
Oh sorry, I didn't see this was Python.
