I have a pandas DataFrame with two columns: one containing HTML with small formatting tags such as br and &nbsp;, and another named USEFUL.

I want to convert the HTML column to plain text, without the br tags and &nbsp; entities. The HTML may contain other formatting tags as well, so a regular expression is not an option. Apologies for not providing a sample of the DataFrame; my formatting is really bad.

Thanks in advance.

1 Answer

Method 1:

According to this link, this method is faster than Method 2. It requires the selectolax module (install it with pip install selectolax). Further examples of using the module can be found here.

from selectolax.parser import HTMLParser

# Parse each cell of the HTML column and keep only its text, with newlines turned into spaces
df['string_in_HTML'] = df['HTML'].apply(lambda html: HTMLParser(html).body.text(separator=' ').replace('\n', ' '))

Method 2:

This is the most popular method I have come across on SO. It requires the bs4 module (install it with pip install bs4).

from bs4 import BeautifulSoup

# Name the parser explicitly; bs4 otherwise picks one itself and emits a warning
df['string_in_HTML'] = df['HTML'].apply(lambda html: BeautifulSoup(html, 'html.parser').get_text().replace('\n', ' '))
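
If installing a third-party package is not an option, the same idea can be roughly sketched with the standard library alone. This is not one of the two methods above but a swapped-in alternative built on html.parser, and the names _TextExtractor and html_to_text are made up for illustration. It assumes the cells hold small formatting fragments like the ones described in the question. Anything inside script or style tags would be kept as text, and a space is inserted wherever a tag splits a word.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &nbsp;
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment with all whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # str.split() with no arguments also splits on the non-breaking space
    # (U+00A0) that &nbsp; decodes to, so it is removed along with newlines.
    return " ".join(" ".join(extractor.chunks).split())


sample = "<p>Hello<br>world&nbsp;&amp; <b>friends</b></p>"
print(html_to_text(sample))
```

With a DataFrame like the one in the question, this could then be applied per cell as df['string_in_HTML'] = df['HTML'].apply(html_to_text). The same whitespace-collapsing step should also work on the output of either method above if stray non-breaking spaces remain.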