Converting HTML in a pandas dataframe column which was read from a csv file to Plain text

Question

I have a pandas dataframe with two columns: one column containing HTML with small formatting tags such as br and &nbsp; in it, and another column named USEFUL.

I want to convert the HTML column to plain text, without the "br" tags and &nbsp; entities. The HTML may contain other formatting tags as well, so using a regular expression is not an option. I apologize for not providing a sample of the data frame; my formatting is really bad.

Thanks in advance.

Arundathi · Accepted Answer · 2019-04-14 09:04:58Z

4

Method 1:

According to this link, this method is faster than Method 2. It requires the selectolax module (install it with: pip install selectolax). You can find further examples of using this module here.

from selectolax.parser import HTMLParser

df['string_in_HTML'] = df.apply(lambda x: HTMLParser(x['HTML']).body.text(separator=' ').replace('\n', ' '), axis=1)
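Note that .replace('\n', ' ') only handles newlines. The &nbsp; entities mentioned in the question are decoded by the parser into non-breaking space characters ('\xa0'), which would typically still be present in the extracted text. Below is a minimal sketch of a clean-up step that could be chained after the extraction. The name normalize_whitespace is a hypothetical helper, not part of selectolax or pandas, and the sample string is a stand-in for parser output.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any Unicode whitespace,
    # which covers newlines, tabs and the non-breaking space U+00A0.
    # Joining the pieces with ' ' collapses each run into one space.
    return ' '.join(text.split())


# Stand-in for text extracted from 'Hello<br>\nworld&nbsp;&nbsp;again'
extracted = 'Hello \nworld\xa0\xa0again'
cleaned = normalize_whitespace(extracted)
```

In the one-liner above, this would take the place of the .replace call, for example normalize_whitespace(HTMLParser(x['HTML']).body.text(separator=' ')).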

Method 2:

This is the most popular method I have come across on SO. It requires the bs4 module (install it with: pip install bs4).

from bs4 import BeautifulSoup

df['string_in_HTML'] = df.apply(lambda x: BeautifulSoup(x['HTML'], 'html.parser').get_text().replace('\n', ' '), axis=1)
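Because the column was read from a CSV file, empty cells arrive as NaN (a float), and passing a non-string value to either parser will generally raise an error. Below is a minimal sketch of a guard, assuming an empty string is the desired result for such cells. The name safe_convert is a hypothetical helper, and convert stands for whichever extraction function you use. The demo uses the standard library's html.unescape purely as a stand-in converter so that the snippet runs without extra packages.

```python
import html


def safe_convert(value, convert):
    # Only strings are handed to the converter; NaN, None and other
    # non-string cells are mapped to an empty string (an assumption,
    # pick whatever placeholder suits your data).
    if not isinstance(value, str):
        return ''
    return convert(value)


# Stand-in data: one HTML-ish string, one NaN (empty CSV cell), one None.
cells = ['a &amp; b', float('nan'), None]
results = [safe_convert(cell, html.unescape) for cell in cells]
```

With a real converter this could be applied per cell, for example df['HTML'].apply(lambda h: safe_convert(h, my_converter)), where my_converter is a placeholder for the Method 1 or Method 2 extraction.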
