Removing HTML formatting from column in dataframe

Question

I have a dataframe where I need to remove the HTML tags and convert the data to just plain text.

I have found the following (Python code to remove HTML tags from a string):

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

I'm applying it to my column using:

df['col'] = df['col'].apply(cleanhtml(df['col']))

This caused an error as the 'col' was of the datatype Object, so I amended the function to convert the passed argument to a string, as follows:

import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', str(raw_html))
    return cleantext

The code still fails as it's receiving an object not string. The error is:

Name: col, Length: 1021, dtype: object' is not a valid function for series' object.

Can anyone nudge me in the right direction please? Thanks.

Could you share a sample of your DataFrame, please?

– Gооd_Mаn, Commented Apr 17, 2024 at 17:26

Gооd_Mаn · Accepted Answer · 2024-04-17 17:31:41Z

1

import re
import pandas as pd

raw_html = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

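# Match HTML comments (<!-- ... -->) and any tag (<...>)
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
# str() converts non-string cells before the substitution
clean_html = lambda rawhtml: tag_re.sub('', str(rawhtml))
df = pd.DataFrame({"col":[raw_html, raw_html]})
# Clean every cell of the column
html_to_text = [clean_html(h) for h in df.col]

df.col = html_to_text
print(df.col)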

Output:

0    \nTitle\nA long text........ \n a link \n
1    \nTitle\nA long text........ \n a link \n
Name: col, dtype: object
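
As for the error in the question: `apply` expects a function, but `cleanhtml(df['col'])` is evaluated first, so the whole Series is turned into one string by `str()` and that string is handed to `apply`, which then reports it as "not a valid function". A minimal sketch of the same cleanup with `apply`, assuming a column named `col` and made-up sample rows (the data and the name `clean_html` are illustrative, not taken from the asker's DataFrame):

```python
import re

import pandas as pd

# Same pattern as above: HTML comments or any <...> tag
TAG_RE = re.compile(r"(<!--.*?-->|<[^>]*>)")


def clean_html(raw_html):
    """Strip tags from one cell; str() converts non-string values first."""
    return TAG_RE.sub("", str(raw_html))


# Hypothetical sample data standing in for the real column
df = pd.DataFrame({"col": ["<p>Hello <b>world</b></p>", "<a href=''>a link</a>"]})

# Pass the function object itself; apply() then calls it once per cell.
# Writing apply(clean_html(df["col"])) would instead call it once on the
# whole Series and hand the resulting string to apply().
df["col"] = df["col"].apply(clean_html)

print(df["col"].tolist())  # ['Hello world', 'a link']
```

If every cell is already a string, `df["col"].str.replace(r"<[^>]*>", "", regex=True)` should give the same result without a helper function. Note that a regex only strips the tags; it does not decode entities such as `&amp;`.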

edited Apr 17, 2024 at 17:31

answered Apr 17, 2024 at 17:14

Gооd_Mаn
