I have a dataframe where I need to remove the HTML tags and convert the data to just plain text.
I have found the following (Python code to remove HTML tags from a string):
import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
I'm applying it to my column using:
df['col'] = df['col'].apply(cleanhtml(df['col']))
This raised an error because 'col' has dtype object, so I amended the function to convert the passed argument to a string, as follows:
import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', str(raw_html))
    return cleantext
The code still fails because it is receiving an object rather than a string. The error is:
Name: col, Length: 1021, dtype: object' is not a valid function for 'Series' object.
Can anyone nudge me in the right direction please? Thanks.
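
A possible direction, assuming the goal is to clean each cell individually: `Series.apply` expects a function, whereas `cleanhtml(df['col'])` calls the function straight away on the whole column and hands its return value to `apply`. The sketch below passes the function itself instead. The sample dataframe is made up for illustration and is not the real data.

```python
import re

import pandas as pd

# Non-greedy pattern matching anything between '<' and '>'
CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    # str() converts non-string cells (e.g. numbers) to text first
    return re.sub(CLEANR, '', str(raw_html))

# Hypothetical sample data standing in for the real dataframe
df = pd.DataFrame({'col': ['<p>Hello <b>world</b></p>', '<div>plain</div>']})

# Pass the function itself (no parentheses); apply calls it once per cell
df['col'] = df['col'].apply(cleanhtml)
```

Written this way, `cleanhtml` receives one cell value per call rather than the entire Series. Note that a regex like this is only a rough tag stripper and may not handle every kind of HTML.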