replacing html syntax/ascii code in pandas columns [duplicate]

Question

I'm trying to find a way to replace the html syntax within a string. I get a lot of csv files, so I found pandas to be an awesome tool to deal with csv.

Sometimes I receive data that has html syntax embedded in a string, for example in the address column, i see 125&#45128 downing st , which is 125-128 downing st. It's not only &#49, I also get &#39 and &amp, which are ' and &.

I tried to do this code, but I know it's not giving me the right format

df = df.replace(r'[&#45]','-', regex=True)

I get several columns such as business name, address, city, state - so I would like to target all columns just to make sure all the html syntax are removed/replaced

Dataframe format

    Address 1                   Company
0   1st&#452st Avenue N         johnson &amp johnson
1   243&#454800 Kingsway Ave    cold &amp brew 
2   300 Hwy 7                   coder&#39s club

Desired format

    Address 1                   Company
0   1st-2st Avenue N            johnson and johnson
1   243-4800 Kingsway Ave       cold and brew 
2   300 Hwy 7                   coder's club

I'm a newbie to pandas, but so I far I love this tool. Thanks for your help.

UPDATE:

Ok, I found my mistake on the code I use, this df = df.replace(r'[&#45]','-', regex=True) should be df = df.replace(r'&#45','-', regex=True); this will replace that html syntax. However I still would like to find a pythonic solution to this problem.

I am tempted to do this code in order to remove the other html syntax.

df = df.replace(r'&#45','-', regex=True).replace(r'&#39','', regex=True).replace(r'&amp','and', regex=True)

is there a clean solution to this line of code?

Thanks for editing your answer. I think you can change the values by lists. For example df.replace([r'&#45', r'&#39'], ['-', ''], regex=True), but I am not sure. Check the docs — gabra
– gabra, Commented Sep 22, 2016 at 0:17
@gabra no problem and thanks for the suggestion about the edit. Yes, the code provided in the first comment worked like a charm, this is what I need, trying to be a better python coder. Thanks so much! — medev21
– medev21, Commented Sep 22, 2016 at 0:28

gabra · Accepted Answer · 2016-09-22 00:29:14Z

1

Just to leave an answer to new people coming here, a possible solution, as discussed on the comments, is:

df.replace([r'&#45', r'&#39', r'&amp'], ['-', '', 'and'], regex=True)

or

df.replace({'&#45':'-', r'&#39':'', '&amp':'and'}, regex=True)

answered Sep 22, 2016 at 0:29

gabra

10.8k4 gold badges32 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

replacing html syntax/ascii code in pandas columns [duplicate]

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related