2

I'm trying to find a way to replace the html syntax within a string. I get a lot of csv files, so I found pandas to be an awesome tool to deal with csv.

Sometimes I receive data that has html syntax embedded in a string, for example in the address column, i see 125&#45128 downing st , which is 125-128 downing st. It's not only &#49, I also get &#39 and &amp, which are ' and &.

I tried to do this code, but I know it's not giving me the right format

df = df.replace(r'[&#45]','-', regex=True)

I get several columns such as business name, address, city, state - so I would like to target all columns just to make sure all the html syntax are removed/replaced

Dataframe format

    Address 1                   Company
0   1st&#452st Avenue N         johnson &amp johnson
1   243&#454800 Kingsway Ave    cold &amp brew 
2   300 Hwy 7                   coder&#39s club

Desired format

    Address 1                   Company
0   1st-2st Avenue N            johnson and johnson
1   243-4800 Kingsway Ave       cold and brew 
2   300 Hwy 7                   coder's club 

I'm a newbie to pandas, but so I far I love this tool. Thanks for your help.

UPDATE:

Ok, I found my mistake on the code I use, this df = df.replace(r'[&#45]','-', regex=True) should be df = df.replace(r'&#45','-', regex=True); this will replace that html syntax. However I still would like to find a pythonic solution to this problem.

I am tempted to do this code in order to remove the other html syntax.

df = df.replace(r'&#45','-', regex=True).replace(r'&#39','', regex=True).replace(r'&amp','and', regex=True)

is there a clean solution to this line of code?

2
  • Thanks for editing your answer. I think you can change the values by lists. For example df.replace([r'&#45', r'&#39'], ['-', ''], regex=True), but I am not sure. Check the docs Commented Sep 22, 2016 at 0:17
  • 1
    @gabra no problem and thanks for the suggestion about the edit. Yes, the code provided in the first comment worked like a charm, this is what I need, trying to be a better python coder. Thanks so much! Commented Sep 22, 2016 at 0:28

1 Answer 1

1

Just to leave an answer to new people coming here, a possible solution, as discussed on the comments, is:

df.replace([r'&#45', r'&#39', r'&amp'], ['-', '', 'and'], regex=True)

or

df.replace({'&#45':'-', r'&#39':'', '&amp':'and'}, regex=True)
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.