I'm trying to find a way to replace the html syntax within a string. I get a lot of csv files, so I found pandas to be an awesome tool to deal with csv.
Sometimes I receive data that has html syntax embedded in a string, for example in the address column, i see 125끈 downing st , which is 125-128 downing st. It's not only 1, I also get ' and &, which are ' and &.
I tried to do this code, but I know it's not giving me the right format
df = df.replace(r'[-]','-', regex=True)
I get several columns such as business name, address, city, state - so I would like to target all columns just to make sure all the html syntax are removed/replaced
Dataframe format
Address 1 Company
0 1stDŽst Avenue N johnson & johnson
1 243񯂐 Kingsway Ave cold & brew
2 300 Hwy 7 coder's club
Desired format
Address 1 Company
0 1st-2st Avenue N johnson and johnson
1 243-4800 Kingsway Ave cold and brew
2 300 Hwy 7 coder's club
I'm a newbie to pandas, but so I far I love this tool. Thanks for your help.
UPDATE:
Ok, I found my mistake on the code I use, this df = df.replace(r'[-]','-', regex=True) should be df = df.replace(r'-','-', regex=True); this will replace that html syntax. However I still would like to find a pythonic solution to this problem.
I am tempted to do this code in order to remove the other html syntax.
df = df.replace(r'-','-', regex=True).replace(r''','', regex=True).replace(r'&','and', regex=True)
is there a clean solution to this line of code?
df.replace([r'-', r'''], ['-', ''], regex=True), but I am not sure. Check the docs