How to work with unicode in Python

Question

I am trying to clean all of the HTML out of a string so the final output is a text file. I have done some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying HTML. To begin comparing the speed of my solution with one of the alternatives, for example pyparsing, I decided to test replacing \xa0 using the string method replace. I get a

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

The actual line of code was

s=unicodestring.replace('\xa0','')

Anyway, I decided that I needed to prefix it with an r, so I ran this line of code:

s=unicodestring.replace(r'\xa0','')

It runs without error, but when I look at a slice of s I see that the \xa0 is still there.

Why would you prefix '\xa0' with an r? That makes it a raw string, that is, it literally contains backslash, x, a, 0. Without the r, it contained a single character with hex code a0, which I think is what you wanted.
– David Z, Commented Apr 15, 2009 at 18:26
Because I was trying to guess why I got the error, and I know that sometimes to force the \ to be read you have to make it a string literal. Also, the \xa0 is what actually exists in my source. What is hex code a0?
– PyNEwbie, Commented Apr 15, 2009 at 18:44

z33m · Accepted Answer · 2009-04-15 18:22:48Z

25

Maybe you should be doing

s=unicodestring.replace(u'\xa0',u'')
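
For readers on Python 3, where every str literal is already Unicode and the u prefix is optional, the same idea can be sketched roughly as follows. The helper name strip_nbsp and the sample text are made up for illustration and are not part of the original answer. The sketch also looks up what "hex code a0" is, using the standard unicodedata module.

```python
import unicodedata

NBSP = "\xa0"  # U+00A0, the character that the HTML entity &nbsp; stands for


def strip_nbsp(text, replacement=""):
    """Return text with every U+00A0 replaced (hypothetical helper)."""
    return text.replace(NBSP, replacement)


sample = "Hello,\xa0World"             # made-up sample text
print(unicodedata.name(NBSP))          # official Unicode name of the character
print(repr(strip_nbsp(sample)))        # character removed entirely
print(repr(strip_nbsp(sample, " ")))   # or swapped for an ordinary space
```

Replacing with an ordinary space rather than an empty string is often the safer choice when the character separates two words.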

answered Apr 15, 2009 at 18:22

z33m


1 Comment

PyNEwbie Over a year ago

So how did you know to do this since I have not seen this in any example? Thanks

dbr · Accepted Answer · 2009-04-15 20:33:03Z

6

s=unicodestring.replace('\xa0','')

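..creates a byte string containing the single byte 0xa0. Because unicodestring is a unicode object, Python 2 then tries to decode that byte string implicitly with the ASCII codec, and 0xa0 is not valid ASCII (byte strings are the default string type in Python until version 3.x).

The reason r'\xa0' did not raise an error is that escape sequences have no effect in a raw string. Rather than turning \xa0 into a single character, Python saw the string as a "literal backslash", "literal x" and so on, which is plain ASCII and simply never matches anything in the text.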

The following are the same:

>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'

This is resolved in Python 3, where the default string type is Unicode, so you can just do..

>>> '\xa0'
'\xa0'
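
As a rough Python 3 illustration of the difference between the raw and the escaped literal (the variable names and sample text are made up), the following compares their lengths and shows why the raw version leaves the text unchanged. It also shows that Python 3 refuses to mix bytes and str instead of decoding implicitly.

```python
raw = r"\xa0"      # four characters: backslash, x, a, 0
escaped = "\xa0"   # one character: U+00A0

print(len(raw), len(escaped))
print(raw == "\\xa0")               # raw literal equals the doubled-backslash form

text = "Hello,\xa0World"            # made-up sample text
unchanged = text.replace(raw, "")   # nothing to match: text has no literal backslash
cleaned = text.replace(escaped, "")

# Python 3 does not decode byte strings implicitly; mixing the types fails loudly.
try:
    text.replace(b"\xa0", b"")
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

print(repr(unchanged), repr(cleaned), mixed_types_ok)
```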

I am trying to clean all of the HTML out of a string so the final output is a text file

I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML, and dealing with Unicode..

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>
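
If installing BeautifulSoup is not an option, a much cruder text extractor can be sketched with Python 3's standard html.parser module. This is a swapped-in technique, not what the answer above uses, and it will cope less gracefully with badly broken markup. The class name TextExtractor and the function html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and drop the tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # split() treats U+00A0 as whitespace in Python 3, so it is normalised too
    return " ".join(" ".join(parser.parts).split())


print(html_to_text("<html><body><h1>Hi</h1></body></html>"))
```

Joining the text chunks with a space keeps adjacent block elements apart, at the cost of inserting a space where an inline tag splits a word.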

answered Apr 15, 2009 at 20:33

dbr


3 Comments

PyNEwbie Over a year ago

I appreciate this answer. I have used BS to extract data from tables and it is very useful. However, it seems to me that to remove the html using BS I have to know what is present. Am I wrong about that?

dbr Over a year ago

I'm not sure what you mean? You can remove HTML via countless ways, from the first table in a div, to by-class-or-id etc..

Gourneau Over a year ago

BeautifulSoup.prettify() was just a life saver! Thanks!

Wayne Koorts · Accepted Answer · 2009-04-15 18:17:29Z

3

Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.

There's also a good article here that puts it all together.
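
A minimal Python 3 sketch of the decode and encode round trip is shown below. It uses the module-level codecs.decode and codecs.encode functions rather than the Codec class methods directly, and it assumes the input bytes are Latin-1, which is only an example choice.

```python
import codecs

raw_bytes = b"Hello,\xa0World"   # made-up input, assumed to be Latin-1 encoded

text = codecs.decode(raw_bytes, "latin-1")   # bytes -> str
utf8_bytes = codecs.encode(text, "utf-8")    # str -> bytes in another encoding

print(repr(text))
print(utf8_bytes)
```

Decoding at the input boundary and encoding again only on output keeps all of the cleaning code working on Unicode text.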

answered Apr 15, 2009 at 18:17

Wayne Koorts


1 Comment

PyNEwbie Over a year ago

Thanks-great article you are right it does put a lot together.

André Dion · Accepted Answer · 2014-06-30 11:44:49Z

2

Instead of this, it's better to use standard Python features.

For example:

string = unicode('Hello, \xa0World', 'utf-8', 'replace')

or

string = unicode('Hello, \xa0World', 'utf-8', 'ignore')

where replace substitutes the Unicode replacement character U+FFFD for the byte \xa0, which is not valid UTF-8 on its own.

But if \xa0 is really not meaningful to you and you want to remove it, then use ignore.
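
In Python 3 the unicode() builtin no longer exists, but the same error handlers are available on bytes.decode. The following is a small sketch with made-up sample data. The last line assumes the bytes are really Latin-1, in which case decoding with that codec keeps the character instead of discarding it.

```python
data = b"Hello, \xa0World"   # the lone byte 0xa0 is not valid UTF-8

replaced = data.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
ignored = data.decode("utf-8", errors="ignore")    # invalid byte is dropped
latin1 = data.decode("latin-1")                    # byte kept as U+00A0

print(repr(replaced))
print(repr(ignored))
print(repr(latin1))
```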

edited Jun 30, 2014 at 11:44

André Dion


answered Sep 13, 2012 at 13:19

Tejas Tank


Ólafur Waage · Accepted Answer · 2009-04-15 18:18:02Z

1

Just a note regarding HTML cleaning. It is very very hard, since

<
body
>

Is a valid way to write HTML. Just an fyi.

answered Apr 15, 2009 at 18:18

Ólafur Waage


Jason Coon · Accepted Answer · 2009-04-15 18:18:07Z

0

You can convert it to unicode in this way:

print u'Hello, \xa0World'  # prints "Hello,  World"; the \xa0 shows up as a non-breaking space

answered Apr 15, 2009 at 18:18

Jason Coon
