57

I have a string that I got from reading an HTML webpage. It contains symbols like "•" because of a bulleted list. Note that the text is the HTML source of a webpage, read with Python 2.7's urllib2 (urllib2.urlopen(webaddress).read()).

I know the bullet character is the Unicode code point U+2022, but how do I actually replace that character with something else?

I tried doing str.replace("•", "something")

but it does not appear to work... how do I do this?

  • What is the type of the string, and which version of Python are you using? Commented Oct 26, 2012 at 20:14
  • I am using Python 2.7, string is formed from urllib2.read() Commented Oct 26, 2012 at 20:15
  • I'm sorry, I'm not going to download a webpage using urllib2 now. What is the type? str or unicode? Commented Oct 26, 2012 at 20:16
  • Have you tried u.encode('ascii', 'replace') and then replacing '?' ? Commented Oct 26, 2012 at 20:16
  • If your Python code contains UTF-8 characters, you should use the 'magic comment' # coding=utf8 on the first or second line of your code. Commented Oct 15, 2013 at 12:09

7 Answers

85
  1. Decode the string to Unicode. Assuming it's UTF-8-encoded:

    str.decode("utf-8")
    
  2. Call the replace method and be sure to pass it a Unicode string as its first argument:

    str.decode("utf-8").replace(u"\u2022", "*")
    
  3. Encode back to UTF-8, if needed:

    str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")
    

(Fortunately, Python 3 puts a stop to this mess by keeping text (str) and bytes as separate types. Step 3 should really only be performed just prior to I/O. Also, note that naming a string str shadows the built-in type str.)
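
For the Python 3 case raised in the comments, a minimal sketch of the same three steps might look like this. It assumes the page is UTF-8 encoded; raw is a hypothetical stand-in for the bytes returned by reading the page, and replace_bullets is an illustrative name, not a library function.

```python
# Hypothetical stand-in for the bytes returned by reading a web page.
raw = "<li>\u2022 first item</li>".encode("utf-8")

def replace_bullets(data, replacement="*", encoding="utf-8"):
    """Decode bytes to text, then replace U+2022 with replacement."""
    text = data.decode(encoding)                 # step 1: bytes -> str
    return text.replace("\u2022", replacement)   # step 2: replace on text

cleaned = replace_bullets(raw)
print(cleaned)

# Step 3: encode again only when bytes are needed, e.g. just before I/O.
encoded = cleaned.encode("utf-8")
```

In Python 3, replace works on str directly, which is why .decode() only exists on the bytes object.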


4 Comments

I get this error: TypeError: Can't convert 'bytes' object to str implicitly
could you please elaborate in what way "Python 3 puts a stop to this mess"? How would I do this in Python 3 then?
I get: .decode("utf-8") -> AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
In Python 3, drop the .decode("utf-8"); the .replace(u"\u2022", "*") call should then work. See question stackoverflow.com/q/15335052/1569557
16

Make the string a unicode string (note the u prefix on the literals).

>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'

2 Comments

What is "special"? I get a NameError: name 'special' is not defined.
@Rolando Notice that the string is prefixed with 'u', which makes it a unicode string.
11

Try this. It converts literal escape sequences in the string into the actual characters, giving you a normal string:

str.encode().decode('unicode-escape')

After that, you can perform any replacement:

str.replace('•','something')
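
A small sketch of the case this answer targets, written for Python 3: the source text contains the six literal characters \u2022 rather than the bullet itself. The sample string is made up, and the approach assumes the input is otherwise pure ASCII, since unicode-escape decodes bytes as Latin-1 and would mangle other non-ASCII characters.

```python
# Made-up sample: a backslash followed by "u2022", not a real bullet.
escaped = "item \\u2022 one"

# Turn the literal escape sequence into the character it names.
decoded = escaped.encode("ascii").decode("unicode-escape")

# Now an ordinary replace finds the bullet.
replaced = decoded.replace("\u2022", "*")
print(replaced)
```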

1 Comment

This proves useful when the \u sequence is actually present as is in the source string.
3
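import re
# Match the bullet character itself (U+2022), not the literal text "u'2022'".
regex = re.compile(u"\u2022", re.UNICODE)
# something is the replacement text, yourstring is the unicode string to search.
newstring = regex.sub(something, yourstring)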
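
For a single fixed character, plain replace is enough, as the comments below point out. A regex becomes more useful when several bullet-like characters should be handled in one pass. The following Python 3 sketch assumes a hypothetical set of three characters (U+2022, U+25E6, U+2023); adjust it to whatever actually appears in the page.

```python
import re

# Hypothetical set of bullet-like characters:
# bullet, white bullet, triangular bullet.
BULLETS = re.compile("[\u2022\u25e6\u2023]")

def normalize_bullets(text, replacement="*"):
    """Replace every character in the BULLETS class with replacement."""
    return BULLETS.sub(replacement, text)

sample = "\u2022 one \u25e6 two \u2023 three"
print(normalize_bullets(sample))
```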

5 Comments

It is not an asterisk, it is a bullet (circle shape)
When trying: re.sub(u'2022', varcontainingstring, ''), it makes the string empty with nothing in it.
@Damascusi fixed - try it now
@NullUserException Why is it a bad idea to use a regex to replace fixed strings?
@AntonTeodor Regex is less efficient than a simple string search and replace. It will work though
-1
str1 = "This is Python\u500cPool"

Encode the string to ASCII, replacing every non-ASCII character with '?'.

str1 = str1.encode("ascii", "replace")

Decode the byte stream to string.

str1 = str1.decode(encoding="utf-8", errors="ignore")

Replace the question mark with the desired character.

str1 = str1.replace("?"," ")
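
One caveat with this approach is worth seeing concretely: after the ASCII round trip, a replaced character can no longer be told apart from a question mark that was in the text all along. A short Python 3 sketch with a made-up sample string:

```python
text = "Is this Python\u2022Pool?"

# Every non-ASCII character becomes "?" during encoding.
ascii_text = text.encode("ascii", "replace").decode("ascii")

# Replacing "?" now also hits the genuine question mark at the end.
lossy = ascii_text.replace("?", " ")

# Replacing the specific character leaves other punctuation alone.
targeted = text.replace("\u2022", " ")

print(repr(lossy))
print(repr(targeted))
```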


-2

Funny, the answer is hidden among the other answers.

str.replace("•", "something")

would work if you use the right syntax.

str.replace(u"\u2022","something")

works wonders ;) Thanks to RParadox for the hint.


-2

If you want to remove all such \u characters, the code below may work for you:

def replace_unicode_character(self, content: str):
    content = content.encode('utf-8')
    if "\\x80" in str(content):
        count_unicode = 0
        i = 0
        while i < len(content):
            if "\\x" in str(content[i:i + 1]):
                if count_unicode % 3 == 0:
                    content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
                i += 2
                count_unicode += 1
            i += 1

        content = content.replace(b'\x80\x80\x80', b'')
    return content.decode('utf-8')
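
If the goal is simply to drop every non-ASCII character, a shorter route is the ignore error handler. This is a sketch of an alternative, not a rewrite of the function above, and strip_non_ascii is an illustrative name:

```python
def strip_non_ascii(content):
    """Return content with every non-ASCII character removed."""
    # "ignore" silently drops characters that ASCII cannot represent.
    return content.encode("ascii", "ignore").decode("ascii")

print(strip_non_ascii("a\u2022b\u00e9c"))
```

This version does not depend on how many bytes a character takes up in UTF-8.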

