Comparing unicode with unicode in python

Question

I am trying to count how many times the same word occurs in an Urdu document that is saved as UTF-8.

For example, I have a document containing three identical words separated by spaces:

خُداوند خُداوند خُداوند

I tried to count the words by reading the file using the following code:

        file_obj = codecs.open(path,encoding="utf-8")
        lst = repr(file_obj.readline()).split(" ")
        word = lst[0]
        count =0
        for w in lst:
            if word == w:
                count += 1
        print count

But the value of count I get is 1, when it should be 3.

How does one compare Unicode strings?

What does lst print as? I have [u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f', u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f', u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'] and those are exactly identical (your code works). But if there are any denormalized forms then they won't be identical.
– Martijn Pieters, Commented Nov 3, 2013 at 10:14
See Normalizing Unicode for the proper way to handle Unicode values with denormalized codepoints.
– Martijn Pieters, Commented Nov 3, 2013 at 10:15
And remove the repr(). You just added u' and ' to the start and end of the string. So word is now "u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f", lst[1] is '\u062e\u064f\u062f\u0627\u0648\u0646\u062f' and lst[2] is "\u062e\u064f\u062f\u0627\u0648\u0646\u062f'". These strings are obviously not equal.
– Martijn Pieters, Commented Nov 3, 2013 at 10:25
Removing repr() gives me an error: File "C:\Python27\lib\encodings\cp437.py", line 12, in encode return codecs.charmap_encode(input,errors,encoding_map) UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-7: character maps to <undefined>
– mdanishs, Commented Nov 3, 2013 at 10:29
Not on that line it won't. Are you trying to print these values somewhere too?
– Martijn Pieters, Commented Nov 3, 2013 at 10:30
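The UnicodeEncodeError quoted in the comments above appears to come from printing to a console whose code page (cp437) has no Arabic letters, not from the comparison itself. As a rough sketch of one way to inspect such values without hitting that error, assuming Python 3: the helper name console_safe is hypothetical and not part of the original thread.

```python
def console_safe(text, encoding='cp437'):
    """Return `text` with characters the encoding lacks shown as \\uXXXX escapes."""
    # backslashreplace substitutes an escape sequence instead of raising
    # UnicodeEncodeError for characters the target code page cannot encode.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# The word from the question, written with escapes so the source stays ASCII.
word = u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'
print(console_safe(word))
```

This only changes how the value is displayed; the comparison should still be done on the original Unicode strings.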

Community · Accepted Answer · 2017-05-23 11:51:50Z

Remove the repr() from your code. Use repr() only to create debug output; you are turning a unicode value into a string that can be pasted back into the interpreter.

This means your line from the file is now stored as:

>>> repr(u'خُداوند خُداوند خُداوند\n').split(" ")
["u'\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f", '\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f', "\\u062e\\u064f\\u062f\\u0627\\u0648\\u0646\\u062f\\n'"]

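Note the double backslashes (the unicode escapes are themselves escaped, so each word is now literal ASCII text rather than Urdu characters), that the first string starts with u', and that the last string ends with \\n'. These three values are never equal to one another, so only lst[0] matches word and count ends up as 1.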

Remove the repr(), and use .split() without arguments to remove the trailing whitespace too:

lst = file_obj.readline().split()

and your code will work:

>>> res = u'خُداوند خُداوند خُداوند\n'.split()
>>> res[0] == res[1] == res[2]
True
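Putting the fix together, the counting loop from the question could be sketched roughly as follows. This assumes Python 3, where every str is Unicode; the function name count_first_word and the escaped sample line are illustrative and not from the original post.

```python
def count_first_word(line):
    """Count how often the first word of `line` occurs in that line."""
    # split() with no argument splits on any whitespace and drops the
    # trailing newline, so the last word compares cleanly.
    words = line.split()
    if not words:
        return 0
    first = words[0]
    return sum(1 for w in words if w == first)


# Sample line from the question: the same word three times, then a newline.
word = u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'
sample = word + u' ' + word + u' ' + word + u'\n'
print(count_first_word(sample))
```

When reading from a file, the line would come from something like io.open(path, encoding="utf-8").readline() in place of the hard-coded sample.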

You may need to normalize the input first; some characters can be expressed either as one unicode codepoint or as two combining codepoints. Normalizing moves all such characters to a composed or decomposed state. See Normalizing Unicode.
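As a small illustration of why normalization can matter, here is a sketch using the standard unicodedata module, assuming Python 3. The example character (alef with madda above) is chosen for illustration and does not occur in the question's sample text; the helper name nfc is hypothetical.

```python
import unicodedata

# U+0622 (alef with madda above) can also be written as
# U+0627 (alef) followed by U+0653 (combining madda above).
composed = u'\u0622'
decomposed = u'\u0627\u0653'

# The two render the same but are different code point sequences.
print(composed == decomposed)


def nfc(text):
    """Return `text` in Normalization Form C (canonical composition)."""
    return unicodedata.normalize('NFC', text)


# After normalizing both sides to the same form, they compare equal.
print(nfc(composed) == nfc(decomposed))
```

Normalizing each word once, right after splitting, keeps the later comparisons simple. NFC is used here, but any form works as long as both sides use the same one.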

satoru · Accepted Answer · 2013-11-03 10:16:22Z


Try removing the repr?

lst = file_obj.readline().split()  # no argument: also drops the trailing newline

The point is that you should at least print variables like lst and w to see what they are.


3 Comments

Martijn Pieters Over a year ago

Even with the repr() the sample input still works; of course the OP should not be using that, however.

Martijn Pieters Over a year ago

Ah, no, the first string will have u', the second has no quotes, the third a trailing '.

Martijn Pieters Over a year ago

Your opening sentence, "Try removing the repr?", suggested otherwise.

Artur · Accepted Answer · 2013-11-03 10:21:03Z


Comparing unicode strings in Python:

a = u'Artur'
print(a)
b = u'\u0041rtur'
print(b)

if a == b:
    print('the same')

result:

Artur
Artur
the same
