I am trying to count how many times the same word occurs in an Urdu document that is saved as UTF-8.
For example, I have a document containing three identical words separated by spaces:
خُداوند خُداوند خُداوند
I tried to count the words by reading the file using the following code:
file_obj = codecs.open(path,encoding="utf-8")
lst = repr(file_obj.readline()).split(" ")
word = lst[0]
count =0
for w in lst:
    if word == w:
        count += 1
print count
But the value of count I get is 1, while I should get 3.
How does one compare Unicode strings?
What does lst print as? I have [u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f', u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f', u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f'] and those are exactly identical (your code works). But if there are any denormalized forms then they won't be identical.

Remove the repr(). It just added u' and ' to the start and end of the string. So word is now "u'\u062e\u064f\u062f\u0627\u0648\u0646\u062f", lst[1] is "\u062e\u064f\u062f\u0627\u0648\u0646\u062f" and lst[2] is "\u062e\u064f\u062f\u0627\u0648\u0646\u062f'". These strings are obviously not equal.
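A minimal sketch of the comparison without repr(), assuming the line has already been decoded to a Unicode string. The helper name count_first_word is hypothetical, and the NFC normalization step is an assumption added to cover the denormalized-forms case mentioned above; it is not part of the original code. The sample line is written with escapes so the snippet does not depend on the source file's encoding.

```python
import unicodedata


def count_first_word(line):
    """Count how often the first word of `line` occurs in that line."""
    # Assumption: normalize to NFC so canonically equivalent spellings
    # (precomposed vs. combining sequences) compare equal.
    normalized = unicodedata.normalize("NFC", line)
    # split() with no argument splits on any whitespace run and
    # discards the trailing newline left by readline().
    words = normalized.split()
    if not words:
        return 0
    return words.count(words[0])


# The three-word sample line from the question, followed by a newline.
word = u"\u062e\u064f\u062f\u0627\u0648\u0646\u062f"
line = word + u" " + word + u" " + word + u"\n"

print(count_first_word(line))  # prints 3
```

With a file, the same function could be applied to the result of file_obj.readline() directly, since codecs.open with encoding="utf-8" already returns decoded text.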