What is the best way to compare a string entered by the user with another string?

For example:

# -*- coding: utf-8 -*-

from __future__ import unicode_literals

user_input = raw_input("Please, write árido: ").decode("utf8")
if u"árido" == user_input:
    print "OK"
else:
    print "FALSE"

EDIT:

This

# -*- coding: utf-8 -*-

from __future__ import unicode_literals
from unicodedata import normalize
import sys

uinput2 = "árbol"
uinput = raw_input("type árbol: ")

print "Encoding %s" % sys.stdout.encoding
print "User Input \t\tProgram Input"
print "-"*50
print "%s \t\t\t%s \t(raw value)" % (uinput, uinput2)
print "%s \t\t\t%s \t(unicode(value))" % (unicode(uinput), unicode(uinput2))
print "%s \t\t\t%s \t(value.decode('utf8'))" % (uinput.decode("utf-8"), uinput2.decode("utf-8"))
print "%s \t\t\t%s \t(normalize('NFC',value))" % (normalize("NFC",uinput.decode("utf-8")), normalize("NFC",uinput2.decode("utf-8")));
print "\n\nUser Input \t\tProgram Input (Repr)"
print "-"*50
print "%s \t%s" % (repr(uinput),repr(uinput2))
print "%s \t%s \t(unicode(value))" % (repr(unicode(uinput)), repr(uinput2))
print "%s \t%s \t(value.decode('utf8'))" % (repr(uinput.decode("utf-8")), repr(uinput2.decode("utf-8")))
print "%s \t%s \t(normalize('NFC',value)))" % (repr(normalize("NFC",uinput.decode("utf-8"))), repr(normalize("NFC",uinput2.decode("utf-8"))));

prints:

type árbol: árbol
Encoding utf-8
User Input      Program Input
--------------------------------------------------
árbol          árbol   (raw value)
árbol          árbol   (unicode(value))
árbol          árbol   (value.decode('utf8'))
árbol          árbol   (normalize('NFC',value))


User Input              Program Input (Repr)
--------------------------------------------------
'\xc3\x83\xc2\xa1rbol'  u'\xe1rbol'
u'\xc3\xa1rbol'         u'\xe1rbol'     (unicode(value))
u'\xc3\xa1rbol'         u'\xe1rbol'     (value.decode('utf8'))
u'\xc3\xa1rbol'         u'\xe1rbol'     (normalize('NFC',value)))

Any ideas? I don't have this problem in other languages such as Java; it only happens in Python. I'm using Eclipse.

Thanks in advance :)

  • Best way? What are the problems with your way, and what do you want to improve? Commented Jul 17, 2013 at 17:30
  • The comparison always returns false :( Commented Jul 17, 2013 at 17:40
  • Please include the output of print repr(u"árido") and print repr(user_input). Commented Jul 17, 2013 at 17:44
  • u'\xe1rido' and '\xc3\x83\xc2\xa1rido' Commented Jul 17, 2013 at 17:54
  • I would definitely expect user_input to be a Unicode string if you have called decode("utf8") on the raw_input result. Is this output from using the raw_input result without the decode? Commented Jul 17, 2013 at 18:05

2 Answers

1

Can you check the character encoding of your terminal?

import sys

sys.stdin.encoding

If it is UTF-8, then decode("utf8") should be fine. Otherwise, you have to decode the raw_input result with the right encoding.

For example, use raw_input().decode(sys.stdin.encoding) so the decode always matches the terminal, and apply Unicode normalization as well if needed.
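
The idea above can be sketched as a small helper that decodes the raw bytes with the terminal's reported encoding and then normalizes the result. This is only a minimal sketch: decode_input is a hypothetical name, the UTF-8 fallback is an assumption, and the byte strings stand in for what raw_input() might return on differently configured terminals. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import sys
from unicodedata import normalize


def decode_input(raw_bytes, encoding=None):
    """Decode raw input bytes and normalize the text to NFC.

    decode_input is a hypothetical helper, not part of any library.
    """
    # Assumption: fall back to UTF-8 when stdin reports no encoding
    # (for example when input is piped).
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return normalize("NFC", raw_bytes.decode(encoding))


# The same word as two differently configured terminals would deliver it.
utf8_bytes = b"\xc3\xa1rido"    # "árido" encoded as UTF-8
latin1_bytes = b"\xe1rido"      # "árido" encoded as Latin-1

expected = u"\xe1rido"
print(decode_input(utf8_bytes, "utf-8") == expected)      # True
print(decode_input(latin1_bytes, "latin-1") == expected)  # True
```

Decoding Latin-1 bytes as UTF-8 (or the reverse) would either raise UnicodeDecodeError or produce different text, which is why the encoding has to match what the terminal actually sends.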

1 Comment

print sys.stdin.encoding prints utf-8
0

Your current approach isn't bad, but you should probably use unicodedata.normalize() for the comparison. The unicodedata documentation explains why this is a good idea. For example, try evaluating the following:

u'C\u0327' == u'\xc7'

Spoiler alert, this will give you False because the left side is the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA), and the right side is the single character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA).

You can use unicodedata.normalize() to handle this properly by first converting the strings to a normalized form. For example:

# -*- coding: utf-8 -*-
# __future__ imports must come before any other import
from __future__ import unicode_literals

from unicodedata import normalize

user_input = normalize('NFC', raw_input("Please, write árido: ").decode("utf8"))
if normalize('NFC', u"árido") == user_input:
    print "OK"
else:
    print "FALSE"
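
The composed versus decomposed case described above can be checked without depending on how an editor stores the characters, by writing the code points as escapes. A minimal sketch follows; same_text is a hypothetical helper name used only for illustration, and the code runs on both Python 2.7 and Python 3.

```python
from unicodedata import normalize


def same_text(a, b, form="NFC"):
    # same_text is a hypothetical helper: compare two Unicode strings
    # after bringing both to the same normalization form.
    return normalize(form, a) == normalize(form, b)


decomposed = u"C\u0327"   # U+0043 followed by U+0327 COMBINING CEDILLA
composed = u"\xc7"        # U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA

print(decomposed == composed)           # False: different code point sequences
print(same_text(decomposed, composed))  # True: equal after NFC normalization
```

Normalizing both sides inside one function avoids the case where only one operand is normalized and the comparison still fails.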

1 Comment

this does not work user_input = normalize('NFC',raw_input("Please, write árido: ").decode("utf8")) if normalize('NFC', u"árido") == user_input: print "OK" else: print "FALSE"
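
One possible reason normalization alone does not help, judging only from the repr output posted in the question: the bytes '\xc3\x83\xc2\xa1' look like UTF-8 text that was misread as Latin-1 and then encoded as UTF-8 a second time. Where that happens is an assumption (a console encoding mismatch in the IDE would fit, but the thread does not confirm it). The sketch below only reproduces the observation and is a diagnostic, not a recommended fix. It runs on both Python 2.7 and Python 3.

```python
# Bytes copied from the repr output in the question.
received = b"\xc3\x83\xc2\xa1rbol"

# A single UTF-8 decode yields mojibake, matching the question's output.
once = received.decode("utf-8")
print(once == u"\xc3\xa1rbol")   # True

# Undo the mistaken Latin-1 step, then decode as UTF-8 a second time.
repaired = once.encode("latin-1").decode("utf-8")
print(repaired == u"\xe1rbol")   # True
```

If this is what is happening, no normalization form can make the strings equal, because the decoded input contains two different characters in place of the intended one. The more robust fix would be to make the console's encoding match the encoding the program decodes with.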
