I need to work with text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8 and the source code is UTF-8 too. The problem appears when splitting the text into words that contain characters like š, č, ť, á, ... I tried encoding and decoding and searched the web, but I don't know what to do about it. I looked at the filesystem encoding, which is mbcs, and the default encoding is utf-8. Can somebody help me? The code below is the first version.

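    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    # Plain open() returns byte strings in Python 2, so the text stays UTF-8 encoded
    f = open("text.txt", "r+")
    text = f.read()
    f.close()

    # Split into sentences, then split the first sentence into words
    sentences = re.split(r"[.!?]\s", text)
    words = re.split(r"\s", sentences[0])

    print sentences[0]
    print words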

and the result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

    ['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']

When I use:

    f = codecs.open("text.txt", "r+", encoding="utf-8")

the result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

    [u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']

and I need output like:

    ['Nexus', '5', 'patrí', 'v', 'súčastnosti',....]
  • You have unicode strings in a list. If you don't want to print representations, don't print the list container but each element separately. Commented Nov 24, 2013 at 13:56
  • OK, now I see. But when I compare each element of the list with a dictionary to find a match, will it work fine? Commented Nov 24, 2013 at 14:12
  • You'd use unicode literals to test against, but yes. Commented Nov 24, 2013 at 14:31

1 Answer

The encoding handling is correct; u'patr\xed' is just the representation (repr) of a unicode string in Python, which is what you get when you print a list. Try print u'patr\xed' in a shell to see for yourself.
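To make the difference between a list's repr and the printed text concrete, here is a minimal sketch. It assumes sample bytes shortened from the question's output rather than the real text.txt, and it uses `__future__` imports so the same code should run on Python 2.7 and Python 3. The "utf-8-sig" codec is a standard codec that also drops the leading BOM (the `\ufeff` seen in front of "Nexus"); using it here is a suggestion, not something the question's code already does.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Bytes as they sit in the file: a UTF-8 BOM followed by UTF-8 text
# (sample data, shortened from the question's output).
raw = b"\xef\xbb\xbfNexus 5 patr\xc3\xad v su\xc4\x8dasnosti"

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
text = raw.decode("utf-8-sig")
words = text.split()

# Printing the list shows the repr of each element (escapes on Python 2).
print(words)

# Printing each element, or a joined string, shows the readable text.
for word in words:
    print(word)
joined = " ".join(words)
print(joined)

# A decoded word compares equal to a unicode literal with the same characters.
same = (words[2] == "patrí")
```

When reading from a file, the same codec name can be passed as the encoding argument, for example to codecs.open.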

Having said that, as you seem to want to use it as a dictionary, it might be useful to use the unidecode module to normalize the unicode strings to ASCII.

2 Comments

I want to compare it with a dictionary to find a match. How do I install unidecode on Windows? There is only a Linux package.
I think the best way is to install pip and then run the command pip install unidecode. Unidecode is a good fit for what you want: use it to normalize the dictionary words to ASCII, then do the same to the word you are looking for and check whether it is in your dictionary.
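The lookup described in these comments could be sketched roughly as follows. The word lists and the helper names `normalize` and `_to_ascii` are made up for illustration. If the third-party unidecode package is not installed, the sketch falls back to a standard-library approximation (NFKD decomposition with combining marks removed), which is a swapped-in technique and not equivalent to unidecode for every character.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata

try:
    # Third-party package: pip install unidecode
    from unidecode import unidecode as _to_ascii
except ImportError:
    def _to_ascii(text):
        # Stdlib approximation: decompose accented letters, drop the accents.
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(word):
    # Hypothetical helper: accent-free, lower-case key used for matching.
    return _to_ascii(word).lower()


# Hypothetical dictionary and input words, for illustration only.
dictionary_words = ["patrí", "súčasnosti", "najlepšie", "smartfóny"]
lookup = set(normalize(w) for w in dictionary_words)

words = ["Nexus", "5", "patrí", "sučasnosti"]
matches = [w for w in words if normalize(w) in lookup]
```

Note the trade-off: after normalization, "sučasnosti" and "súčasnosti" map to the same key, so words that differ only in accents are treated as the same word. If accents must be significant, compare the unicode strings directly instead.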
