Wrong text encoding in Python

Question

I need to work with text, for example comparing words against a dictionary, and I have a problem with encoding. The txt file is UTF-8 and the source code is UTF-8 too. The problem appears when I split the text into words containing characters like š, č, ť, á, ... I tried encoding and decoding and searched the web, but I don't know what to do about it. I checked the filesystem encoding, which is mbcs, and the default encoding, which is utf-8. Can somebody help me? The code below is the first version.

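    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re  # needed for re.split below

    # In Python 2, plain open() returns undecoded byte strings
    f = open("text.txt", "r+")

    text = f.read()

    # Split into sentences at ., ! or ? followed by whitespace
    sentences = re.split(r"[.!?]\s", text)

    # Split the first sentence into words at whitespace
    words = re.split(r"\s", sentences[0])

    print sentences[0]
    print words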

and result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

    ['\xef\xbb\xbfNexus', '5', 'patr\xc3\xad', 'su\xc4\x8dasnosti', 'medzi', 'najlep\xc5\xa1ie', 'smartf\xc3\xb3ny']

When I use:

    import codecs
    f = codecs.open("text.txt", "r+", encoding="utf-8")

result is:

    Nexus 5 patrí v sučasnosti medzi a najlepšie aj smartfóny

    [u'\ufeffNexus', u'5', u'patr\xed', u'su\u010dasnosti', u'medzi', u'najlep\u0161ie', u'smartf\xf3ny']

and I need output like:

    ['Nexus', '5', 'patrí', 'v', 'súčastnosti',....]

You have unicode strings in a list. If you don't want to print representations, don't print the list container; print each element separately.
– Martijn Pieters, Commented Nov 24, 2013 at 13:56
OK, now I see. But when I want to compare each element of the list with a dictionary to find a match, will it work fine?
– TheBP, Commented Nov 24, 2013 at 14:12
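
On the follow-up question: the comparison should work as long as both the words and the dictionary entries are unicode strings. Below is a minimal sketch under a few assumptions. The sample sentence, the temporary file, and the set named known_words are made up for illustration. The file is assumed to begin with a UTF-8 byte order mark, as the \ufeff in the output above suggests, so the sketch reads it with the utf-8-sig codec, which drops a leading BOM. The __future__ imports are only there so the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import re
import tempfile

# Hypothetical sample data: a UTF-8 file that starts with a BOM.
sample = "Nexus 5 patrí medzi najlepšie smartfóny. Druhá veta."
path = os.path.join(tempfile.mkdtemp(), "text.txt")
with io.open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + sample.encode("utf-8"))

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()

sentences = re.split(r"[.!?]\s", text)
words = sentences[0].split()  # split on any run of whitespace

# Hypothetical dictionary; its entries are unicode strings as well.
known_words = {"patrí", "medzi", "smartfóny"}
matches = [w for w in words if w in known_words]

print(len(matches), "of", len(words), "words found")
```

With plain utf-8 instead of utf-8-sig, the first word would be u'\ufeffNexus' and would not match a dictionary entry 'Nexus'.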

Elias Dorneles · Accepted Answer · 2013-11-24 13:57:32Z

The encoding handling is correct; u'patr\xed' is just the representation (repr) of a unicode string in Python 2. Try print u'patr\xed' in a shell to see for yourself.
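
The difference between the two displays can be checked directly. This is a small sketch with illustrative variable names: printing a list shows the repr() of each element, which Python 2 writes with escapes, while printing an element shows the text itself. Both spellings denote the same string.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

word = "patr\xed"          # escaped spelling of the same string as "patrí"
same = word == "patrí"     # \xed is just U+00ED (í) written as an escape

words = ["patr\xed", "su\u010dasnosti", "najlep\u0161ie"]

print(repr(word))          # debugging form; Python 2 shows u'patr\xed'
print(word)                # the actual text
for w in words:            # print elements one by one, not the list itself
    print(w)

# To show a whole list readably, join the elements into one string first.
joined = ", ".join(words)
print(joined)
```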

Having said that, since you seem to want to look the words up in a dictionary, it might be useful to use the unidecode module to transliterate the unicode strings to ASCII.


2 Comments

TheBP Over a year ago

I want to compare it with a dictionary to find a match. How do I install unidecode on Windows? There is only a Linux package.

Elias Dorneles Over a year ago

I think the best way is to install pip and then run the command pip install unidecode. Unidecode is nice for exactly what you want: you can use it to normalize the dictionary words to ASCII, then do the same to the word you are looking for and check whether it is in your dictionary.
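
The lookup described in this comment can be sketched roughly as follows. Because unidecode is a third-party package, this sketch swaps in a standard-library approximation: it decomposes each character with unicodedata.normalize and drops the combining marks. That is not equivalent to unidecode, which also transliterates non-Latin scripts, but it should cover Slovak diacritics. The helper names strip_diacritics and in_dictionary and the sample word list are hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import unicodedata


def strip_diacritics(word):
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical dictionary, normalized once up front.
sample_words = ["patrí", "najlepšie", "smartfóny"]
dictionary = {strip_diacritics(w.lower()) for w in sample_words}


def in_dictionary(word):
    # Normalize the lookup word the same way as the dictionary entries.
    return strip_diacritics(word.lower()) in dictionary


print(in_dictionary("Patri"))    # compared without diacritics or case
print(in_dictionary("telefón"))  # not in the sample dictionary
```

If unidecode is installed, the body of strip_diacritics could instead return unidecode.unidecode(word). Note that normalizing both sides makes words that differ only in diacritics compare equal, which may or may not be what a dictionary lookup needs.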
