Python - Decode utf-8 list in lists (decode entire list objects)

Question

Assume that I have a list that contains other lists, some of them nested, for example:

l = [['a'],['a','b'],['c'],['d',['a','b'],'f']]

With this:

l = [x.decode('UTF8') for x in l]

I will probably get the error: 'list' object has no attribute 'decode'

(The list "l" is created from tokenized text in which every word has been wrapped in its own list. I have tried many solutions to get past the decode problem, but I still can't print the non-ASCII characters.)

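from itertools import groupby
from nltk.tokenize import word_tokenize  # assumption: word_tokenize is NLTK's tokenizer

with open(path, "r") as myfile:
    text = myfile.read()

text = word_tokenize(text)

# wrap every token in its own list
d = [[item] if not isinstance(item, list) else item for item in text]

# merge consecutive capitalized tokens into one list, keep the rest as they are
arr = sum(([[x[0] for x in g]] if k else list(g)
     for k, g in groupby(d, key=lambda x: x[0][0].isupper())),
    [])

# fails: every x here is a list, not a string
arr = [x.decode('UTF8') for x in arr]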

INPUT (my text file):

Çanakkale çok güzel bir şehirdir. Çok beğendik.

OUTPUT :

[[u'\xc7anakkale'], [u'\xe7ok'], [u'g\xfczel'], [u'bir'], [u'\u015fehirdir'], [u'.'], [u'\xe7ok'], [u'be\u011fendik'], [u'.']]

My desired output is a list, but with the words displayed exactly as they appear in my input.

I think I have a lot of non-ASCII characters, but I want to print them with the exact structure (the words contain ü ğ ş ı ç)
– Arda Nalbant, Commented Apr 30, 2016 at 11:55
Please provide a minimal reproducible example and the desired output
– Alastair McCormack, Commented Apr 30, 2016 at 16:02

Alastair McCormack · Accepted Answer · 2016-04-30 16:00:05Z

Firstly, the problem you think you have comes from printing the whole list (you haven't included that part in your question, so I've had to guess): Python prints the safe representation (repr) of the data. For you this means it indicates that you have Unicode strings (hence the u'') and shows the hex value of the Unicode code point of each non-ASCII character.

If you were to print an individual part of the list then you'd get what you expect.

I.e.

>>> print arr[0][0]
Çanakkale
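
To illustrate the point about representation: the escaped form shown in the list output and the printed word are the same string, so comparisons and lookups should behave identically. A minimal sketch, assuming a hypothetical `locations` set standing in for the values you compare against (the `__future__` imports are only there so the snippet behaves the same on Python 2 and 3):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

escaped = "\xc7anakkale"   # the word as the list repr displays it
literal = "Çanakkale"      # the word as it looks when printed

# Both spellings denote the same sequence of code points.
same = (escaped == literal)

# Hypothetical location set, standing in for database values.
locations = {"Çanakkale", "İstanbul"}
found = escaped in locations

print(same, found)
```

This only holds when both sides are Unicode strings in the same normalization form; comparing a Unicode string with an undecoded byte string is a different situation.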

If you want to print all the values then you'll need a for loop:

for x in arr:
    for y in x:
        print y
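
If the goal is output that keeps the list-like shape while showing the characters themselves, one option is to build the display string yourself instead of relying on the list's repr. A rough sketch; `readable` is a hypothetical helper name, not a library function, and it makes no attempt to escape quotes inside words:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def readable(obj):
    """Format nested lists of text as a list-like string without escapes."""
    if isinstance(obj, list):
        # Recurse into sub-lists and join their formatted items.
        return "[" + ", ".join(readable(x) for x in obj) + "]"
    # Leaf: wrap the text itself in quotes, leaving its characters as-is.
    return "'" + obj + "'"


arr = [["Çanakkale"], ["çok"], ["güzel"]]
formatted = readable(arr)
print(formatted)
```

The result is a string meant for display only; the underlying list is unchanged and still the right thing to use for comparisons.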

You're also introducing unnecessary complexity by manually decoding the data deep in your code - instead you should decode the data on input.

It appears that you're using Python 2.x (by the u'' prefixes), so use the io module to decode the text data as you read it:

import io
with io.open(path, "r", encoding="utf-8") as myfile:
    text=myfile.read()

Now you can remove the arr = [x.decode('UTF8') for x in arr] line.
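
As a self-contained check of decoding on input, the sketch below writes a throwaway UTF-8 file and reads it back with `io.open`. The temporary file and the sample sentence are assumptions made for the example; in your code `path` would point at your own text file.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import tempfile

sample = "Çanakkale çok güzel bir şehirdir."

# Create a temporary file so the example does not depend on an existing one.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    # io.open decodes while reading, so text is already a Unicode string.
    with io.open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

words = text.split()
print(words[0])
```

Because `text` is decoded at this point, nothing further down needs a `.decode()` call.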

niemmi · Accepted Answer · 2016-04-30 14:19:41Z

You could do the decoding with a simple recursive function:

l1 = [['a'],['a','b'],['c'],['d',['a','b'],'f']]

def decode(l):
    if isinstance(l, list):
        return [decode(x) for x in l]
    else:
        return l.decode('utf-8')

decode(l1) # [[u'a'], [u'a', u'b'], [u'c'], [u'd', [u'a', u'b'], u'f']]
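
If the nested list might mix byte strings with text that is already decoded, a slightly more defensive variant could decode only the byte strings and pass everything else through. This is a sketch under that assumption; `decode_nested` is a hypothetical name, and the `__future__` import only keeps literals consistent between Python 2 and 3:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def decode_nested(obj, encoding="utf-8"):
    """Recursively decode byte strings inside nested lists."""
    if isinstance(obj, list):
        return [decode_nested(x, encoding) for x in obj]
    if isinstance(obj, bytes):
        # Only raw bytes need decoding.
        return obj.decode(encoding)
    # Already-decoded text (or any other object) is returned unchanged.
    return obj


mixed = [[b"\xc3\x87anakkale"], [b"a", [b"b"]], "c"]
decoded = decode_nested(mixed)
```

Here `b"\xc3\x87"` is the UTF-8 encoding of Ç, so the first word decodes to Çanakkale.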

edited Apr 30, 2016 at 14:19

answered Apr 30, 2016 at 11:36

5 Comments

Arda Nalbant Over a year ago

Thanks. I tried this, but did I phrase the question the wrong way? After decode('utf-8') I thought I would get the words containing (ü ğ ş ı ç).

niemmi Over a year ago

It would help if you would add problematic input and expected output to the question itself.

Natecat Over a year ago

I'm pretty sure the encoding is utf-8 not UTF8

niemmi Over a year ago

@Natecat Thanks, changed the answer accordingly

Arda Nalbant Over a year ago

When I try to print l, the words in the list are still not shown with their non-ASCII characters. Also, my aim is to compare the list items against my database. I have a set of locations and Çanakkale is a location, but if my word is '\xc7anakkale' it won't match and will give me an error.
