
I'm trying to export a numpy array that contains unicode elements to a text file.

So far I've got the following to work, but only because it doesn't contain any unicode characters:

import numpy as np

array_unicode=np.array([u'maca',u'banana',u'morango'])

with open('array_unicode.txt','wb') as f:
    np.savetxt(f,array_unicode,fmt='%s')

If I change the 'c' in 'maca' to 'ç', I get an error:

import numpy as np

array_unicode=np.array([u'maça',u'banana',u'morango'])

with open('array_unicode.txt','wb') as f:
    np.savetxt(f,array_unicode,fmt='%s')

Traceback:

Traceback (most recent call last):
  File "<ipython-input-48-24ff7992bd4c>", line 8, in <module>
    np.savetxt(f,array_unicode,fmt='%s')
  File "C:\Anaconda2\lib\site-packages\numpy\lib\npyio.py", line 1158, in savetxt
    fh.write(asbytes(format % tuple(row) + newline))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)

How can I set savetxt from numpy to write unicode characters?

  • How would you write these strings if they were a list or longer text? Would you still open the file in wb mode? Commented Apr 19, 2016 at 7:11
  • The 'wb' mode is just because plain 'w' throws me an error. Commented Apr 19, 2016 at 12:31

2 Answers


In Python 3 (ipython-qt terminal) I can do:

In [12]: b=[u'maça', u'banana',u'morango']

In [13]: np.savetxt('test.txt',b,fmt='%s')

In [14]: cat test.txt
ma�a
banana
morango

In [15]: with open('test1.txt','w') as f:
    ...:     for l in b:
    ...:         f.write('%s\n'%l)
    ...:         

In [16]: cat test1.txt
maça
banana
morango

savetxt in both Py2 and 3 insists on saving in 'wb', byte mode. Your error line has that asbytes function.
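A portable sketch (my addition, not part of the original answer): io.open exists in both Python 2 and 3 and accepts an encoding argument, so it sidesteps savetxt's byte mode entirely. The filename test_io.txt is just an example:

```python
# -*- coding: utf-8 -*-
import io

import numpy as np

b = np.array([u'maça', u'banana', u'morango'])

# io.open takes an encoding in both Py2 and Py3, so each
# unicode element is encoded on the way out to the file.
with io.open('test_io.txt', 'w', encoding='utf-8') as f:
    for item in b:
        f.write(u'%s\n' % item)
```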

In my example b is a list, but that doesn't matter.

In [17]: c=np.array(['maça', 'banana','morango'])

In [18]: c
Out[18]: 
array(['maça', 'banana', 'morango'], 
      dtype='<U7') 

writes the same. In Py3 the default string type is unicode, so the u prefix isn't needed, but it's fine to include.

In Python 2 I get your error with a plain write:

>>> b=[u'maça',u'banana',u'morango']
>>> with open('test.txt','w') as f:
...    for l in b:
...        f.write('%s\n'%l)
... 
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)

adding the encode gives a nice output:

>>> b=[u'maça', u'banana',u'morango']
>>> with open('test.txt','w') as f:
...    for l in b:
...        f.write('%s\n'%l.encode('utf-8'))
0729:~/mypy$ cat test.txt
maça
banana
morango

encode is a string method, so has to be applied to the individual elements of an array (or list).

Back on the py3 side, if I use the encode I get:

In [26]: c1=np.array([l.encode('utf-8') for l in b])

In [27]: c1
Out[27]: 
array([b'ma\xc3\xa7a', b'banana', b'morango'], 
      dtype='|S7')

In [28]: np.savetxt('test.txt',c1,fmt='%s')

In [29]: cat test.txt
b'ma\xc3\xa7a'
b'banana'
b'morango'

but with the correct format, the plain write works:

In [33]: with open('test1.txt','wb') as f:
    ...:     for l in c1:
    ...:         f.write(b'%s\n'%l)
    ...:         

In [34]: cat test1.txt
maça
banana
morango

Such are the joys of mixing unicode and the 2 Python generations.

In case it helps, here's the code for the np.lib.npyio.asbytes function that np.savetxt uses (along with the wb file mode):

def asbytes(s):    # py3?
    if isinstance(s, bytes):
        return s
    return str(s).encode('latin1')

(note the encoding is fixed as 'latin1').

The np.char library applies a variety of string methods to the elements of a numpy array, so the np.array([x.encode...]) can be expressed as:

In [50]: np.char.encode(b,'utf-8')
Out[50]: 
array([b'ma\xc3\xa7a', b'banana', b'morango'], 
      dtype='|S7')

This can be convenient, though past testing indicates that it is not a time saver. It still has to apply the Python method to each element.
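One more note from the editor, as a version-dependent aside: NumPy 1.14 added an encoding keyword to savetxt itself, which makes the manual per-element encode unnecessary on a new enough NumPy:

```python
import numpy as np

b = [u'maça', u'banana', u'morango']

# savetxt grew an `encoding` keyword in NumPy 1.14; with it the
# unicode elements are encoded by savetxt instead of via asbytes.
np.savetxt('test_enc.txt', b, fmt='%s', encoding='utf-8')
```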



There are many ways you can accomplish this; however, numpy arrays need to be set up in very specific ways (usually using a dtype) to allow unicode characters in these circumstances.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy as np

dt = np.dtype((str, 10))  # fixed-width 10-character string dtype
array_unicode=np.array(['maça','banana','morangou'], dtype=dt)

with open('array_unicode.txt','wb') as f:
    np.savetxt(f, array_unicode, fmt='%s')

You'll need to be aware of the string length in your array as well as the length you decide to set up within the dtype. If it's too short you'll truncate your data; if it's too long it's wasteful. I suggest you read the NumPy data type objects (dtype) documentation, as there are many other ways you might consider setting up the array depending on the data format.
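The truncation warning is easy to demonstrate; in this sketch (my addition) the 'U4' dtype is deliberately too short for 'morangou':

```python
import numpy as np

# A fixed 4-character unicode dtype silently truncates longer strings.
short = np.array([u'maça', u'banana', u'morangou'], dtype='U4')

# With no dtype given, numpy sizes the itemsize to the longest element.
full = np.array([u'maça', u'banana', u'morangou'])

print(short)       # longer words are cut to 4 characters
print(full.dtype)  # wide enough for the 8-character 'morangou'
```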

http://docs.scipy.org/doc/numpy-1.9.3/reference/arrays.dtypes.html

Here's an alternative function that could do the conversion to unicode before saving:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import numpy as np

array_unicode=np.array([u'maça',u'banana',u'morangou'])

def uniArray(array_unicode):
    items = [x.encode('utf-8') for x in array_unicode]
    array_unicode = np.array([items]) # remove the brackets for line breaks
    return array_unicode

with open('array_unicode.txt','wb') as f:
    np.savetxt(f, uniArray(array_unicode), fmt='%s')

Basically your np.savetxt is handed the output of uniArray after a quick conversion. There might be better ways than this, although it's been a while since I've used numpy; it's always seemed somewhat touchy with encodings.

Comments

  • One of the main reasons to use numpy is to avoid loops. Isn't there a solution using a numpy method or function?
  • I'm not clear on why you're even using the u in front of your strings. Is there some compelling reason for that? If you declare an encoding you wouldn't need it. Also, there are other ways to do this, and you should read the documentation about dtypes and how to use them.
  • savetxt uses a loop, doing a file.write(fmt%tuple(row)) for each row of your array. And any encoding/decoding will be done iteratively with string methods. Using arrays here won't save you time.
  • I'm using 'u' in front of my strings to illustrate my problem, since I want words like 'maça' in my final array. Nice to know that savetxt uses a loop, but I believe it's optimized for arrays, right?
  • I believe my question is about unicode and numpy. Thank you for your answer, but I was looking for an integrated way to work with numpy and savetxt, if possible.
