
I want to read data from a (very large, whitespace-separated, two-column) text file into a Python dictionary. I tried to do this with a for loop, but that was too slow. Much faster is reading it with numpy's loadtxt into a structured array and then converting it to a dictionary:

data = np.loadtxt('filename.txt', dtype=[('field1', 'a20'), ('field2', int)], ndmin=1)
result = dict(data)

But this is surely not the best way? Any advice?
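For reference, the plain-Python version I am replacing looks roughly like this, written as a dict comprehension (the function name and file path are made up; it also gives str keys rather than the bytes keys that dict(data) produces):

```python
def read_pairs(path):
    # Build {first_column: int(second_column)} from a whitespace-separated,
    # two-column text file, one line at a time.
    with open(path) as fh:
        return {key: int(value)
                for key, value in (line.split() for line in fh)}
```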

The main reason I need something else, is that the following does not work:

data[0]['field1'].split(sep='-')

It leads to the error message:

TypeError: Type str doesn't support the buffer API

If the split() method exists, why can't I use it? Should I use a different dtype? Or is there a different (fast) way to read the text file? Is there anything else I am missing?

Versions: Python 3.3.2, NumPy 1.7.1

Edit: changed data['field1'].split(sep='-') to data[0]['field1'].split(sep='-')

  • One of these days I am going to have to try and understand unicode... By the way, the right thing to do is to write the answer as a proper answer and accept it, not to include it within your question. Commented Jul 30, 2013 at 19:45

2 Answers


The standard library split returns a variable number of arguments, depending on how many times the separator is found in the string, and is therefore not very suitable for array operations. My char numpy arrays (I'm running 1.7) do not have a split method, by the way.

You do have np.core.defchararray.partition, which is similar but poses no problems for vectorization, as well as all the other string operations:

>>> a = np.array(['a - b', 'c - d', 'e - f'], dtype=np.string_)
>>> a
array(['a - b', 'c - d', 'e - f'], 
      dtype='|S5')
>>> np.core.defchararray.partition(a, '-')
array([['a ', '-', ' b'],
       ['c ', '-', ' d'],
       ['e ', '-', ' f']], 
      dtype='|S2')
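On Python 3 the '|S' array holds bytes, so the separator must be bytes as well. A minimal sketch of the same call in that setting, using np.char, the public alias for np.core.defchararray:

```python
import numpy as np

# Same example as above, but with bytes data and a bytes separator,
# which is what Python 3 requires for '|S' (bytes) arrays.
a = np.array([b'a - b', b'c - d', b'e - f'])
out = np.char.partition(a, b'-')  # one (before, sep, after) row per element
```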

7 Comments

Thank you for your answer Jaime. What I meant was data[0]['field1'].split(sep='-'), not data['field1'].split(sep='-'), although the latter would be brilliant if it existed and was fast. I edited my post above accordingly.
With my made-up example I can run a[0].split('-'), which should be equivalent to data['field1'][0].split(sep='-'), so reversing the order of your indices. How many -s are you expecting in your strings?
With your example I get:

>>> np.core.defchararray.partition(a, '-')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/site-packages/numpy/core/defchararray.py", line 1090, in partition
    _vec_string(a, object_, 'partition', (sep,)))
TypeError: expected bytes, bytearray or buffer compatible object
Then go with partition, and split all your strings with a single call.
actually, just b'a-b'.split(b'-') is OK.

Because type(data[0]['field1']) gives <class 'numpy.bytes_'>, the split() method does not work when it is given a "normal" (str) string as argument (is this a bug?).

the way I solved it: data[0]['field1'].split(sep=b'-') (the key to this is to put the b in front of '-')

And of course Jaime's suggestion to use the following was very helpful: np.core.defchararray.partition(a, '-') but also in this case b'-' is needed to make it work.

In fact, a related question was answered here: Type str doesn't support the buffer API although at first sight I did not realise this was the same issue.
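An alternative sketch (the sample data here is made up): decoding the bytes field to str first allows a plain str separator, at a small per-element conversion cost.

```python
import numpy as np

# A one-row structured array shaped like the loadtxt result in the question.
data = np.array([(b'a-b', 1)], dtype=[('field1', 'S20'), ('field2', int)])

# Decode bytes -> str, then split with an ordinary str separator.
parts = data[0]['field1'].decode().split('-')
```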

