
I have a 1D numpy array of strings that I need to convert to a new dtype. The new type may be either an int, float, or datetime type. Some of the strings may be invalid for that type and cannot be converted, which leads to an error, e.g.:

>>> np.array(['10', '20', 'a'], dtype=int)
...
ValueError: invalid literal for int() with base 10: 'a'

I want to find the index of that invalid value, in this case 2. Currently I can only think of two solutions, neither of which is great:

  1. Parse the exception message with a regex to find the invalid value, then find the index of that value in the original array. This seems messy and error-prone.
  2. Parse the values in a loop in Python. This would probably be significantly slower than a numpy version. For example, here's an experiment I did:
from timeit import timeit
import numpy as np

strings = np.array(list(map(str, range(10000000))))


def python_parse(arr):
    result = []
    for i, x in enumerate(arr):
        try:
            result.append(int(x))
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result


print(timeit(lambda: np.array(strings, dtype=int), number=10))  # 35 seconds
print(timeit(lambda: python_parse(strings), number=10))         # 52 seconds

This seems like a simple and common enough operation that I expect a solution to be built into the numpy library, but I can't find one.

3 Answers


You can use np.core.defchararray.isdigit() (also exposed as np.char.isdigit()) to find which strings consist only of digits, then apply logical NOT to flag the non-digit items. Afterward you can use np.where() to get their indices:

In [20]: arr = np.array(['10', '20', 'a', '4', '%'])

In [24]: np.where(~np.core.defchararray.isdigit(arr))
Out[24]: (array([2, 4]),)

If you want to check for multiple types like float, you can write a custom function and apply it to your array with np.vectorize. Dates are a little trickier, but for a general approach you can use dateutil.parser.parse().

You can use a function like the following:

from dateutil import parser

def check_type(item):
    # True means the item is neither a number nor a date.
    try:
        float(item)
    except ValueError:
        try:
            parser.parse(item)
        except (ValueError, OverflowError):
            return True
        else:
            return False
    else:
        return False

Then:

vector_func = np.vectorize(check_type)
np.where(vector_func(arr))

Demo:

In [45]: arr = np.array(['10.34', '-20', 'a', '4', '%', '2018-5-01'])

In [46]: vector_func = np.vectorize(check_type)
    ...: np.where(vector_func(arr))
    ...: 
Out[46]: (array([2, 4]),)
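
As an aside: if, like the asker, you only care about the first failure, np.argmax on the boolean mask gives the index of its first True entry (a small sketch reusing the same digit-check idea):

```python
import numpy as np

arr = np.array(['10', '20', 'a', '4', '%'])
bad = ~np.char.isdigit(arr)  # True where the string is not all digits
if bad.any():
    # np.argmax returns the position of the first True in a boolean mask
    print('first invalid index:', np.argmax(bad))  # first invalid index: 2
```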

3 Comments

But I can't use this for floats, datetimes, or negative numbers.
@AlexHall For those cases you have to use a Python-based approach.
Thanks for your effort. In the end I managed to solve this and I'm afraid I prefer my solution, particularly because it stops at the first error, but I like your idea.

It turns out that I overestimated the difference between Python and numpy, and while the Python code I put in the question is quite slow, it can be made much faster using a preallocated array:

def python_parse(arr):
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

This raises the error with the right index and is almost as fast as simply np.array(strings, dtype=int) (which seriously surprised me).
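
For what it's worth, here is a quick sanity check of the function above, showing both the happy path and the reported index on bad input:

```python
import numpy as np

def python_parse(arr):
    # Preallocate the output; numpy raises ValueError when an element
    # cannot be converted to the target dtype on assignment.
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

print(python_parse(np.array(['10', '20', '30'])))  # [10 20 30]
try:
    python_parse(np.array(['10', '20', 'a']))
except Exception as e:
    print(e)  # Failed at: 2
```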

5 Comments

Note that this seems to only give meaningful output on 1D arrays. Try arr = np.array(['10', '20', 'a', '4', '%', '2']).reshape(3, 2). I guess you'd have to ravel() higher dimensions and then work backwards.
I wonder how surprised you'll be once you try this: change enumerate(arr) to enumerate(arr.tolist()) and timeit again.
@PaulPanzer I am very surprised, thank you! At first I was really shocked because I thought this made numpy slower than Python, but I see np.array(strings, dtype=int) also becomes much faster when I add .tolist().
No need to panic, there is an explanation: the __getitem__ method for arrays is significantly more expensive than that for lists. Because (1) it has to be able to parse much more complex indices (2) it has to create a Python object from the "C" element stored in the array whereas the list only needs to return a reference. Now, obviously, tolist must create those objects, too, but I'd assume it's cheaper when done in bulk. (3) tolist returns native Python objects (like int as opposed to np.int64) where possible while __getitem__ doesn't. This also seems to favor list access.
@PaulPanzer still, it seems like something that could be improved in numpy. I think I'll open an issue.
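
Point (3) in the comment above is easy to verify directly (a sketch; the exact numpy scalar type depends on platform):

```python
import numpy as np

a = np.array([1, 2, 3])
print(type(a[0]))           # a numpy scalar, e.g. <class 'numpy.int64'>
print(type(a.tolist()[0]))  # <class 'int'>
```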

I would do something like:

import numpy as np

custom_type = int
l = ['10', '20', 'a']
acc = np.array([], dtype=custom_type)
for i, elem in enumerate(l):
    try:
        acc = np.concatenate((acc, np.array([elem], dtype=custom_type)))
    except ValueError:
        print("Failed to convert the type of the element in position {}".format(i))

2 Comments

I suspect this is significantly more inefficient than just iterating through a regular Python list, which the OP is already reluctant to do.
As @roganjosh said, I'm trying to avoid a Python loop.
