
I have a 1D numpy array of strings that I need to convert to a new dtype. The new type may be either an int, float, or datetime type. Some of the strings may be invalid for that type and cannot be converted, which leads to an error, e.g.:

>>> np.array(['10', '20', 'a'], dtype=int)
...
ValueError: invalid literal for int() with base 10: 'a'

I want to find the index of that invalid value, in this case 2. Currently I can only think of two solutions, neither of which is great:

  1. Parse the exception message with a regex to find the invalid value, then find the index of that value in the original array. This seems messy and error-prone.
  2. Parse the values in a loop in Python. This would probably be significantly slower than a numpy version. For example, here's an experiment I did:
from timeit import timeit
import numpy as np

strings = np.array(list(map(str, range(10000000))))


def python_parse(arr):
    result = []
    for i, x in enumerate(arr):
        try:
            result.append(int(x))
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result


print(timeit(lambda: np.array(strings, dtype=int), number=10))  # 35 seconds
print(timeit(lambda: python_parse(strings), number=10))         # 52 seconds

This seems like a simple and common enough operation that I expect a solution to be built into the numpy library, but I can't find one.

3 Answers


You can use np.core.defchararray.isdigit() (also exposed as np.char.isdigit()) to find which strings consist only of digits, then apply logical NOT to flag the non-digit items. Afterward you can use np.where() to get their indices:

In [20]: arr = np.array(['10', '20', 'a', '4', '%'])

In [24]: np.where(~np.core.defchararray.isdigit(arr))
Out[24]: (array([2, 4]),)

If you want to check for multiple types like float, you can write a custom function and apply it to your array with np.vectorize. Dates are a little trickier, but for a general approach you can use dateutil.parser.parse().

You can use a function like the following:

from dateutil import parser

def check_type(item):
    # True means the item is neither a number nor a date.
    try:
        float(item)
    except ValueError:
        try:
            parser.parse(item)
        except (ValueError, OverflowError):
            return True
        else:
            return False
    else:
        return False

Then:

vector_func = np.vectorize(check_type)
np.where(vector_func(arr))

Demo:

In [45]: arr = np.array(['10.34', '-20', 'a', '4', '%', '2018-5-01'])

In [46]: vector_func = np.vectorize(check_type)
    ...: np.where(vector_func(arr))
    ...: 
Out[46]: (array([2, 4]),)
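
As an aside: if, like the asker, you only care about the first failure, np.argmax on the boolean mask gives the index of its first True entry (a small sketch reusing the same digit-check idea):

```python
import numpy as np

arr = np.array(['10', '20', 'a', '4', '%'])
bad = ~np.char.isdigit(arr)  # True where the string is not all digits
if bad.any():
    # np.argmax returns the position of the first True in a boolean mask
    print('first invalid index:', np.argmax(bad))  # first invalid index: 2
```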

3 Comments

But I can't use this for floats, datetimes, or negative numbers.
@AlexHall For those cases you have to use a Python-based approach.
Thanks for your effort. In the end I managed to solve this and I'm afraid I prefer my solution, particularly because it stops at the first error, but I like your idea.

It turns out that I overestimated the difference between Python and numpy, and while the Python code I put in the question is quite slow, it can be made much faster using a preallocated array:

def python_parse(arr):
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

This raises the error with the right index and is almost as fast as simply np.array(strings, dtype=int) (which seriously surprised me).
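
For what it's worth, here is a quick sanity check of the function above, showing both the happy path and the reported index on bad input:

```python
import numpy as np

def python_parse(arr):
    # Preallocate the output; numpy raises ValueError when an element
    # cannot be converted to the target dtype on assignment.
    result = np.empty(shape=(len(arr),), dtype=int)
    for i, x in enumerate(arr):
        try:
            result[i] = x
        except ValueError:
            raise Exception(f'Failed at: {i}')
    return result

print(python_parse(np.array(['10', '20', '30'])))  # [10 20 30]
try:
    python_parse(np.array(['10', '20', 'a']))
except Exception as e:
    print(e)  # Failed at: 2
```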

5 Comments

Note that this seems to only give meaningful output on 1D arrays. Try arr = np.array(['10', '20', 'a', '4', '%', '2']).reshape(3, 2). I guess you'd have to ravel() higher dimensions and then work backwards.
I wonder how surprised you'll be once you try this: change enumerate(arr) to enumerate(arr.tolist()) and timeit again.
@PaulPanzer I am very surprised, thank you! At first I was really shocked because I thought this made numpy slower than Python, but I see np.array(strings, dtype=int) also becomes much faster when I add .tolist().
No need to panic, there is an explanation: the __getitem__ method for arrays is significantly more expensive than that for lists. Because (1) it has to be able to parse much more complex indices (2) it has to create a Python object from the "C" element stored in the array whereas the list only needs to return a reference. Now, obviously, tolist must create those objects, too, but I'd assume it's cheaper when done in bulk. (3) tolist returns native Python objects (like int as opposed to np.int64) where possible while __getitem__ doesn't. This also seems to favor list access.
@PaulPanzer still, it seems like something that could be improved in numpy. I think I'll open an issue.
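
Point (3) in the comment above is easy to verify directly (a sketch; the exact numpy scalar type depends on platform):

```python
import numpy as np

a = np.array([1, 2, 3])
print(type(a[0]))           # a numpy scalar, e.g. <class 'numpy.int64'>
print(type(a.tolist()[0]))  # <class 'int'>
```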

I would do something like:

import numpy as np

custom_type = int
l = ['10', '20', 'a']
acc = np.array([], dtype=custom_type)
for i, elem in enumerate(l):
    try:
        acc = np.concatenate((acc, np.array([elem], dtype=custom_type)))
    except ValueError:
        print("Failed to convert the type of the element in position {}".format(i))

2 Comments

I suspect this is significantly more inefficient than just iterating through a regular Python list, which the OP is already reluctant to do.
As @roganjosh said, I'm trying to avoid a Python loop.
