I have a 1D numpy array of strings that I need to convert to a new dtype. The target dtype may be an int, float, or datetime type. Some of the strings may be invalid for that type, in which case the conversion raises an error, e.g.:
>>> np.array(['10', '20', 'a'], dtype=int)
...
ValueError: invalid literal for int() with base 10: 'a'
I want to find the index of that invalid value, in this case 2. Currently I can only think of two solutions, neither of which is great:
- Parse the exception message with a regex to find the invalid value, then find the index of that value in the original array. This seems messy and error-prone.
- Parse the values in a loop in Python. This would probably be significantly slower than a numpy version. For example, here's an experiment I did:
from timeit import timeit
import numpy as np

strings = np.array(list(map(str, range(10000000))))

def python_parse(arr):
    result = []
    for i, x in enumerate(arr):
        try:
            result.append(int(x))
        except ValueError:
            # Report the position of the first value that fails to parse
            raise Exception(f'Failed at: {i}')
    return result

print(timeit(lambda: np.array(strings, dtype=int), number=10)) # 35 seconds
print(timeit(lambda: python_parse(strings), number=10)) # 52 seconds
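
A variation on the second option, as a rough sketch only: run the vectorized conversion first and fall back to an element-by-element scan only when it fails, so the slow loop is paid for only on bad input. The helper name `convert_or_locate` is hypothetical (it is not part of numpy), and the sketch assumes that a failed conversion raises `ValueError`, as in the example above.

```python
import numpy as np

def convert_or_locate(arr, dtype):
    """Convert arr to dtype; on failure, report the index of the first bad value.

    Hypothetical helper for illustration; not a numpy function.
    """
    try:
        # Fast path: let numpy convert the whole array at once.
        return arr.astype(dtype)
    except ValueError:
        # Slow path: only reached after the fast path has already failed.
        for i in range(len(arr)):
            try:
                # Convert a one-element slice so the same numpy parsing rules apply.
                arr[i:i + 1].astype(dtype)
            except ValueError as exc:
                raise ValueError(f'Failed at index {i}: {arr[i]!r}') from exc
        # No single element failed on its own; re-raise the original error.
        raise
```

Calling `convert_or_locate(np.array(['10', '20', 'a']), int)` should raise a `ValueError` whose message names index 2. The slow path still scans in Python, so this only helps when invalid input is the exception rather than the rule.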
This seems like a simple and common enough operation that I expect a solution to be built into the numpy library, but I can't find one.