Parsing the string-representation of a numpy array

Question

If I only have the string-representation of a numpy.array:

>>> import numpy as np
>>> arr = np.random.randint(0, 10, (4, 4))
>>> print(arr)  # this one!
[[9 4 7 3]
 [1 6 4 2]
 [6 7 6 0]
 [0 5 6 7]]

How can I convert this back to a NumPy array? Inserting the , manually isn't complicated, but I'm looking for a programmatic approach.

A simple regex that replaces each run of whitespace with , works for single-digit integers:

>>> import re
>>> sub = re.sub(r'\s+', ',', """[[8 6 2 4 0 2]
...  [3 5 8 4 5 6]
...  [4 6 3 3 0 3]]
... """)
>>> sub
'[[8,6,2,4,0,2],[3,5,8,4,5,6],[4,6,3,3,0,3]],'  # the trailing "," is a bit annoying

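The result can be converted to an almost identical array (the dtype may be lost, but that's okay). Because of the trailing comma, ast.literal_eval returns a one-element tuple, hence the [0]: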

>>> import ast
>>> np.array(ast.literal_eval(sub)[0])
array([[8, 6, 2, 4, 0, 2],
       [3, 5, 8, 4, 5, 6],
       [4, 6, 3, 3, 0, 3]])

But it fails for multi-digit integers and floats:

>>> re.sub(r'\s+', ',', """[[ 0.  1.  6.  9.  1.  4.]
... [ 4.  8.  2.  3.  6.  1.]]
... """)
'[[,0.,1.,6.,9.,1.,4.],[,4.,8.,2.,3.,6.,1.]],'

because NumPy pads these elements with whitespace, which turns into an additional , right after each opening bracket.

A solution doesn't necessarily need to be based on regex; any other approach that works for unabridged (not shortened with ...) bool/int/float/complex arrays with 1-4 dimensions would be okay.

Considering that NumPy abridges the string representation (even the repr) of large arrays by default, doing this is probably a bad idea.
– user2357112, Commented May 9, 2017 at 20:35
("Large" being anything with strictly more than 1000 elements, unless you configure a different threshold.)
– user2357112, Commented May 9, 2017 at 20:38
@user2357112 For the purpose of this question it's safe to assume that the string (of the array) is unabridged.
– MSeifert, Commented May 9, 2017 at 20:39
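
As a side note on the abridging mentioned in the comments above: if you control the code that produces the string, you can ask NumPy for the full text instead of the shortened one. This is only a sketch; `arr` is a made-up example array, and it assumes the default print options (threshold of 1000 elements) are in effect.

```python
import sys

import numpy as np

# Hypothetical example array with more elements than the default threshold
arr = np.arange(2000)

# With default print options the text is shortened with "..."
abridged = str(arr)

# Raising the threshold for a single call returns every element
full = np.array2string(arr, threshold=sys.maxsize)

# The same setting can be applied temporarily to str()/print()
with np.printoptions(threshold=sys.maxsize):
    full_via_str = str(arr)
```

Neither variant helps when all you have is an already abridged string; the elided elements are simply not in the text.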

There · Accepted Answer · 2023-08-27 23:07:05Z

3

Here's a pretty manual solution:

import re
import numpy

def parse_array_str(array_string):
    tokens = re.findall(r'''             # Find all...
                            \[         | # opening brackets,
                            \]         | # closing brackets, or
                            [^\[\]\s]+   # sequences of other non-whitespace characters''',
                        array_string,
                        flags = re.VERBOSE)
    tokens = iter(tokens)

    # Chomp first [, handle case where it's not a [
    first_token = next(tokens)
    if first_token != '[':
        # Input must represent a scalar
        if next(tokens, None) is not None:
            raise ValueError("Can't parse input.")
        return float(first_token)  # or int(token), but not bool(token) for bools

    list_form = []
    stack = [list_form]

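    for token in tokens:
        if not stack:
            # The outermost list was already closed, so anything left over is unexpected
            raise ValueError("Can't parse input - unexpected text after the final ].")
        if token == '[':
            # enter a new list
            stack.append([])
            stack[-2].append(stack[-1])
        elif token == ']':
            # close a list
            stack.pop()
        else:
            stack[-1].append(float(token))  # or int(token), but not bool(token) for bools

    if stack:
        raise ValueError("Can't parse input - it might be missing text at the end.")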

    return numpy.array(list_form)

Or a less manual solution, based on detecting where to insert commas:

import re
import numpy
import ast

pattern = r'''# Match (mandatory) whitespace between...
              (?<=\]) # ] and
              \s+
              (?= \[) # [, or
              |
              (?<=[^\[\]\s]) 
              \s+
              (?= [^\[\]\s]) # two non-bracket non-whitespace characters
           '''

# Replace such whitespace with a comma.
# array_string is assumed to hold the text produced by str(arr) or print(arr).
fixed_string = re.sub(pattern, ',', array_string, flags=re.VERBOSE)

output_array = numpy.array(ast.literal_eval(fixed_string))
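
To see what the comma insertion does to the float example from the question, here is a small sketch. It uses a compact, non-verbose spelling of the same pattern; the names `PATTERN` and `insert_commas` are made up for this illustration, and the result is left as nested lists so the snippet runs without NumPy.

```python
import ast
import re

# Compact form of the pattern above: whitespace between "]" and "[",
# or between two characters that are neither brackets nor whitespace
PATTERN = r'(?<=\])\s+(?=\[)|(?<=[^\[\]\s])\s+(?=[^\[\]\s])'


def insert_commas(array_string):
    """Return array_string with separating whitespace replaced by commas."""
    return re.sub(PATTERN, ',', array_string)


text = "[[ 0.  1.  6.]\n [ 4.  8.  2.]]"
fixed = insert_commas(text)
print(fixed)   # [[ 0.,1.,6.],[ 4.,8.,2.]]

nested = ast.literal_eval(fixed)
print(nested)  # [[0.0, 1.0, 6.0], [4.0, 8.0, 2.0]]
```

The padding after an opening bracket is left alone because neither lookbehind matches there, and Python ignores that whitespace inside a list literal. Passing `nested` to numpy.array gives the array back.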

edited Aug 27, 2023 at 23:07

There

answered May 9, 2017 at 20:59

user2357112

3 Comments

MSeifert Over a year ago

Just... wow! That's a clever function. With a factory argument for the dtype (in the case of bools, a translation function) this correctly parses everything I throw at it!

MSeifert Over a year ago

Just one question, why did you use re.findall instead of re.finditer?

user2357112 Over a year ago

@MSeifert: I didn't want to have to call group() on all the match objects myself.
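
The factory-argument idea from the comments above could look roughly like the following. This is a sketch of one way to do it, not the commenter's actual code: the `convert` parameter and the `parse_bool` helper are invented names, and the function returns nested lists so that it runs without NumPy.

```python
import re


def parse_array_str(array_string, convert=float):
    """Parse str(ndarray) text into nested lists, applying convert to each element."""
    tokens = iter(re.findall(r'\[|\]|[^\[\]\s]+', array_string))

    first_token = next(tokens, None)
    if first_token is None:
        raise ValueError("Can't parse empty input.")
    if first_token != '[':
        # Input must represent a scalar
        if next(tokens, None) is not None:
            raise ValueError("Can't parse input.")
        return convert(first_token)

    list_form = []
    stack = [list_form]
    for token in tokens:
        if not stack:
            raise ValueError("Can't parse input - unexpected text after the final ].")
        if token == '[':
            stack.append([])             # enter a new list
            stack[-2].append(stack[-1])
        elif token == ']':
            stack.pop()                  # close a list
        else:
            stack[-1].append(convert(token))
    if stack:
        raise ValueError("Can't parse input - it might be missing text at the end.")
    return list_form


def parse_bool(token):
    """Translate the printed names; bool('False') would be True."""
    return {'True': True, 'False': False}[token]


print(parse_array_str("[[ True False]\n [False  True]]", convert=parse_bool))
# [[True, False], [False, True]]
print(parse_array_str("[[ 10 200]\n [  3   4]]", convert=int))
# [[10, 200], [3, 4]]
```

Wrapping the result in numpy.array(...) yields the array; `convert=complex` should cover complex input in the same way.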

MaxU - stand with Ukraine · Accepted Answer · 2017-05-09 21:47:52Z

1

UPDATE:

np.array(ast.literal_eval(re.sub(r'\]\s*\[',
                                 r'],[',
                                 re.sub(r'(\d+)\s+(\d+)', 
                                        r'\1,\2', 
                                        a.replace('\n','')))))

Test:

In [345]: a = np.random.rand(3,3,20).__str__()

In [346]: np.array(ast.literal_eval(re.sub(r'\]\s*\[',
     ...:                                  r'],[',
     ...:                                  re.sub(r'(\d+)\s+(\d+)',
     ...:                                         r'\1,\2',
     ...:                                         a.replace('\n','')))))
Out[346]:
array([[[  1.61804506e-01,   8.12734833e-01,   6.35872020e-01,   7.45560321e-01,   7.60322379e-01,   1.50271532e-01,   7.43559134e-01,   5.21169923e-01,   4.10560219e-01,   1.77891635e-01,
           8.77997042e-01,   5.52165694e-02,   4.40322089e-01,   8.82732323e-01,   3.12101843e-01,   9.49019544e-01,   1.69709407e-01,   5.35675968e-01,   3.53186538e-01,   2.39804555e-01],
        [  2.59834852e-01,   7.13464074e-01,   4.24374709e-01,   7.45214854e-01,   2.54193920e-01,   9.43753568e-01,   3.19657128e-02,   6.04311934e-01,   4.58913230e-01,   9.21777675e-01,
           7.60741980e-02,   8.25952339e-01,   1.37270639e-01,   7.42065132e-01,   9.05089275e-01,   9.90206513e-02,   2.00671342e-01,   9.29283429e-01,   8.87469279e-01,   2.78824797e-01],
        [  5.49303597e-01,   1.68139999e-01,   9.52643331e-01,   8.97801805e-01,   8.34317042e-01,   3.61338265e-01,   1.97822206e-01,   1.44672484e-01,   4.62311800e-01,   6.45563044e-01,
           3.96650080e-01,   9.66557989e-01,   5.55279111e-01,   6.95327885e-01,   8.77989215e-01,   3.09452892e-01,   4.34898544e-02,   6.18538982e-01,   6.11605477e-03,   5.30348496e-03]],

       [[  4.67741090e-01,   4.18749234e-01,   4.92742479e-01,   3.12952835e-01,   1.66866007e-01,   1.81524074e-01,   3.48737055e-01,   3.96121943e-01,   7.56894807e-01,   4.99569007e-02,
           9.48425036e-01,   1.30331685e-01,   3.60872691e-01,   4.98930072e-01,   7.14775531e-01,   5.50048525e-01,   6.12293600e-01,   6.24329775e-01,   3.74200599e-01,   6.77087300e-01],
        [  3.64029724e-01,   5.12225561e-01,   6.52844356e-01,   1.36063860e-01,   5.95311924e-01,   7.31286536e-01,   3.85353941e-01,   1.17983007e-01,   3.78948410e-01,   3.66223737e-01,
           4.78195933e-01,   3.46903190e-01,   7.59476546e-01,   4.38877386e-01,   7.33342832e-01,   9.38044045e-01,   6.80193266e-01,   1.76191976e-01,   2.84027688e-01,   8.85565762e-01],
        [  1.25801396e-01,   7.62014084e-01,   7.57817614e-01,   5.44511396e-01,   2.77615151e-01,   6.94968328e-01,   9.64537639e-01,   7.79804895e-01,   8.45911428e-01,   1.59562236e-01,
           7.14207030e-01,   9.26019437e-01,   1.84258959e-01,   8.37627772e-01,   9.72586483e-01,   3.87408269e-01,   1.95596555e-01,   3.51684372e-02,   7.14297398e-02,   2.70039164e-01]],

       [[  3.03855673e-01,   7.72762928e-01,   5.63591643e-01,   7.58142274e-01,   4.71340149e-01,   1.50447988e-01,   4.24416607e-01,   3.53647908e-01,   4.83022443e-01,   7.72650844e-01,
           4.05579568e-01,   4.64825394e-01,   3.74864150e-01,   8.04635163e-01,   3.29960889e-01,   8.82488417e-01,   6.05332753e-01,   1.84514406e-01,   4.47145930e-01,   6.96907260e-01],
        [  1.54041028e-01,   2.33380875e-01,   7.34935729e-01,   8.13397766e-01,   6.26194271e-02,   9.40103450e-01,   6.24356287e-01,   2.26074683e-01,   5.43054373e-01,   7.03495296e-02,
           4.68091539e-02,   7.30366454e-01,   5.27159134e-01,   1.33293015e-01,   4.68391358e-01,   8.25307079e-01,   9.74953928e-01,   2.20242983e-01,   3.42050900e-01,   7.86851567e-01],
        [  4.49176834e-01,   2.77129577e-01,   1.18051369e-01,   4.99016389e-01,   4.54702611e-04,   2.17932718e-01,   8.83065335e-01,   9.58966789e-02,   1.52448380e-01,   7.18588641e-01,
           3.73546613e-01,   1.66186769e-01,   5.80381932e-01,   3.42510041e-01,   6.75739930e-01,   1.85382205e-01,   3.26533424e-01,   7.35004900e-01,   9.22527439e-01,   9.96079190e-01]]])
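
One caveat worth checking before relying on the inner substitution above: `(\d+)\s+(\d+)` consumes the digits on both sides of each gap, and matches cannot overlap, so in a row of plain integers only every other gap gets a comma. Floats like the ones in the test escape this because the dot stops the second `\d+` after a single digit. The sketch below shows the difference next to a lookaround variant, which is a substitute suggested here and not part of the answer; it still does not handle negative numbers or floats printed with a trailing dot such as 0.

```python
import re

int_row = "[1 2 3 4]"
float_row = "[1.5e-01   2.5e-01   3.5e-01]"

# Pattern from the answer: the digits are part of the match
consuming_ints = re.sub(r'(\d+)\s+(\d+)', r'\1,\2', int_row)
consuming_floats = re.sub(r'(\d+)\s+(\d+)', r'\1,\2', float_row)

# Lookaround variant: only the whitespace itself is matched and replaced
lookaround_ints = re.sub(r'(?<=\d)\s+(?=\d)', ',', int_row)

print(consuming_ints)    # [1,2 3,4]
print(consuming_floats)  # [1.5e-01,2.5e-01,3.5e-01]
print(lookaround_ints)   # [1,2,3,4]
```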

Old answer:

We could try to use pandas:

import io
import pandas as pd

In [294]: pd.read_csv(io.StringIO(a.replace('\n', '').replace(']', '\n').replace('[','')),
     ...:             sep=r'\s+', header=None).values
Out[294]:
array([[ 0.96725219,  0.01808783,  0.63087793,  0.45407222,  0.30586779,  0.04848813,  0.01797095],
       [ 0.87762897,  0.07705762,  0.33049588,  0.91429797,  0.5776607 ,  0.18207652,  0.2355932 ],
       [ 0.68803166,  0.31540537,  0.92606902,  0.83542726,  0.43457601,  0.44952604,  0.35121332],
       [ 0.14366487,  0.23486924,  0.16421432,  0.27709387,  0.19646975,  0.8243488 ,  0.37708642],
       [ 0.07594925,  0.36608386,  0.02087877,  0.07507932,  0.40005067,  0.84625563,  0.62827931],
       [ 0.63662663,  0.41408688,  0.43447501,  0.22135816,  0.58944708,  0.66456168,  0.5871466 ],
       [ 0.16807584,  0.70981667,  0.18597074,  0.02034372,  0.94706437,  0.61333699,  0.8444439 ]])

NOTE: It might work only for 2D arrays without ... (ellipsis)

edited May 9, 2017 at 21:47

answered May 9, 2017 at 20:43

MaxU - stand with Ukraine

9 Comments

user2357112 Over a year ago

I'm pretty sure this fails if the rows get long enough that NumPy starts line wrapping, and it definitely fails for higher-dimensional arrays.

MaxU - stand with Ukraine Over a year ago

@user2357112, yes, thank you! It might work only for 2D arrays without ... (ellipsis)

user2357112 Over a year ago

I'm not even talking about ... abridged output; if printing a whole row on one line would take a line width of more than 75 characters, NumPy starts printing rows on multiple lines. That breaks your row detection for even fairly short rows.

MSeifert Over a year ago

yeah, lots of nans with the string representation of something like np.random.random((7, 7)), I copied the attempt + result in this gist

MSeifert Over a year ago

I'm glad you didn't delete it after all. It does fail for 3D inputs (it converted them to 2D) but it's still useful to have as a quick (& dirty) approach.
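
The line wrapping described in the comments above is easy to reproduce, and it can be avoided at the source if you control how the string is produced. A minimal sketch, assuming default print options; the seeded generator and the 7x7 shape are just example choices:

```python
import numpy as np

# Example data: 7 floats per row is already too wide for the default line width
arr = np.random.default_rng(0).random((7, 7))

# Default formatting wraps each row onto more than one line
default_text = str(arr)

# A generous max_line_width keeps every row on a single line
wide_text = np.array2string(arr, max_line_width=10_000)

print(len(default_text.splitlines()))  # more than 7
print(len(wide_text.splitlines()))     # 7
```

With one row per line, the row-based pandas approach has a chance on 2D input; for text that is already wrapped, a bracket-aware parser is the safer route.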
