8

How do you create a structured array from two columns in a DataFrame? I tried this:

df = pd.DataFrame(data=[[1,2],[10,20]], columns=['a','b'])
df

    a   b
0   1   2
1   10  20

x = np.array([([val for val in list(df['a'])],
               [val for val in list(df['b'])])])

But this gives me this:

array([[[ 1, 10],
        [ 2, 20]]])

But I wanted this:

[(1,2),(10,20)]

Thanks!

2
  • 1
    Because a package that I am using only takes input as a structured array. Why is this important? Commented Jul 11, 2018 at 8:00
  • Because there might be no need to create a list of tuples at all, and knowing the goal also affects the best way to build that list. Commented Jul 11, 2018 at 8:03

3 Answers

13

There are a couple of methods. You may experience a loss in performance and functionality relative to regular NumPy arrays.

record array

You can use pd.DataFrame.to_records with index=False. Technically, this is a record array, but for many purposes this will be sufficient.

res1 = df.to_records(index=False)

print(repr(res1))

rec.array([(1, 2), (10, 20)], 
          dtype=[('a', '<i8'), ('b', '<i8')])

structured array

Alternatively, you can construct a structured array manually: convert each row to a tuple, then pass a list of (name, dtype) tuples as the dtype parameter.

s = df.dtypes
res2 = np.array([tuple(x) for x in df.values], dtype=list(zip(s.index, s)))

print(repr(res2))

array([(1, 2), (10, 20)], 
      dtype=[('a', '<i8'), ('b', '<i8')])
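
If the field types need to differ from the DataFrame's own dtypes, one option is to spell out the dtype list by hand instead of deriving it from df.dtypes. The following is a minimal sketch; the target types ('f8' and 'i4') and the names field_types, rows and res3 are arbitrary choices for illustration, not something the question requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])

# Assumed target types, chosen only for illustration:
# field 'a' as 64-bit float, field 'b' as 32-bit integer.
field_types = [('a', 'f8'), ('b', 'i4')]

# One plain tuple per row; NumPy casts each element to its field's type.
rows = [tuple(row) for row in df.itertuples(index=False, name=None)]
res3 = np.array(rows, dtype=field_types)

print(res3.dtype.names)  # field names come from the dtype list
print(res3['a'])         # fields are accessed by key
```

The rows must be tuples rather than lists; NumPy treats a list of lists as a 2-D array and will not map it onto the named fields in the same way.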

What's the difference?

Very little. recarray is a subclass of ndarray, the regular NumPy array type. On the other hand, the structured array in the second example is of type ndarray.

type(res1)                    # numpy.recarray
isinstance(res1, np.ndarray)  # True
type(res2)                    # numpy.ndarray

The main difference is that record arrays allow field access by attribute, while structured arrays raise AttributeError:

print(repr(res1.a))
array([ 1, 10], dtype=int64)

print(res2.a)
AttributeError: 'numpy.ndarray' object has no attribute 'a'
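
The two types also convert into each other without much effort. As a sketch, assuming the same df as in the question (the variable names here are only illustrative): viewing a structured array as np.recarray adds attribute access, and np.asarray drops the recarray subclass again.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])

records = df.to_records(index=False)  # numpy.recarray
structured = np.array([(1, 2), (10, 20)], dtype=[('a', 'i8'), ('b', 'i8')])

# Structured array -> record array: a view, so no data is copied.
as_records = structured.view(np.recarray)
print(as_records.a)

# Record array -> plain ndarray: np.asarray does not keep subclasses.
as_plain = np.asarray(records)
print(type(as_plain))
print(as_plain['a'])  # key-based field access still works
```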

Related: NumPy “record array” or “structured array” or “recarray”

1

Use a list comprehension to convert the nested lists to tuples:

print ([tuple(x) for x in df.values.tolist()])
[(1, 2), (10, 20)]

Detail:

print (df.values.tolist())
[[1, 2], [10, 20]]

EDIT: You can convert with to_records and then pass the result to np.asarray (see the link):

df = pd.DataFrame(data=[[True, 1,2],[False, 10,20]], columns=['a','b','c'])
print (df)
       a   b   c
0   True   1   2
1  False  10  20

print (np.asarray(df.to_records(index=False)))
[( True,  1,  2) (False, 10, 20)]
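
To control the type of a field (for example, to make sure the first field is bool), one approach is to cast the column before converting. A minimal sketch, assuming column 'a' holds 0/1 integers that should become a boolean field; the data and the names converted and arr are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical input: 'a' is stored as 0/1 integers.
df = pd.DataFrame(data=[[1, 1, 2], [0, 10, 20]], columns=['a', 'b', 'c'])

# Cast only column 'a'; the other columns keep their dtypes.
converted = df.astype({'a': bool})

arr = np.asarray(converted.to_records(index=False))
print(arr.dtype)
print(arr['a'])
```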

3 Comments

Neither are numpy structured arrays. Is it possible to do this?
@KimO - Can you explain more?
Yes. docs.scipy.org/doc/numpy/user/basics.rec.html The result should be: array([(x,y), (x2,y2)])
0

Here's a one-liner:

list(df.apply(lambda x: tuple(x), axis=1))

or

df.apply(lambda x: tuple(x), axis=1).values
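
The two versions return different things. A quick check, assuming the df from the question, suggests the second one is a one-dimensional object array whose elements are Python tuples; it has no named fields, so it is not a NumPy structured array in the strict sense. The names as_list and as_array are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [10, 20]], columns=['a', 'b'])

as_list = list(df.apply(lambda x: tuple(x), axis=1))     # Python list of tuples
as_array = df.apply(lambda x: tuple(x), axis=1).values   # NumPy array of tuples

print(as_list)
print(as_array.dtype)        # element type of the array
print(as_array.dtype.names)  # field names, if any
```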

4 Comments

This is not a numpy structured array.. is that possible?
edited it, is the second version what you are looking for?
YES! Is there a way to control the types of the fields? For example, if the DataFrame has two columns and I want the first to turn into a "binary class event indicator"? As explained here: scikit-survival.readthedocs.io/en/latest/generated/… Search for "structured array". So "bool" type.
I strongly recommend you don't use object dtype for integers, even with structured arrays.
