Concatenate numpy arrays to two arrays using pandas, numpy or other

Question

I have a series of numpy arrays generated for example like this:

import random
N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']
df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

ie:

a    b    c    d    e
0.01 0.03 0.01 0.2  0.04
0.2  0.01 0.02 0.01 0.1
...

and I would like to format it so that it looks like this:

name    value
a       0.01
b       0.03
c       0.01
d       0.2
e       0.04
a       0.2
b       0.01
....

(order of data is not important)

I have tried pandas dataframe transpose:

df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

but the result looks like this:

a    0.1   0.2  0.01 0.2
b    0.3   0.1  0.2  0.01
....

Any idea on how to reformat the numpy arrays/pandas dataframe to have two columns of data?

code that generates "data" is incomplete

joel.wilson
– joel.wilson

2016-12-03 09:23:57 +00:00
Commented Dec 3, 2016 at 9:23 — joel.wilson
– joel.wilson, Commented Dec 3, 2016 at 9:23

jezrael · Accepted Answer · 2016-12-03 10:37:06Z

You can use numpy.tile for repeat column names and numpy.ravel for flatten values of DataFrame:

#random dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
   A  B  C  D  E
0  8  8  3  7  7
1  0  4  2  5  2
2  2  2  1  0  8
3  4  0  9  6  2
4  4  1  5  3  4

df2 = pd.DataFrame({
        "name": np.tile(df.columns, len(df.index)),
        "value": df.values.ravel()})
print (df2)        
   name  value
0     A      8
1     B      8
2     C      3
3     D      7
4     E      7
5     A      0
6     B      4
7     C      2
8     D      5
9     E      2
10    A      2
11    B      2
12    C      1
13    D      0
14    E      8
15    A      4
16    B      0
17    C      9
18    D      6
19    E      2
20    A      4
21    B      1
22    C      5
23    D      3
24    E      4

Timings (len(df) = 1M):

#random dataframe
np.random.seed(100)
N = 1000000
df = pd.DataFrame(np.random.randint(10, size=(N,5)), columns=list('abcde'))
print (df)

In [86]: %timeit (pd.DataFrame({"name": np.tile(df.columns, len(df.index)),"value": df.values.ravel()}))
10 loops, best of 3: 84.8 ms per loop

In [87]: %timeit (pd.DataFrame(np.column_stack((np.tile(df.columns, df.shape[0]), df.values.reshape(-1,1))), columns=['name', 'value']))
10 loops, best of 3: 196 ms per loop

In [88]: %timeit (df.stack().reset_index(level=0, drop=True).reset_index(name='value').rename(columns={'index':'name'}))
1 loop, best of 3: 344 ms per loop

If need output numpy array add numpy.column_stack:

print (np.column_stack((np.tile(df.columns, len(df.index)), df.values.ravel())))
[['a' 8]
 ['b' 8]
 ['c' 3]
 ['d' 7]
 ['e' 7]
 ['a' 0]
 ['b' 4]
 ['c' 2]
 ['d' 5]
 ['e' 2]
 ['a' 2]
 ['b' 2]
 ['c' 1]
 ['d' 0]
 ['e' 8]
 ['a' 4]
 ['b' 0]
 ['c' 9]
 ['d' 6]
 ['e' 2]
 ['a' 4]
 ['b' 1]
 ['c' 5]
 ['d' 3]
 ['e' 4]]

Nice solution! scales well too. But do note that np.column_stack doesn't preserve the dtypes.

MaxU - stand with Ukraine · Accepted Answer · 2016-12-03 09:25:34Z

is that what you want?

In [11]: df
Out[11]:
          a         b         c         d         e
0  0.791796  0.428642  0.887860  0.803709  0.860545
1  0.230401  0.105232  0.617007  0.557678  0.590459
2  0.448462  0.314422  0.207188  0.785642  0.022271
3  0.075631  0.707029  0.111538  0.769387  0.174297
4  0.707566  0.299966  0.197642  0.145841  0.231135

In [12]: df.stack().reset_index(level=0, drop=True).reset_index()
Out[12]:
   index         0
0      a  0.791796
1      b  0.428642
2      c  0.887860
3      d  0.803709
4      e  0.860545
5      a  0.230401
6      b  0.105232
7      c  0.617007
8      d  0.557678
9      e  0.590459
10     a  0.448462
11     b  0.314422
12     c  0.207188
13     d  0.785642
14     e  0.022271
15     a  0.075631
16     b  0.707029
17     c  0.111538
18     d  0.769387
19     e  0.174297
20     a  0.707566
21     b  0.299966
22     c  0.197642
23     d  0.145841
24     e  0.231135

dragon2fly · Accepted Answer · 2016-12-03 10:23:40Z

You just need to concat all the columns in df together. Since columns' name are different, you need to set them with the same name. If not, pandas will add new columns into the concat result.

import random
import pandas as pd

N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']

df = pd.DataFrame(data)
df.columns = names
df = df.transpose()
print df

#           0         1         2         3         4
# a  0.643042  0.061476  0.415979  0.209272  0.394414
# b  0.175363  0.580336  0.056173  0.468121  0.388956
# c  0.096257  0.570860  0.516667  0.892087  0.956790
# d  0.082906  0.340805  0.466074  0.010123  0.293006
# e  0.430240  0.759413  0.083779  0.442159  0.434603

df_col=[df[[i]] for i in range(len(df))]    # separate columns in df
for col in df_col:
    col.columns=['value']                   # change the columns' name

res = pd.concat(df_col)                     # concat them all together
res.index.names=['name']

print res

#          value
# name          
# a     0.643042
# b     0.175363
# c     0.096257
# d     0.082906
# e     0.430240
# a     0.061476
# b     0.580336
# c     0.570860
# d     0.340805
# e     0.759413
# a     0.415979
# b     0.056173
# c     0.516667
# d     0.466074
# e     0.083779
# a     0.209272
# b     0.468121
# c     0.892087
# d     0.010123
# e     0.442159
# a     0.394414
# b     0.388956
# c     0.956790
# d     0.293006
# e     0.434603

Collectives™ on Stack Overflow

Concatenate numpy arrays to two arrays using pandas, numpy or other

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related