
I have a series of numpy arrays generated for example like this:

import random
import pandas as pd
N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']
df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

That is:

a    b    c    d    e
0.01 0.03 0.01 0.2  0.04
0.2  0.01 0.02 0.01 0.1
...

and I would like to format it so that it looks like this:

name    value
a       0.01
b       0.03
c       0.01
d       0.2
e       0.04
a       0.2
b       0.01
....

(order of data is not important)

I have tried pandas dataframe transpose:

df = pd.DataFrame(data)
df = df.transpose()
df.columns = names

but the result looks like this:

a    0.1   0.2  0.01 0.2
b    0.3   0.1  0.2  0.01
....

Any idea on how to reformat the numpy arrays/pandas dataframe to have two columns of data?

    code that generates "data" is incomplete Commented Dec 3, 2016 at 9:23

3 Answers


You can use numpy.tile to repeat the column names and numpy.ravel to flatten the values of the DataFrame:

import numpy as np
import pandas as pd

# random DataFrame
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
   A  B  C  D  E
0  8  8  3  7  7
1  0  4  2  5  2
2  2  2  1  0  8
3  4  0  9  6  2
4  4  1  5  3  4
df2 = pd.DataFrame({
        "name": np.tile(df.columns, len(df.index)),
        "value": df.values.ravel()})
print (df2)
   name  value
0     A      8
1     B      8
2     C      3
3     D      7
4     E      7
5     A      0
6     B      4
7     C      2
8     D      5
9     E      2
10    A      2
11    B      2
12    C      1
13    D      0
14    E      8
15    A      4
16    B      0
17    C      9
18    D      6
19    E      2
20    A      4
21    B      1
22    C      5
23    D      3
24    E      4
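As a self-contained sketch of the same idea, the tile/ravel step can be wrapped in a small function. The name `to_long` and the sample values are made up for illustration; the approach assumes all columns share one dtype, since a mixed-dtype frame would be upcast when converted to a single array.

```python
import numpy as np
import pandas as pd


def to_long(df):
    """Hypothetical helper: reshape a wide frame into name/value rows."""
    return pd.DataFrame({
        # column labels repeated once per row: a, b, c, a, b, c, ...
        "name": np.tile(df.columns.to_numpy(), len(df)),
        # values flattened row by row, so they line up with the tiled labels
        "value": df.to_numpy().ravel(),
    })


# small illustrative frame, not the data from the question
wide = pd.DataFrame([[0.01, 0.03, 0.01], [0.2, 0.01, 0.02]], columns=list("abc"))
long_df = to_long(wide)
print(long_df)
```

The labels and values stay aligned because `ravel` walks the array row by row by default, which is the same order in which `np.tile` repeats the column names.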

Timings (len(df) = 1M):

#random dataframe
np.random.seed(100)
N = 1000000
df = pd.DataFrame(np.random.randint(10, size=(N,5)), columns=list('abcde'))
print (df)

In [86]: %timeit (pd.DataFrame({"name": np.tile(df.columns, len(df.index)),"value": df.values.ravel()}))
10 loops, best of 3: 84.8 ms per loop

In [87]: %timeit (pd.DataFrame(np.column_stack((np.tile(df.columns, df.shape[0]), df.values.reshape(-1,1))), columns=['name', 'value']))
10 loops, best of 3: 196 ms per loop

In [88]: %timeit (df.stack().reset_index(level=0, drop=True).reset_index(name='value').rename(columns={'index':'name'}))
1 loop, best of 3: 344 ms per loop

If you need a numpy array as output, use numpy.column_stack:

print (np.column_stack((np.tile(df.columns, len(df.index)), df.values.ravel())))
[['a' 8]
 ['b' 8]
 ['c' 3]
 ['d' 7]
 ['e' 7]
 ['a' 0]
 ['b' 4]
 ['c' 2]
 ['d' 5]
 ['e' 2]
 ['a' 2]
 ['b' 2]
 ['c' 1]
 ['d' 0]
 ['e' 8]
 ['a' 4]
 ['b' 0]
 ['c' 9]
 ['d' 6]
 ['e' 2]
 ['a' 4]
 ['b' 1]
 ['c' 5]
 ['d' 3]
 ['e' 4]]

1 Comment

Nice solution! It scales well too. Note, however, that np.column_stack doesn't preserve the dtypes.
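A minimal sketch of what this comment is pointing at, using a small made-up frame: a numpy array has one dtype for all of its elements, so stacking names next to integers cannot keep the integers as integers, whereas a DataFrame stores a dtype per column.

```python
import numpy as np
import pandas as pd

# small illustrative frame, not the data from the answer
df = pd.DataFrame([[8, 8, 3], [0, 4, 2]], columns=list("abc"))

names = np.tile(df.columns.to_numpy(), len(df))
values = df.to_numpy().ravel()

# one array has to hold both the names and the integers
stacked = np.column_stack((names, values))
# a DataFrame keeps a separate dtype for each column
framed = pd.DataFrame({"name": names, "value": values})

print(values.dtype)    # integer dtype of the original values
print(stacked.dtype)   # a single shared dtype, no longer the integer one
print(framed.dtypes)   # "value" keeps its integer dtype
```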

Is this what you want?

In [11]: df
Out[11]:
          a         b         c         d         e
0  0.791796  0.428642  0.887860  0.803709  0.860545
1  0.230401  0.105232  0.617007  0.557678  0.590459
2  0.448462  0.314422  0.207188  0.785642  0.022271
3  0.075631  0.707029  0.111538  0.769387  0.174297
4  0.707566  0.299966  0.197642  0.145841  0.231135

In [12]: df.stack().reset_index(level=0, drop=True).reset_index()
Out[12]:
   index         0
0      a  0.791796
1      b  0.428642
2      c  0.887860
3      d  0.803709
4      e  0.860545
5      a  0.230401
6      b  0.105232
7      c  0.617007
8      d  0.557678
9      e  0.590459
10     a  0.448462
11     b  0.314422
12     c  0.207188
13     d  0.785642
14     e  0.022271
15     a  0.075631
16     b  0.707029
17     c  0.111538
18     d  0.769387
19     e  0.174297
20     a  0.707566
21     b  0.299966
22     c  0.197642
23     d  0.145841
24     e  0.231135
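The result above has the columns `index` and `0` rather than `name` and `value`. One possible way to get the requested headers, sketched here on a small made-up frame, is to name the index level before resetting it and to pass `name=` to the final `reset_index`:

```python
import pandas as pd

# small illustrative frame, not the data shown above
df = pd.DataFrame([[0.79, 0.43, 0.89], [0.23, 0.11, 0.62]], columns=list("abc"))

long_df = (
    df.stack()                          # Series indexed by (row label, column label)
      .reset_index(level=0, drop=True)  # drop the row label, keep the column label
      .rename_axis("name")              # name the remaining index level
      .reset_index(name="value")        # move it into a column and name the values
)
print(long_df)
```

Depending on the pandas version, `stack` may emit a FutureWarning about its newer implementation; for a frame like this one, without missing values, the result should be the same either way.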



You just need to concatenate all the columns of df. Because the columns have different names, you first need to give them all the same name; otherwise pandas will put each one into a separate column of the concatenated result.

import random
import pandas as pd

N = 5
data = [[random.random() for i in range(N)] for j in range(N)]
names = ['a','b','c','d','e']

df = pd.DataFrame(data)
df.columns = names
df = df.transpose()
print(df)

#           0         1         2         3         4
# a  0.643042  0.061476  0.415979  0.209272  0.394414
# b  0.175363  0.580336  0.056173  0.468121  0.388956
# c  0.096257  0.570860  0.516667  0.892087  0.956790
# d  0.082906  0.340805  0.466074  0.010123  0.293006
# e  0.430240  0.759413  0.083779  0.442159  0.434603

# one single-column frame per column of df, each renamed to 'value'
df_col = [df[[i]].rename(columns={i: 'value'}) for i in df.columns]

res = pd.concat(df_col)                     # concat them all together
res.index.names = ['name']

print(res)

#          value
# name          
# a     0.643042
# b     0.175363
# c     0.096257
# d     0.082906
# e     0.430240
# a     0.061476
# b     0.580336
# c     0.570860
# d     0.340805
# e     0.759413
# a     0.415979
# b     0.056173
# c     0.516667
# d     0.466074
# e     0.083779
# a     0.209272
# b     0.468121
# c     0.892087
# d     0.010123
# e     0.442159
# a     0.394414
# b     0.388956
# c     0.956790
# d     0.293006
# e     0.434603
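For comparison, pandas also has a built-in reshaping method, `DataFrame.melt`, that produces the same two columns without the explicit loop. This is a sketch on a small made-up frame, and it assumes the names are the column labels (that is, the frame before the `transpose()` call above):

```python
import pandas as pd

# small illustrative frame with the names as column labels
df = pd.DataFrame([[0.64, 0.18, 0.10], [0.06, 0.58, 0.57]], columns=list("abc"))

# every column becomes a block of (name, value) rows
long_df = df.melt(var_name="name", value_name="value")
print(long_df)
```

Unlike the concat result, `melt` groups the rows by name (all `a` rows, then all `b` rows, and so on), which should be fine here because the question says the order does not matter.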

