I'm trying to create a NumPy array from the "label" column of a pandas DataFrame.

My df:

      label                                             vector
0         0   1:0.044509422 2:-0.03092437 3:0.054365806 4:-...
1         0   1:-0.007471546 2:-0.062329583 3:0.012314787 4...
2         0   1:-0.009525825 2:0.0028720177 3:0.0029517233 ...
3         1   1:-0.0040618754 2:-0.03754585 3:0.008025528 4...
4         0   1:0.039150625 2:-0.08689039 3:0.09603256 4:0....
...     ...                                                ...
59996     1   1:0.01846487 2:-0.012882819 3:0.035375785 4:-...
59997     1   1:0.01435293 2:-0.00683616 3:0.009475072 4:-0...
59998     1   1:0.018322088 2:-0.017116712 3:0.013021051 4:...
59999     0   1:0.014471473 2:-0.023652712 3:0.031210974 4:...
60000     1   1:0.00888336 2:-0.006902163 3:0.022569133 4:0...

As you can see, the DataFrame has two columns: label and vector. For the label column I'm using this approach:

y = pd.DataFrame([df.label])

print(y.astype(float).to_numpy())

print(y)

As a result, I get this:


   0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15     ... 59985 59986 59987 59988 59989 59990 59991 59992 59993 59994 59995 59996 59997 59998 59999 60000
label     0     0     0     1     0     0     0     0     0     0     0     1     0     1     0     1  ...     1     1     1     0     1     0     0     1     1     1     1     1     1     1     0     1

[1 rows x 60001 columns]

However, the expected output should be:

     0         
0    0
1    0
2    0
3    1

... ...

[60001 rows x 1 columns]  

Instead of an array with [1 rows x 60001 columns], I would like an array with [60001 rows x 1 columns].

Thanks for your time

  • Your question was initially confusing, so I've edited it for clarity. How about y = df[['label']].to_numpy()? Does it do what you want? Commented Apr 20, 2020 at 0:18
  • Thanks for your reply, but maybe I was not clear enough in my question. My df has two columns: vector (X) and label (y). I would like to have two separate arrays. The vector is correctly represented. My problem is transforming the label column: my df has 60001 records, and the output is currently [1 rows x 60001 columns], but I would like [60001 rows x 1 columns]. Commented Apr 20, 2020 at 0:21
  • Refer to my first comment; it should answer your question. Commented Apr 20, 2020 at 0:24
  • Maybe df[['label']].values? Commented Apr 20, 2020 at 0:26

2 Answers

"Instead of an array with [1 rows x 60001 columns] I would like to have an array with [60001 rows x 1 columns]": if I understand your issue correctly and you need to reshape your array, use the following (where y is the NumPy array returned by to_numpy()):

y = y.reshape(-1, 1)

This converts your array into a shape with one column and works out the number of rows for you (the dimension given as -1 is calculated automatically from the array's size and the other dimensions). So you can do either of these:

Your proposed way + reshape:

y = pd.DataFrame([df.label]).astype(float).to_numpy().reshape(-1, 1)

Or @cs95's suggested answer (which results in the same array):

y = df[['label']].astype(float).to_numpy()
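
As a quick check of the two options above, here is a minimal sketch. The five-row `df` is a made-up stand-in for the asker's frame (only the column names are taken from the question), so treat it as an illustration rather than a reproduction of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame; the values are invented.
df = pd.DataFrame({
    "label": [0, 0, 0, 1, 0],
    "vector": ["1:0.04", "1:-0.01", "1:-0.01", "1:-0.00", "1:0.04"],
})

# Option 1: build the (1, n) row array as in the question, then reshape it
# to n rows and 1 column. The -1 lets NumPy infer the number of rows.
y_reshaped = pd.DataFrame([df.label]).astype(float).to_numpy().reshape(-1, 1)

# Option 2: select the column with a list, which keeps the data 2-D
# with one column from the start, so no reshape is needed.
y_selected = df[["label"]].astype(float).to_numpy()

print(y_reshaped.shape)
print(y_selected.shape)
print(np.array_equal(y_reshaped, y_selected))
```

Both arrays should come out with one column and as many rows as the frame, which is why the second option is usually the shorter way to write it.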


If you start with a dataframe

In [98]: df                                                                                            
Out[98]: 
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

and select a column by name, you get a Series:

In [99]: df.a                            # df['a']                                                              
Out[99]: 
0    0
1    4
2    8
Name: a, dtype: int64
In [100]: type(_)                                                                                      
Out[100]: pandas.core.series.Series

The to_numpy() of the Series is a 1d array:

In [101]: df.a.to_numpy()                                                                              
Out[101]: array([0, 4, 8])
In [102]: _.shape                                                                                      
Out[102]: (3,)

But you've taken the Series, and turned it back into a dataframe:

In [103]: y = pd.DataFrame([df.a])                                                                     
In [104]: y                                                                                            
Out[104]: 
   0  1  2
a  0  4  8

Was that your intention? In any case, the extracted array is 2d:

In [105]: y.to_numpy()                                                                                 
Out[105]: array([[0, 4, 8]])
In [106]: _.shape                                                                                      
Out[106]: (1, 3)

We can reshape it, or take its 'transpose':

In [107]: __.T                # reshape(3,1)                                                                         
Out[107]: 
array([[0],
       [4],
       [8]])

If we omit the [] from the y expression, we get a different dataframe and the desired 'column' array:

In [109]: pd.DataFrame(df.a)                                                                           
Out[109]: 
   a
0  0
1  4
2  8
In [110]: pd.DataFrame(df.a).to_numpy()                                                                
Out[110]: 
array([[0],
       [4],
       [8]])

Another option is to select the column with a list:

In [111]: df[['a']]                                                                                    
Out[111]: 
   a
0  0
1  4
2  8

A Series is the pandas version of a 1d numpy array. It has row indices, but no column ones. A DataFrame is 2d, with rows and columns.
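
The Series/DataFrame distinction can be seen directly in a short script. This is only a sketch; the frame reuses the `a` and `b` columns from the example above, and the variable names `s` and `f` are mine.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 4, 8], "b": [1, 5, 9]})

s = df["a"]      # a single label selects a Series (1d)
f = df[["a"]]    # a list of labels selects a DataFrame (2d), even with one column

# The array you get from to_numpy() follows the dimensionality of the object.
print(type(s).__name__, s.ndim, s.to_numpy().shape)
print(type(f).__name__, f.ndim, f.to_numpy().shape)
```

So the extra pair of brackets is what decides whether you end up with a 1d array or a one-column 2d array.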

Keep in mind that a numpy array can have shapes (3,), (1,3) and (3,1), all with the same 3 elements.
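
To make the three shapes concrete, here is a small sketch in plain NumPy (the names `flat`, `row` and `col` are just for illustration):

```python
import numpy as np

flat = np.array([0, 4, 8])    # 1d, shape (3,)
row = flat.reshape(1, 3)      # 2d, one row
col = flat.reshape(-1, 1)     # 2d, one column; -1 is inferred as 3

print(flat.shape, row.shape, col.shape)

# Transposing swaps the row and column forms, but leaves a 1d array unchanged.
print(row.T.shape, flat.T.shape)

# ravel() flattens either 2d form back to 1d.
print(col.ravel().shape)
```

All three arrays hold the same three values; only the way they are laid out along the axes differs.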
