How can I get the index or column of a DataFrame as a NumPy array or Python list?
- Also, related: Convert pandas dataframe to NumPy array. (cs95, commented Feb 5, 2019 at 5:49)
- Does this answer your question? Convert pandas dataframe to NumPy array. (AMC, commented Jan 7, 2020 at 19:45)
- NOTE: Having to convert a pandas DataFrame to an array (or list) like this can be indicative of other issues. I strongly recommend ensuring that a DataFrame is the appropriate data structure for your particular use case, and that pandas does not include any way of performing the operations you're interested in. (AMC, commented Jan 7, 2020 at 20:22)
- Concerning my vote to reopen this question: Technically, a pandas Series is not the same as a pandas DataFrame. The answers may be the same, but the questions are definitely different. (Serge Stroobandt, commented Aug 25, 2021 at 9:51)
8 Answers
To get a NumPy array, you should use the values attribute:
In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c']); df
A B
a 1 4
b 2 5
c 3 6
In [2]: df.index.values
Out[2]: array(['a', 'b', 'c'], dtype=object)
This accesses how the data is already stored, so there isn't any need for a conversion.
Note: This attribute is also available for many other pandas objects.
In [3]: df['A'].values
Out[3]: array([1, 2, 3])
To get the index as a list, call tolist:
In [4]: df.index.tolist()
Out[4]: ['a', 'b', 'c']
And similarly, for columns.
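As a minimal sketch of the "similarly, for columns" case, reusing the same example frame as above (the variable names column_array and column_list are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

column_array = df.columns.values   # array of column labels
column_list = df.columns.tolist()  # plain Python list of column labels

print(column_array)
print(column_list)  # ['A', 'B']
```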
2 Comments
.values is deprecated, .to_numpy() is the suggested replacement if you want a NumPy array. Can you expand on "This accesses how the data is already stored, so there's no need for a conversion"?
You can use df.index to access the index object and then get the values in a list using df.index.tolist(). Similarly, you can use df['col'].tolist() for Series.
3 Comments
df.index.values.tolist()
pandas >= 0.24
Deprecate your usage of .values in favour of these methods!
From v0.24.0 onwards, we will have two brand spanking new, preferred methods for obtaining NumPy arrays from Index, Series, and DataFrame objects: they are to_numpy(), and .array. Regarding usage, the docs mention:
We haven’t removed or deprecated Series.values or DataFrame.values, but we highly recommend and using .array or .to_numpy() instead.
See this section of the v0.24.0 release notes for more information.
df.index.to_numpy()
# array(['a', 'b'], dtype=object)
df['A'].to_numpy()
# array([1, 4])
By default, a view is returned where possible, so any modifications made will affect the original.
v = df.index.to_numpy()
v[0] = -1
df
A B
-1 1 2
b 4 5
If you need a copy instead, use to_numpy(copy=True):
v = df.index.to_numpy(copy=True)
v[-1] = -123
df
A B
a 1 2
b 4 5
Note that this function also works for DataFrames (while .array does not).
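A small sketch of that last point, assuming the same two-row frame used in this answer's setup (frame_values and has_array_attr are illustrative names, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])

frame_values = df.to_numpy()           # 2-D array, one row per DataFrame row
has_array_attr = hasattr(df, 'array')  # a DataFrame exposes no .array attribute

print(frame_values.shape)  # (2, 2)
print(has_array_attr)      # False
```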
array Attribute
This attribute returns an ExtensionArray object that backs the Index/Series.
pd.__version__
# '0.24.0rc1'
# Setup.
df = pd.DataFrame([[1, 2], [4, 5]], columns=['A', 'B'], index=['a', 'b'])
df
A B
a 1 2
b 4 5
df.index.array
# <PandasArray>
# ['a', 'b']
# Length: 2, dtype: object
df['A'].array
# <PandasArray>
# [1, 4]
# Length: 2, dtype: int64
From here, it is possible to get a list using list:
list(df.index.array)
# ['a', 'b']
list(df['A'].array)
# [1, 4]
or, just directly call .tolist():
df.index.tolist()
# ['a', 'b']
df['A'].tolist()
# [1, 4]
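One difference between the two routes that may matter in practice is the element type. The sketch below uses list() over to_numpy() (rather than over .array, as in the text) to keep the behaviour easy to check; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 4], name='A')

from_tolist = series.tolist()        # elements are converted to Python ints
from_list = list(series.to_numpy())  # elements stay NumPy integer scalars

print(type(from_tolist[0]))
print(type(from_list[0]))
```

If the list is going to be serialised (for example to JSON), tolist() is probably the safer choice, since it yields built-in Python types.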
Regarding what is returned, the docs mention,
For Series and Indexes backed by normal NumPy arrays, Series.array will return a new arrays.PandasArray, which is a thin (no-copy) wrapper around a numpy.ndarray. arrays.PandasArray isn’t especially useful on its own, but it does provide the same interface as any extension array defined in pandas or by a third-party library.
So, to summarise, .array will return either
- The existing ExtensionArray backing the Index/Series, or
- If there is a NumPy array backing the series, a new ExtensionArray object is created as a thin wrapper over the underlying array.
Rationale for adding TWO new methods
These functions were added as a result of discussions under two GitHub issues, GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]
These two functions aim to improve the consistency of the API, which is a major step in the right direction.
Lastly, .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.
1 Comment
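S = pd.Series([3, 4]); np.asarray(S) is S.values surprised me; would you know if this is documented anywhere? (numpy 1.21.5, pandas 1.3.5)
Since pandas v0.13 you can also use get_values (note that get_values was deprecated in pandas 0.25 and removed in 1.0, so this only applies to older versions):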
df.index.get_values()
5 Comments
get_values just calls .values. It is more characters to type.
I converted the pandas DataFrame column to a list and then used the basic list.index(). Something like this:
dd = list(zone[0])  # where zone[0] is some specific column of the table
idx = dd.index(filename[i])
You now have your index value as idx.
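A self-contained sketch of the same idea. The data here is hypothetical and merely stands in for the answer's zone table and filename values:

```python
import pandas as pd

# Hypothetical data standing in for the answer's `zone` table and `filename` values.
zone = pd.DataFrame({0: ['a.txt', 'b.txt', 'c.txt']})
target = 'b.txt'

values = zone[0].tolist()   # column as a plain Python list
idx = values.index(target)  # position of the first match; raises ValueError if absent

print(idx)  # 1
```

Note that list.index() returns the position of the first match only, so this approach assumes the values in the column are unique.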
1 Comment
Below is a simple way to convert a dataframe column into a NumPy array.
df = pd.DataFrame(somedict)
ytrain = df['label']
ytrain_numpy = np.array([x for x in ytrain])  # ytrain is already the 'label' column
ytrain_numpy is a NumPy array.
I tried to_numpy(), but it gave me the error below:
TypeError: no supported conversion for types: (dtype('O'),) while doing Binary Relevance classification using Linear SVC.
to_numpy() did convert the DataFrame column into a NumPy array, but each inner element was still a list, which is why the above error was observed.
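The difference can be reproduced with a small sketch. The 'label' column below is hypothetical, and the approach assumes every cell holds a list of the same length:

```python
import numpy as np
import pandas as pd

# Hypothetical multi-label column: each cell holds a list of 0/1 flags.
df = pd.DataFrame({'label': [[1, 0, 1], [0, 1, 0]]})

as_object = df['label'].to_numpy()          # 1-D object array whose elements are lists
as_matrix = np.array(df['label'].tolist())  # 2-D numeric array built from the nested lists

print(as_object.dtype, as_object.shape)  # object (2,)
print(as_matrix.shape)                   # (2, 3)
```

The object-dtype array is what tends to trigger "no supported conversion" errors downstream; building the array from the nested lists gives a regular numeric matrix instead.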
1 Comment
to_numpy, though.