Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

Question

I want to extract the values from two different columns of a pandas dataframe and put them in a list with no duplicate values.

I have tried the following:

import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    # "not in" scans the whole list for every element
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])

This works but it is taking too long. The dataframe contains around 30 million rows.

Example:

  column1 column2 
1   adr1   adr2   
2   adr1   adr2   
3   adr3   adr4   
4   adr4   adr5

Should generate the list with the values:

[adr1, adr2, adr3, adr4, adr5]

Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows.

np.unique(df.values). The default is to flatten arrays, so this does exactly what you want.
– ALollz, Commented Feb 21, 2019 at 18:56
@ALollz is it normal that the contiguous order is not preserved? I need it to be contiguous.
– alejo, Commented Feb 21, 2019 at 19:57
@alejo then try pd.unique(df.values.ravel()). pd.unique preserves order, while np.unique sorts.
– ALollz, Commented Feb 21, 2019 at 20:08
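
The difference between the two calls suggested in these comments can be seen on a small frame. This is only a sketch: the sample data is made up for illustration (it is not the asker's 30-million-row dataframe), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the first row deliberately does not start at "adr1"
df = pd.DataFrame({
    "column1": ["adr3", "adr1", "adr1", "adr4"],
    "column2": ["adr4", "adr2", "adr2", "adr5"],
})

# Flatten the two columns row by row into a 1-D array
flat = df[["column1", "column2"]].values.ravel()

# np.unique returns the unique values in sorted order
sorted_unique = np.unique(flat).tolist()

# pd.unique returns the unique values in order of first appearance
ordered_unique = pd.unique(flat).tolist()

print(sorted_unique)
print(ordered_unique)
```

Both calls avoid the repeated list scan of the original loop, so either should scale much better to a large frame.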

meW · Accepted Answer · 2019-02-21 18:59:00Z

2

@ALollz gave the right answer; I'll extend from there. To get a plain list as expected, just use list(np.unique(df.values)).


Valdi_Bo · Accepted Answer · 2019-02-21 20:21:54Z

1

You can use just np.unique(df) (maybe this is the shortest version).

Formally, the first parameter of np.unique should be an array_like object, but as I checked, you can also pass just a DataFrame.

Of course, if you want a plain list rather than an ndarray, write np.unique(df).tolist().
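
As a quick check of the claim above, the sketch below passes a DataFrame straight to np.unique and compares the result with the df.values form. The frame reproduces the example from the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The example table from the question
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# np.unique converts its argument to an array, so a DataFrame is accepted
from_frame = np.unique(df).tolist()

# Same call on the underlying ndarray, for comparison
from_values = np.unique(df.values).tolist()

print(from_frame)
```

Note that the result is sorted, which happens to match the order of appearance for this particular sample.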

Edit following your comment

If you want the list unique but in the order of appearance, write:

pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()

Operation order:

reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
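
The steps listed above can be written out one at a time, which makes the intermediate objects easier to inspect. This is a sketch on made-up sample data, with hypothetical variable names; the chained one-liner in the answer does the same thing.

```python
import pandas as pd

# Hypothetical sample data where order of appearance differs from sorted order
df = pd.DataFrame({
    "column1": ["adr3", "adr1", "adr1", "adr4"],
    "column2": ["adr4", "adr2", "adr2", "adr5"],
})

# Step 1: reshape the (4, 2) array into a single column of shape (8, 1)
flat_column = df.values.reshape(-1, 1)

# Step 2: wrap it in a DataFrame; the only column gets the default name 0
as_frame = pd.DataFrame(flat_column)

# Step 3: select that column as a Series
only_column = as_frame[0]

# Step 4: drop repeated values, keeping the first occurrence of each
deduplicated = only_column.drop_duplicates()

# Step 5: convert to a plain Python list
result = deduplicated.tolist()

print(result)
```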

edited Feb 21, 2019 at 20:21

answered Feb 21, 2019 at 19:40

2 Comments

alejo Over a year ago

Thanks @Valdi_Bo, is it normal that the contiguous order is not preserved? I need it to be contiguous.

Valdi_Bo Over a year ago

Do you mean the order of appearance in the source table? The documentation states that if no axis is given, the input array is flattened (I'm not sure in which order). The other step where the order can change is np.unique itself: it returns the unique elements sorted.
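
On the question of flattening order raised in this comment, a small sketch may help. It uses ndarray.ravel, whose default is row-major (C) order, and contrasts it with column-major order. The frame is the example from the question; the variable names are hypothetical.

```python
import pandas as pd

# The example table from the question
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

arr = df.values

# Default order="C": walk the table row by row
row_major = arr.ravel().tolist()

# order="F": walk the table column by column instead
column_major = arr.ravel(order="F").tolist()

print(row_major)
print(column_major)
```

So when the flattened array is passed to an order-preserving function such as pd.unique, "order of appearance" means reading the table row by row unless a different order is requested.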
