I want to extract the values from two columns of a pandas DataFrame and put them in a list with no duplicate values.

I have tried the following:

import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    # "not in" scans the whole list on every element, which is what makes this slow
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])

This works but it is taking too long. The dataframe contains around 30 million rows.

Example:

  column1 column2 
1   adr1   adr2   
2   adr1   adr2   
3   adr3   adr4   
4   adr4   adr5   

Should generate the list with the values:

[adr1, adr2, adr3, adr4, adr5]

Can you please help me find a more efficient way of doing this, given that the dataframe contains 30 million rows?

  • np.unique(df.values). The default is to flatten arrays, so this does exactly what you want. Commented Feb 21, 2019 at 18:56
  • list(np.unique(df.to_numpy())) Commented Feb 21, 2019 at 18:59
  • Possible duplicate of pandas unique values multiple columns Commented Feb 21, 2019 at 19:00
  • @ALollz is it normal that the contiguous order is not preserved? I need it to be contiguous. Commented Feb 21, 2019 at 19:57
  • @alejo then try pd.unique(df.values.ravel()). pd.unique preserves the order of appearance, while np.unique sorts. Commented Feb 21, 2019 at 20:08

2 Answers

2

@ALollz gave the right answer, and I'll extend it. To get a plain list, as expected in the question, use list(np.unique(df.values)).
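
As a minimal sketch of the above, here it is applied to the example from the question. The DataFrame is rebuilt by hand for illustration only. Note that np.unique returns the values sorted, not in order of appearance.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question's example
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# np.unique flattens the 2-D array and returns the sorted unique values
unique_values = list(np.unique(df.values))
print(unique_values)
```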


1

You can use just np.unique(df) (maybe this is the shortest version).

Formally, the first parameter of np.unique should be an array_like object, but as I checked, you can also pass just a DataFrame.

Of course, if you want a plain list rather than an ndarray, write np.unique(df).tolist().
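
If the DataFrame has more columns than the two of interest, one option is to select them first, so that the other columns do not end up in the result. A minimal sketch, where the "other" column is a hypothetical extra column that is not in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
    "other": [10, 20, 30, 40],  # hypothetical extra column, not in the question
})

# Restrict to the two columns of interest, then pass the DataFrame straight to np.unique
values = np.unique(df[["column1", "column2"]]).tolist()
print(values)
```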

Edit following your comment

If you want the unique values in their order of appearance, write:

pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()

Operation order:

  • reshape changes the source array into a single column.
  • Then a DataFrame is created, with default column name = 0.
  • Then [0] takes just this (the only) column.
  • drop_duplicates does exactly what its name says.
  • And the last step: tolist converts to a plain list.
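
The steps above can be sketched as follows, next to the pd.unique(df.values.ravel()) variant suggested in the comments on the question. The sample data is made up so that the order of appearance differs from the sorted order. On this data both variants give the same list.

```python
import pandas as pd

# Made-up sample where order of appearance (b, a, d, c) differs from sorted order
df = pd.DataFrame({
    "column1": ["b", "b", "d"],
    "column2": ["a", "a", "c"],
})

# Variant from this answer: single column -> drop_duplicates -> list
via_drop_duplicates = (
    pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist()
)

# Variant from the comments: flatten, then pd.unique (keeps order of appearance)
via_pd_unique = pd.unique(df.values.ravel()).tolist()

print(via_drop_duplicates)
print(via_pd_unique)
```

With 30 million rows it is probably worth timing both variants on the real data before picking one.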

2 Comments

Thanks @Valdi_Bo, is it normal that the contiguous order is not preserved? I need it to be contiguous.
Do you mean the order of appearance in the source table? The documentation states that if no axis is given, the input array is flattened (I am not sure in which order). Another step where the order can change is the np.unique function itself: it seems that the result is sorted.
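
On the flattening order, here is a small sketch using the question's example, rebuilt by hand. ravel() reads the array row by row by default (C order), and order="F" reads it column by column. np.unique sorts its result either way.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt by hand from the question's example
df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})
arr = df.to_numpy()

# Row by row: adr1, adr2, adr1, adr2, adr3, adr4, adr4, adr5
row_major = pd.unique(arr.ravel()).tolist()

# Column by column: all of column1 first, then all of column2
column_major = pd.unique(arr.ravel(order="F")).tolist()

# Sorted, independent of the flattening order
sorted_values = np.unique(arr).tolist()

print(row_major)
print(column_major)
print(sorted_values)
```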
