
Given a pd.DataFrame with 0.0 < values < 1.0, I would like to convert it to binary values 0/1 according to a defined threshold eps = 0.5:

      0     1     2
0  0.35  0.20  0.81
1  0.41  0.75  0.59
2  0.62  0.40  0.94
3  0.17  0.51  0.29

Right now, I only have this for loop, which takes quite a long time for a large dataset:

import numpy as np
import pandas as pd

data = np.array([[.35, .2, .81],[.41, .75, .59],
                [.62, .4, .94], [.17, .51, .29]])

df = pd.DataFrame(data, index=range(data.shape[0]), columns=range(data.shape[1]))
eps = .5
b = np.zeros((df.shape[0], df.shape[1]))
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.loc[i,j] < eps:
            b[i,j] = 0
        else:
            b[i,j] = 1
df_bin = pd.DataFrame(b, columns=df.columns, index=df.index)

Does anybody know a more efficient way to convert to binary values? The expected result is:

     0    1    2
0  0.0  0.0  1.0
1  0.0  1.0  1.0
2  1.0  0.0  1.0
3  0.0  1.0  0.0

Thanks,


3 Answers


df.round

>>> df.round()

np.round

>>> np.round(df)

astype

>>> df.ge(0.5).astype(int)

All of these yield the same values (the astype(int) variant returns them as integers 0 and 1 rather than floats):

     0    1    2
0  0.0  0.0  1.0
1  0.0  1.0  1.0
2  1.0  0.0  1.0
3  0.0  1.0  0.0

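Note: round works here because rounding to the nearest integer puts the threshold at .5 for values between 0 and 1. One edge case differs: a value of exactly 0.5 rounds to 0.0 (round half to even), whereas the comparison in the third solution maps it to 1. For custom thresholds, use the third solution.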
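As a minimal sketch of the custom-threshold case, the comparison can be wrapped in a small helper. The name binarize and the threshold 0.7 are illustrative assumptions, not part of the answer above; the last lines show how an exact 0.5 is treated by round and by ge.

```python
import pandas as pd

def binarize(frame: pd.DataFrame, eps: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: 1 where value >= eps, else 0."""
    return frame.ge(eps).astype(int)

df = pd.DataFrame([[.35, .2, .81], [.41, .75, .59],
                   [.62, .4, .94], [.17, .51, .29]])

# A threshold other than 0.5 cannot be expressed with round().
df_bin = binarize(df, eps=0.7)

# Edge case: exactly 0.5.
edge = pd.DataFrame({"x": [0.5]})
half_rounded = edge.round().iloc[0, 0]      # round half to even gives 0.0
half_compared = binarize(edge).iloc[0, 0]   # 0.5 >= 0.5 gives 1
```

If values equal to the threshold should map to 0 instead, gt can be used in place of ge.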



Or you can use np.where() and assign the values to the underlying array:

df[:]=np.where(df<0.5,0,1)

   0  1  2
0  0  0  1
1  0  1  1
2  1  0  1
3  0  1  0
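If the original frame should stay untouched, a variant of the same idea is to build a new DataFrame from the np.where result. This is only a sketch; the name df_bin is borrowed from the question, and eps is assumed to be 0.5 as there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[.35, .2, .81], [.41, .75, .59],
                   [.62, .4, .94], [.17, .51, .29]])
eps = 0.5

# np.where returns a plain NumPy array, so the row and column
# labels are re-attached explicitly when building the new frame.
df_bin = pd.DataFrame(np.where(df < eps, 0, 1),
                      index=df.index, columns=df.columns)
```

Here df keeps its float values and df_bin holds the integer result.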



Since we have quite a few answers, all using different methods, I was curious how their speed compares. I thought I'd share:

# create big test dataframe
dfbig = pd.concat([df]*200000, ignore_index=True)
print(dfbig.shape)

(800000, 3)

# pandas round()
%%timeit
dfbig.round()

101 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numpy round()
%%timeit
np.round(dfbig)

104 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# pandas .ge & .astype
%%timeit
dfbig.ge(0.5).astype(int)

9.32 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# numpy.where
%%timeit
np.where(dfbig<0.5, 0, 1)

21.5 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

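Conclusion (fastest to slowest):

  1. pandas ge & astype (9.32 ms)
  2. np.where (21.5 ms)
  3. pandas round and np.round (101 ms and 104 ms, effectively tied given the measured spread)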
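The %%timeit magic is only available in IPython/Jupyter. A rough sketch of the same comparison with the standard-library timeit module might look like the following; the frame size, repeat counts, and the names candidates and timings are assumptions chosen to keep the run short, so absolute numbers will differ from those above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[.35, .2, .81], [.41, .75, .59],
                   [.62, .4, .94], [.17, .51, .29]])

# Smaller than the 200000 copies used above, to keep the run short.
dfbig = pd.concat([df] * 1000, ignore_index=True)

candidates = {
    "pandas round": lambda: dfbig.round(),
    "np.round": lambda: np.round(dfbig),
    "ge + astype": lambda: dfbig.ge(0.5).astype(int),
    "np.where": lambda: np.where(dfbig < 0.5, 0, 1),
}

# Best of 3 repeats, 10 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=10, repeat=3)) / 10
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:<14}{seconds * 1e3:8.3f} ms per call")
```

The ranking can shift with the pandas and NumPy versions and with the frame size, so it is worth re-running on data shaped like your own.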
