
I have a DataFrame, e.g.:

Event['EVENT_ID'] = [4162, 4161, 4160, 4159, 4158, 4157, 4156, 4155, 4154]

I need to convert each row's value to binary.

Event['b'] = bin(Event['EVENT_ID']) doesn't work:
TypeError: cannot convert the series to <class 'int'>

Expected: a new column with the binary representation, with the 0b prefix removed and the column split into 16 separate columns.

bin(4162) = '0b1000001000010'
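For context: bin expects a plain Python int, which is why calling it on a whole Series raises the TypeError above. One possible element-wise fix (a sketch, using Series.apply rather than any of the answers below) would be:

```python
import pandas as pd

Event = pd.DataFrame({'EVENT_ID': [4162, 4161, 4160]})

# bin() works per element when applied row-wise; [2:] strips the '0b' prefix
Event['b'] = Event['EVENT_ID'].apply(lambda x: bin(x)[2:])
print(Event)
```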

3 Answers


You can also work with numpy, which is much faster than pandas.

Edit: faster numpy using view. A couple of tricks here:

  • Work only with the column of interest
  • Convert the underlying array to uint16, to ensure compatibility with any integer input
  • Byte-swap to get a proper high/low byte order (at least on my architecture)
  • Split the high and low bytes without actually moving any data, using view
  • Run unpackbits and reshape accordingly

My machine requires the byteswap to put the bytes of each uint16 in the proper place. Note that this approach requires the data to be int16/uint16, while the other one below would work for int64 as well.

import pandas as pd
import numpy as np

df = pd.DataFrame({'EVENT_ID': [ 4162, 4161, 4160, 4159,4158, 4157, 4156, 4155, 4154]}, dtype='uint16')

zz=np.unpackbits(df.EVENT_ID.values.astype('uint16').byteswap().view('uint8')).reshape(-1,16)
df3 = pd.concat([df,pd.DataFrame(zz)],axis=1)

print(f"{df3 =}")

df3 =   EVENT_ID  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
0      4162  0  0  0  1  0  0  0  0  0  1   0   0   0   0   1   0
1      4161  0  0  0  1  0  0  0  0  0  1   0   0   0   0   0   1
2      4160  0  0  0  1  0  0  0  0  0  1   0   0   0   0   0   0
3      4159  0  0  0  1  0  0  0  0  0  0   1   1   1   1   1   1
4      4158  0  0  0  1  0  0  0  0  0  0   1   1   1   1   1   0
5      4157  0  0  0  1  0  0  0  0  0  0   1   1   1   1   0   1
6      4156  0  0  0  1  0  0  0  0  0  0   1   1   1   1   0   0
7      4155  0  0  0  1  0  0  0  0  0  0   1   1   1   0   1   1
8      4154  0  0  0  1  0  0  0  0  0  0   1   1   1   0   1   0

Older proposed method:

lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1,8)
uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1,8)

df2 = pd.concat([df, pd.DataFrame(np.concatenate([uh,lh],axis=1),index=df.index)],axis=1)
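As a sanity check (my addition, not part of the original answer), the two numpy approaches can be verified to produce identical bit matrices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVENT_ID': [4162, 4161, 4160, 4159]}, dtype='uint16')

# view-based method: byteswap + reinterpret as uint8, then unpack all bits
zz = np.unpackbits(df.EVENT_ID.values.astype('uint16').byteswap().view('uint8')).reshape(-1, 16)

# low/high-byte method: mask and shift, then unpack each byte separately
lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1, 8)
uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1, 8)

# both should yield the same MSB-first 16-column bit matrix
assert np.array_equal(zz, np.concatenate([uh, lh], axis=1))
```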

Benchmark: numpy is orders of magnitude faster than pandas. For 1M points:

  • numpy view: 35ms for 1million uint64 points
  • numpy low/high: 50ms
  • pandas list bin: 1.78s
  • pandas apply format + list: 1.97s
  • pandas apply lambda: 6.08s

Setup:

df = pd.DataFrame({'EVENT_ID': (np.random.random(int(1e6))*65000).astype('uint16')})

pandas apply format list

In [13]: %timeit df2 = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:b}'.format).apply(list).tolist()))
1.97 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas list bin

In [10]: %%timeit
    ...: binary_values = pd.DataFrame([list(bin(x)[2:]) for x in df['EVENT_ID']])
    ...: df2 = df.join(binary_values)
    ...:
    ...:
1.78 s ± 53.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas apply lambda

In [5]: %%timeit
   ...: for i in range(16):
   ...:     df[f"bit{i}"] = df["EVENT_ID"].apply(lambda x: x & 1 << i).astype(bool).astype(int)
   ...:
6.08 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy

In [14]: %%timeit
    ...: lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1,8)
    ...: uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1,8)
    ...: df3=pd.concat([df, pd.DataFrame(np.concatenate([uh,lh],axis=1),index=df.index)],axis=1)
    ...:
    ...:
49.9 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit

Although my idea is there, this implementation is in fact much slower than all the other answers. Please see @ZaeroDivide's answer and comments below.

Original Answer

I don't think using the bin function and working with the str type is particularly efficient. Consider using bitmasks instead.

for i in range(16):
    df[f"bit{i}"] = df["EVENT_ID"].apply(lambda x: x & 1 << i).astype(bool).astype(int)

Testing with your data, I get the following results:

   EVENT_ID              B  bit0  bit1  bit2  bit3  bit4  bit5  bit6  bit7  \
0      4162  1000001000010     0     1     0     0     0     0     1     0   
1      4161  1000001000001     1     0     0     0     0     0     1     0   
2      4160  1000001000000     0     0     0     0     0     0     1     0   
3      4159  1000000111111     1     1     1     1     1     1     0     0   
4      4158  1000000111110     0     1     1     1     1     1     0     0   
5      4157  1000000111101     1     0     1     1     1     1     0     0   
6      4156  1000000111100     0     0     1     1     1     1     0     0   
7      4155  1000000111011     1     1     0     1     1     1     0     0   
8      4154  1000000111010     0     1     0     1     1     1     0     0   

   bit8  bit9  bit10  bit11  bit12  bit13  bit14  bit15  
0     0     0      0      0      1      0      0      0  
1     0     0      0      0      1      0      0      0  
2     0     0      0      0      1      0      0      0  
3     0     0      0      0      1      0      0      0  
4     0     0      0      0      1      0      0      0  
5     0     0      0      0      1      0      0      0  
6     0     0      0      0      1      0      0      0  
7     0     0      0      0      1      0      0      0  
8     0     0      0      0      1      0      0      0 

3 Comments

this implementation is one of the most inefficient: there are two nested loops, the for and the apply/lambda. It is 100x slower than numpy and 3x slower than the other pandas answers
My idea was to use a bitwise operation instead of string manipulation, but I was unable to locate the right function. I upvoted your answer.
Usually "low level" operations are available in numpy. Also, operating on the underlying numpy array is typically faster. The problem with your answer is that for + apply/lambda is typically the slowest implementation, even with "efficient" bitwise operations. Note that the other pandas answers using strings are actually faster. Strictly speaking, however, you are solving the problem :)
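For what it's worth, the bitwise idea from this answer can be vectorized directly in numpy without any for/apply loop, via broadcasting (a sketch, not from the original thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVENT_ID': [4162, 4161, 4160]})

# Broadcast each value against all 16 shift amounts at once:
# (n, 1) >> (16,) -> (n, 16), then mask out everything but the low bit
bits = (df['EVENT_ID'].values[:, None] >> np.arange(16)) & 1

cols = pd.DataFrame(bits, columns=[f"bit{i}" for i in range(16)], index=df.index)
df2 = df.join(cols)
```

This keeps the bit0..bit15 (LSB-first) column layout of this answer while doing all the shifting in one numpy expression.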

You can try turning the int Series to binary by applying '{0:b}'.format, then splitting the resulting list column into multiple columns with pd.DataFrame:

df = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:b}'.format).apply(list).tolist()))
print(df)

   EVENT_ID  0  1  2  3  4  5  6  7  8  9 10 11 12
0      4162  1  0  0  0  0  0  1  0  0  0  0  1  0
1      4161  1  0  0  0  0  0  1  0  0  0  0  0  1
2      4160  1  0  0  0  0  0  1  0  0  0  0  0  0
3      4159  1  0  0  0  0  0  0  1  1  1  1  1  1
4      4158  1  0  0  0  0  0  0  1  1  1  1  1  0
5      4157  1  0  0  0  0  0  0  1  1  1  1  0  1
6      4156  1  0  0  0  0  0  0  1  1  1  1  0  0
7      4155  1  0  0  0  0  0  0  1  1  1  0  1  1
8      4154  1  0  0  0  0  0  0  1  1  1  0  1  0
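Note that with this approach the number of columns depends on the largest value (13 here, not 16). If fixed-width 16 columns are needed as the question asks, zero-padding the format string should do it (my tweak to this answer's code, under that assumption):

```python
import pandas as pd

df = pd.DataFrame({'EVENT_ID': [4162, 4161, 4160]})

# '{0:016b}' zero-pads to 16 bits so every row yields exactly 16 columns
out = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:016b}'.format).apply(list).tolist()))
```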
