
I have a dataframe as below:

import pandas as pd
import numpy as np

df=pd.DataFrame({'id':[0,1,2,4,5],
                'A':[0,1,0,1,0],
                'B':[None,None,1,None,None]})
   id  A    B
0   0  0  NaN
1   1  1  NaN
2   2  0  1.0
3   4  1  NaN
4   5  0  NaN

Note that the vast majority of values in column B are NaN.

The id column should increment by 1, so the row between id 2 and id 4 is missing.
The row to insert should be identical to the previous row, except for the id column.

So the expected result is:

    id  A   B
0   0   0.0 NaN
1   1   1.0 NaN
2   2   0.0 1.0
3   3   0.0 1.0 <-add row here
4   4   1.0 NaN
5   5   0.0 NaN

I can do this for column A, but I don't know how to handle column B: ffill would also put 1.0 in the rows with id 4 and 5, which is incorrect.

step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()

EDIT:
Sorry, I forgot one situation.
Column B can contain different values.
When the DataFrame is as below:

    id  A    B
0    0  0  NaN
1    1  1  NaN
2    2  0  1.0
3    4  1  NaN
4    5  0  NaN
5    6  1  2.0
6    9  0  NaN
7   10  1  NaN

the result should be:

    id  A    B
0    0  0  NaN
1    1  1  NaN
2    2  0  1.0
3    3  0  1.0
4    4  1  NaN
5    5  0  NaN
6    6  1  2.0
7    7  1  2.0
8    8  1  2.0
9    9  0  NaN
10  10  1  NaN

3 Answers

4

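Make two changes: keep a copy of the original ids, then use update together with isin so that B is forward-filled only in the inserted rows.

s=df.id.copy() # change 1: remember the original ids
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()

# change 2: forward-fill B, mask out the rows whose id was already present,
# and write the remaining (inserted-row) values back into B
df.B.update(df.B.ffill().mask(df.id.isin(s)))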
df
   id    A    B
0   0  0.0  NaN
1   1  1.0  NaN
2   2  0.0  1.0
3   3  0.0  1.0
4   4  1.0  NaN
5   5  0.0  NaN
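
The same idea can also be written as a plain column assignment, sketched below on the data from the question's EDIT. This is a variant, not the answer's original code: the names original_ids, is_original and out are made up for illustration, and the reason for avoiding df.B.update(...) is an assumption that in-place updates through attribute access may not write back to the frame when pandas copy-on-write is active.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A': [0, 1, 0, 1, 0, 1, 0, 1],
                   'B': [np.nan, np.nan, 1, np.nan, np.nan, 2, np.nan, np.nan]})

# Remember which ids existed before the gaps are filled.
original_ids = df['id'].copy()

# Reindex on the full id range; inserted rows start out as all-NaN.
idx = np.arange(df['id'].min(), df['id'].max() + 1)
out = df.set_index('id').reindex(idx).reset_index()

is_original = out['id'].isin(original_ids)

# A has no NaN in the original rows, so a plain forward fill is safe here.
out['A'] = out['A'].ffill().astype(int)

# Keep B as-is in the original rows; use the forward-filled value elsewhere.
out['B'] = out['B'].where(is_original, out['B'].ffill())

print(out)
```

Because B is taken from the forward-filled series only where is_original is False, the NaN values at ids 4, 5, 9 and 10 stay NaN.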

1

If I understand correctly, here is some sample code.

new_df = pd.DataFrame({
    'new_id': [i for i in range(df['id'].max() + 1)],
})

df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')

df = df.ffill()

df = df.drop(columns='id')

df
    A   B   new_id
0   0.0 NaN 0
1   1.0 NaN 1
2   0.0 1.0 2
5   0.0 1.0 3
3   1.0 1.0 4
4   0.0 1.0 5
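
Note that the blanket ffill above also fills B for ids 4 and 5, which the question treats as incorrect. A possible adjustment, sketched here under the assumption that only the rows created by the merge should be filled, is to let merge flag those rows with indicator=True. The names full, merged, is_new and filled are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [np.nan, np.nan, 1, np.nan, np.nan]})

# Every id in the range, whether or not it exists in df.
full = pd.DataFrame({'id': np.arange(df['id'].min(), df['id'].max() + 1)})

# indicator=True adds a '_merge' column; 'left_only' marks ids missing from df.
merged = full.merge(df, on='id', how='left', indicator=True)
is_new = merged['_merge'].eq('left_only')

# Forward-fill a copy, then copy the filled values into the new rows only.
filled = merged[['A', 'B']].ffill()
merged.loc[is_new, ['A', 'B']] = filled.loc[is_new, ['A', 'B']]

merged = merged.drop(columns='_merge')
print(merged)
```

With this change the row for id 3 copies A and B from id 2, while B stays NaN for ids 4 and 5.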


1

Try this

df=pd.DataFrame({'id':[0,1,2,4,5],
                'A':[0,1,0,1,0],
                'B':[None,None,1,None,None]})


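# ids absent from the frame, in ascending order so that each new row can copy
# the row just before it (including a row added earlier in this loop)
missingid = sorted(set(range(df.id.min(),df.id.max())) - set(df.id.tolist()))
for i in missingid:
    # append a row [i, A, B] where A and B come from the row with id i-1
    df.loc[len(df)] = np.concatenate((np.array([i]),df[df.id==i-1][["A","B"]].values[0]))

df=df.sort_values("id").reset_index(drop=True)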

output

    id    A    B
0  0.0  0.0  NaN
1  1.0  1.0  NaN
2  2.0  0.0  1.0
3  3.0  0.0  1.0
4  4.0  1.0  NaN
5  5.0  0.0  NaN
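
As the output shows, the loop turns id and A into floats. A loop-free sketch that keeps the integer dtypes is to repeat each row once per id it has to cover and then offset the id of each copy. This is a different technique from the loop above, and the names gaps and out are illustrative; it assumes the ids are unique integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A': [0, 1, 0, 1, 0, 1, 0, 1],
                   'B': [np.nan, np.nan, 1, np.nan, np.nan, 2, np.nan, np.nan]})

df = df.sort_values('id').reset_index(drop=True)

# How many ids each row must cover: the distance to the next id
# (the last row has no successor, so it covers just itself).
gaps = df['id'].diff().shift(-1).fillna(1).astype(int)

# Repeat each row that many times; copies share the original index label.
out = df.loc[df.index.repeat(gaps.to_numpy())].copy()

# Within each group of copies, add 0, 1, 2, ... to the id.
out['id'] = out['id'].to_numpy() + out.groupby(level=0).cumcount().to_numpy()

out = out.reset_index(drop=True)
print(out)
```

Since every inserted row is a literal copy of the preceding row, no forward fill is needed and the NaN values in B are left untouched elsewhere.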
