1

I have data with a large number of columns:

df:
ID   col1   col2   col3 ... col100
1    a      x      0        1  
1    a      x      1        1
2    a      y      1        1
4    a      z      1        0  
...
98   a      z      1        1
100  a      x      1        0

I want to fill in the missing ID values with a default value that indicate that the data is missing here. For example here it would be ID 3 and hypothetically speaking lets say the missing row data looks like ID 100

ID   col1   col2  col3 ... col100
3    a      x     1        0
99   a      x     1        0

Expected output:

df:
ID   col1   col2   col3 ... col100
1    a      x      0        1  
1    a      x      1        1
2    a      y      1        1
3    a      x      1        0
4    a      z      1        0  
...
98   a      z      1        1
99   a      x     1        0
100  a      x      1        0

I'm also ok with the 3 and 99 being at the bottom.

I have tried several ways of appending new rows:

noresponse = df[filterfornoresponse].head(1).copy() #assume that this will net us row 100

for i in range (1, maxID):
    if len(df[df['ID'] == i) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index = True)

This method doesn't seem to append anything.

I have also tried

pd.concat([df, temp], ignore_index = True)

instead of df.append

I have also tried adding the rows to a list noresponserows with the intention of concating the list with df:

noresponserows = []

for i in range (1, maxID):
    if len(df[df['ID'] == i) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)

But here the list always ends up with only 1 row when in my data I know there are more than one rows that need to be appended.

I'm not sure why I am having trouble appending more than once instance of noresponse into the list, and why I can't directly append to a dataframe. I feel like I am missing something here.

I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason why I take a copy of a row to get noresponse is because there are a large amount of columns so it is easier to just take an existing row.

5
  • What do you mean by "missing ID value"? Commented Dec 18, 2021 at 23:41
  • You need to do df.append(..., inplace=True) or df = df.append(...) because df.append() returns a new dataframe by default. Commented Dec 18, 2021 at 23:42
  • 1
    Will you please provide a sample dataframe containing your expected output? Commented Dec 18, 2021 at 23:44
  • added expected output @richardec, what I mean is that I want to have Ids span all contiguous values from 1 to 100 so here 3 and 99 are missing and I want to add them Commented Dec 18, 2021 at 23:52
  • See my answer, @SemicolonExpected. Commented Dec 18, 2021 at 23:53

1 Answer 1

1

Say you have a dataframe like this:

>>> df
  col1 col2  col100  ID
0    a    x       0   1
1    a    y       3   2
2    a    z       1   4

First, set the ID column to be the index:

>>> df = df.set_index('ID')
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
4     a    z       1

Now you can use df.loc to easily add rows.

Let's select the last row as the default row:

>>> default_row = df.iloc[-1]
>>> default_row
col1      a
col2      z
col100    1
Name: 4, dtype: object

We can add it right into the dataframe at ID 3:

>>> df.loc[3] = default_row
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
4     a    z       1
3     a    z       1

Then use sort_index to sort the rows lexicographically by index:

>>> df = df.sort_index()
>>> df
   col1 col2  col100
ID                  
1     a    x       0
2     a    y       3
3     a    z       1
4     a    z       1

And, optionally, reset the index:

>>> df = df.reset_index()
>>> df
   ID col1 col2  col100
0   1    a    x       0
1   2    a    y       3
2   3    a    z       1
3   4    a    z       1

Sign up to request clarification or add additional context in comments.

2 Comments

I dont think I can use ID as an index because I have duplicate values, in fact it is expected that there are multiple instances of each "ID". For example in my example I have two instances of 1
Yes, you can use the index in that case. It's totally valid for the index to have multiple values. I know it's counter-intuitive, but it's the same thing for columns.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.