
In my workflow there are multiple CSVs with four columns: OID, Value, Count, and unique_id. I am trying to figure out how to generate incremental values in the unique_id column. Using apply(), I can do something like df.apply(lambda x : x + 1) (where x = 0), which sets every value in unique_id to 1. However, I am confused about how to use apply() to generate incrementing values row by row for a specific column.

# Current Dataframe 
   OID  Value  Count  unique_id
0   -1      1      5          0
1   -1      2     46          0
2   -1      3     32          0
3   -1      4      3          0
4   -1      5     17          0

# Trying to accomplish
   OID  Value  Count  unique_id
0   -1      1      5          0
1   -1      2     46          1
2   -1      3     32          2
3   -1      4      3          3
4   -1      5     17          4

Sample code (I understand the syntax is incorrect, but it approximates what I am trying to accomplish):

def numbers():
    for index, row in RG_Res_df.iterrows():
        return index

RG_Res_df = RG_Res_df['unique_id'].apply(numbers)
Comments:

  • you can just do df['unique_id'] = np.arange(df.shape[0]) – Commented Mar 2, 2017 at 16:42

1 Answer


Don't loop. You can directly assign a NumPy array to generate the ids, here using np.arange and passing the number of rows, which is df.shape[0]:

In [113]:
df['unique_id'] = np.arange(df.shape[0])
df

Out[113]:
   OID  Value  Count  unique_id
0   -1      1      5          0
1   -1      2     46          1
2   -1      3     32          2
3   -1      4      3          3
4   -1      5     17          4

Or use a pure pandas method with RangeIndex; the default start is 0, so we only need to pass stop=df.shape[0]:

In [114]:
df['unique_id'] = pd.RangeIndex(stop=df.shape[0])
df

Out[114]:
   OID  Value  Count  unique_id
0   -1      1      5          0
1   -1      2     46          1
2   -1      3     32          2
3   -1      4      3          3
4   -1      5     17          4
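Since the question mentions multiple CSVs, either assignment can simply be repeated per file. A minimal sketch, using in-memory StringIO buffers as stand-ins for the real files on disk (the data here is hypothetical):

```python
import numpy as np
import pandas as pd
from io import StringIO

# Two in-memory "files" standing in for CSVs on disk.
csv_files = [
    StringIO("OID,Value,Count,unique_id\n-1,1,5,0\n-1,2,46,0\n"),
    StringIO("OID,Value,Count,unique_id\n-1,3,32,0\n-1,4,3,0\n-1,5,17,0\n"),
]

frames = []
for f in csv_files:
    df = pd.read_csv(f)
    # Overwrite the placeholder column with 0..n-1 for this file.
    df['unique_id'] = np.arange(df.shape[0])
    frames.append(df)

print(frames[1]['unique_id'].tolist())  # [0, 1, 2]
```

With real files you would replace the StringIO list with a glob over the CSV paths; the per-frame assignment is unchanged.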

5 Comments

This worked beautifully. Are NumPy functions preferred over pandas, or are they pretty comparable? Also, df['unique_id'] = pd.RangeIndex(stop=df.shape[0]) gives me AttributeError: 'module' object has no attribute 'RangeIndex'. Any idea? I was able to iterate using its index earlier.
You may need to add import pandas as pd. Also, generally there isn't much difference, but NumPy methods will be faster, so they should be preferred where they do what you want.
I found the problem: I am using an older version of pandas at work. Also, could you point out why the following np.arange syntax, df['unique_id'] = np.arange(57), throws this error: ValueError: Length of values does not match length of index?
Well, the error is telling you the lengths are different. What you tried makes an array from 0 to 56, i.e. 57 values; does your DataFrame have 57 rows?
Sort of. I just realized that I only need to generate the unique values on a subset of rows, from 0 to 56. I assumed that since np.arange works much like Python's range, providing a stop value (i.e. 57) would generate values only for those rows, with start=0 by default. Sorry for the confusion!
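To resolve the point in that last comment: the ValueError disappears once the selection being assigned to has the same length as the array. A minimal sketch, using a hypothetical 100-row frame where only rows 0 through 56 should receive incremental ids:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 100 rows, unique_id initialised to 0.
df = pd.DataFrame({'Value': range(100), 'unique_id': 0})

# .loc label slices are end-inclusive, so :56 selects 57 rows,
# matching the 57 values produced by np.arange(57).
df.loc[:56, 'unique_id'] = np.arange(57)
```

Rows 57 onward keep their original value of 0, while rows 0 to 56 get 0 through 56.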
