Dataframe indexing with for loop

Question

This is a follow-up to my question from yesterday. I have a DataFrame created from a CSV file, and I am trying to compare each value with the next one. If they are the same, I do one thing; otherwise, I do another. I am running into an out-of-range error and was hoping to find a workaround.

CSV:

date    fruit   quantity
4/5/2014 13:34  Apples  73
4/5/2014 3:41   Cherries    85
4/6/2014 12:46  Pears   14
4/8/2014 8:59   Oranges 52
4/10/2014 2:07  Apples  152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40  Strawberries    98

Expected output CSV (backup CSV):

date    fruit   quantity fruitid 
4/5/2014 13:34  Apples  73 fruit0
4/5/2014 3:41   Cherries    85 fruit1
4/6/2014 12:46  Pears   14 fruit2
4/8/2014 8:59   Oranges 52 fruit3
4/10/2014 2:07  Apples  152 fruit0
4/10/2014 18:10 Bananas 23 fruit4
4/10/2014 2:40  Strawberries    98 fruit5

Final CSV:

date    fruitid quantity
4/5/2014 13:34  fruit0  73
4/5/2014 3:41   fruit1  85
4/6/2014 12:46  fruit2  14
4/8/2014 8:59   fruit3  52
4/10/2014 2:07  fruit0  152
4/10/2014 18:10 fruit4  23
4/10/2014 2:40  fruit5  98

Code:

import pandas as pd
import numpy
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
df.sort_values(['fruit'], ascending=True, inplace=True) #sorting the column 
#fruit
df.reset_index(drop=True, inplace=True)
#print(df)
x = 0 #starting my counter values or position in the column
#old_fruit = df.fruit[x]
#new_fruit = df.fruit[x+1]
df.loc[:,'NewCol'] = 0 # to create the new column
print(df)
for x in range(0, len(df)):
        old_fruit = df.fruit[x] #Starting fruit
        new_fruit = old_fruit[x+1] #next fruit to compare with
        if old_fruit == new_fruit:
                #print(x)
                #print(old_fruit, new_fruit)
                df.NewCol[x] = 'fruit' + str(x) #if they are the same, put 
                #fruit[x] or fruit0 in the current row

        else:
                print("Not the Same")
                #print(x)
                #print(old_fruit, new_fruit)
                df.NewCol[x+1] = 'fruit' +str(x+1) #if they are the same, 
                #put fruit[x+1] or fruit1 in the current row
print(df)

Can you explain what you're trying to do and post some expected output?
– Andrew L, Commented Jun 9, 2017 at 19:27
This is what I would like the output CSV to eventually look like, before I completely replace the fruit column with the NewCol. Eventually, I will do the same process with large amounts of proxy log data.
date    fruit   quantity NewCol
4/5/2014 13:34  Apples  73 fruit0
4/5/2014 3:41   Cherries    85 fruit1
4/6/2014 12:46  Pears   14 fruit2
4/8/2014 8:59   Oranges 52 fruit3
4/10/2014 2:07  Apples  152 fruit0
4/10/2014 18:10 Bananas 23 fruit4
4/10/2014 2:40  Strawberries    98 fruit5
– Travis Cowart, Commented Jun 9, 2017 at 19:30

piRSquared · Accepted Answer · 2017-06-09 20:55:33Z

4

New Answer

Use factorize

import numpy as np

df.assign(
    NewCol=np.char.add('Fruit', df.fruit.factorize()[0].astype(str))
)

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5

Not one line, but better

f, u = pd.factorize(df.fruit.values)
n = np.char.add('Fruit', f.astype(str))
df.assign(NewCol=n)

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5

Same Answer but updating df

f, u = pd.factorize(df.fruit.values)
n = np.char.add('Fruit', f.astype(str))
df = df.assign(NewCol=n)
# Equivalent to
# df['NewCol'] = n
df

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5
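
For reference, here is a minimal end-to-end sketch of the factorize approach applied to the sample rows from the question. It is only an illustration under a few assumptions: the data is inlined as comma-separated text rather than read from the real file, the lowercase `fruit` prefix and the `fruitid` column name are taken from the expected output in the question, and `backup` and `final` are hypothetical variable names.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, inlined so the sketch runs without a file.
# The comma-separated layout is an assumption about the real CSV.
csv_text = """date,fruit,quantity
4/5/2014 13:34,Apples,73
4/5/2014 3:41,Cherries,85
4/6/2014 12:46,Pears,14
4/8/2014 8:59,Oranges,52
4/10/2014 2:07,Apples,152
4/10/2014 18:10,Bananas,23
4/10/2014 2:40,Strawberries,98
"""
df = pd.read_csv(io.StringIO(csv_text))

# factorize numbers each distinct value in order of first appearance,
# so no sorting and no loop are needed
codes, uniques = pd.factorize(df["fruit"])

# Backup frame: the original columns plus the masked id
backup = df.assign(fruitid=np.char.add("fruit", codes.astype(str)))

# Final frame: the real name is dropped and only the id remains
final = backup[["date", "fruitid", "quantity"]]

print(final)
```

Note that `assign` returns a new DataFrame and leaves `df` untouched, so the result has to be bound to a name (or written with `df['fruitid'] = ...`) before it shows up when printing.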

Old Answer

@SeaMonkey nailed the reason why you were seeing the error.

However, I'm guessing at what you were trying to do.
I added cumcount to fruit

df.assign(NewCol=df.fruit + df.groupby('fruit').cumcount().astype(str))

              date         fruit  quantity         NewCol
0   4/5/2014 13:34        Apples        73        Apples0
1    4/5/2014 3:41      Cherries        85      Cherries0
2   4/6/2014 12:46         Pears        14         Pears0
3    4/8/2014 8:59       Oranges        52       Oranges0
4   4/10/2014 2:07        Apples       152        Apples1
5  4/10/2014 18:10       Bananas        23       Bananas0
6   4/10/2014 2:40  Strawberries        98  Strawberries0
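
To make the difference between the two answers concrete, the following small sketch puts `cumcount` and `factorize` side by side on made-up values. The variable names are illustrative only.

```python
import pandas as pd

fruit = pd.Series(["Apples", "Cherries", "Apples", "Apples"], name="fruit")

# cumcount numbers the rows inside each group: the 1st, 2nd, 3rd "Apples"
within_group = fruit.groupby(fruit).cumcount()

# factorize numbers the groups themselves: every "Apples" gets the same code
group_code = pd.factorize(fruit)[0]

print(within_group.tolist())
print(group_code.tolist())
```

So `cumcount` distinguishes repeated rows of the same fruit, while `factorize` gives every occurrence of a fruit the same number, which is what masking a name needs.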

edited Jun 9, 2017 at 20:55

answered Jun 9, 2017 at 19:43

piRSquared


8 Comments

Travis Cowart Over a year ago

This looks amazing. However, I am trying to mask the actual fruit name. So in my proxy log, it will be something like bobsmith and i want it to be something like user1, and johnwayne would be user2. Pseudo encryption I guess?

piRSquared Over a year ago

@TravisCowart this is where it becomes very useful to include your expected results in your question. Edit your question and include that and we'll get you exactly what you need.

Travis Cowart Over a year ago

Thank you. I have updated my question with expected results. Hopefully it is a bit more clear on the 2 additional CSVs I hope to output.

Travis Cowart Over a year ago

The new factorize answer seems like it would replace any need for a for loop. So would that just go after df.reset... and remove all of my existing for loop? I gave it a shot, but printing the resulting df only gives me the sorted df.

SeaMonkey Over a year ago

Hat off to you @piRSquared, that is a very elegant solution!


SeaMonkey · Accepted Answer · 2017-06-09 20:13:19Z

2

I think your for-loop is going one index too far. Try:

for x in range(0, len(df)-1):

instead.

Edit: it makes sense that:

new_fruit = old_fruit[x+1]

does not give the expected result: old_fruit is not a list but a string, so old_fruit[x+1] picks a single character out of that string instead of the next row. I think what you want is this:

new_fruit = df.fruit[x+1]
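
A small sketch of the difference, using a made-up three-row Series rather than the real data:

```python
import pandas as pd

fruit = pd.Series(["Apples", "Apples", "Bananas"])  # illustrative data
x = 0

old_fruit = fruit[x]           # "Apples": one cell, a plain string
next_char = old_fruit[x + 1]   # "p": second character of that string
next_fruit = fruit[x + 1]      # "Apples": the value in the next row

try:
    old_fruit[len(old_fruit)]  # one position past the last character
except IndexError as exc:
    error_message = str(exc)   # the same error as in the traceback

print(next_char, next_fruit, error_message)
```

As soon as x+1 reaches the length of the fruit name, indexing the string fails with "string index out of range", even though the DataFrame still has rows left.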

Edit (2):

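You should also add df.NewCol[x+1] = 'fruit' + str(x) to the branch where the two values are the same, so that the second row of a matching pair gets the same id.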

My working script is:

    import pandas as pd

    df = pd.read_csv('data.csv', header=0, dtype='unicode')
    # sort by fruit so that equal names end up on adjacent rows
    df.sort_values(['fruit'], ascending=True, inplace=True)
    df.reset_index(drop=True, inplace=True)
    df['NewCol'] = '0'  # create the new column with a string placeholder
    print(df)
    # stop one row early: each pass compares row x with row x+1
    for x in range(0, len(df) - 1):
        old_fruit = df.fruit[x]      # current fruit
        new_fruit = df.fruit[x + 1]  # next fruit to compare with
        if old_fruit == new_fruit:
            # same fruit: both rows share the id of the current row
            df.loc[x, 'NewCol'] = 'fruit' + str(x)
            df.loc[x + 1, 'NewCol'] = 'fruit' + str(x)
        else:
            print("Not the Same")
            # different fruit: the next row gets its own id
            df.loc[x + 1, 'NewCol'] = 'fruit' + str(x + 1)
    print(df)

edited Jun 9, 2017 at 20:13

answered Jun 9, 2017 at 19:17

SeaMonkey


4 Comments

Travis Cowart Over a year ago

I think that is getting closer, but now I still have these errors: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame & Traceback (most recent call last): File "C:/Python36/csvtester3.py", line 18, in <module> new_fruit = old_fruit[x+1] #next fruit to compare with IndexError: string index out of range

Travis Cowart Over a year ago

That got me very close to what I need by changing new_fruit =df.fruit[x+1]. The only issue now is that the 2nd Apple row shows a 0 for the NewCol value, instead of fruit0 which I wanted.

Travis Cowart Over a year ago

I added the edits into my code and it is almost perfect. It skips fruit1 for whatever reason. It does capture the 2nd instance of Apple and adds fruit0 to the row.

SeaMonkey Over a year ago

I suggest you follow the instructions of @piRSquared, the method I propose is not generic enough, you will need an additional loop to remove all the instances where it skipped a number
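
If a loop is still preferred over factorize, one possible sketch that avoids both the sorting and the skipped numbers is to remember each name in a dictionary the first time it appears. This is an illustration on the fruit column from the question, not the method from either answer. `ids` and `labels` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["Apples", "Cherries", "Pears", "Oranges",
              "Apples", "Bananas", "Strawberries"],
})

ids = {}      # maps each name to the id it received on first appearance
labels = []   # one id per row, in row order
for name in df["fruit"]:
    if name not in ids:
        # a new name takes the next free number: 0, 1, 2, ...
        ids[name] = "fruit" + str(len(ids))
    labels.append(ids[name])

df["NewCol"] = labels
print(df)
```

Because the number comes from the size of the dictionary rather than from the row position, repeated names reuse their id and no number is skipped.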
