This is a follow-up to the question I asked yesterday. I have a DataFrame created from a CSV file, and I am trying to compare each value with the next one. If they are the same, I do one thing; otherwise, I do another. I am hitting an out-of-range error and was hoping to find a workaround.

CSV:

date    fruit   quantity
4/5/2014 13:34  Apples  73
4/5/2014 3:41   Cherries    85
4/6/2014 12:46  Pears   14
4/8/2014 8:59   Oranges 52
4/10/2014 2:07  Apples  152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40  Strawberries    98

Expected output CSV (backup CSV):

date    fruit   quantity fruitid 
4/5/2014 13:34  Apples  73 fruit0
4/5/2014 3:41   Cherries    85 fruit1
4/6/2014 12:46  Pears   14 fruit2
4/8/2014 8:59   Oranges 52 fruit3
4/10/2014 2:07  Apples  152 fruit0
4/10/2014 18:10 Bananas 23 fruit4
4/10/2014 2:40  Strawberries    98 fruit5

Final CSV:

date    fruitid quantity
4/5/2014 13:34  fruit0  73
4/5/2014 3:41   fruit1  85
4/6/2014 12:46  fruit2  14
4/8/2014 8:59   fruit3  52
4/10/2014 2:07  fruit0  152
4/10/2014 18:10 fruit4  23
4/10/2014 2:40  fruit5  98

Code:

import pandas as pd
import numpy
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
df.sort_values(['fruit'], ascending=True, inplace=True) #sorting the column 
#fruit
df.reset_index(drop=True, inplace=True)
#print(df)
x = 0 #starting my counter values or position in the column
#old_fruit = df.fruit[x]
#new_fruit = df.fruit[x+1]
df.loc[:,'NewCol'] = 0 # to create the new column
print(df)
for x in range(0, len(df)):
        old_fruit = df.fruit[x] #Starting fruit
        new_fruit = old_fruit[x+1] #next fruit to compare with
        if old_fruit == new_fruit:
                #print(x)
                #print(old_fruit, new_fruit)
                df.NewCol[x] = 'fruit' + str(x) #if they are the same, put 
                #fruit[x] or fruit0 in the current row

        else:
                print("Not the Same")
                #print(x)
                #print(old_fruit, new_fruit)
                df.NewCol[x+1] = 'fruit' +str(x+1) #if they are different, 
                #put fruit[x+1] or fruit1 in the next row
print(df)
  • Can you explain what you're trying to do and post some expected output? Commented Jun 9, 2017 at 19:27
  • This is what I would like the output CSV to eventually look like, before I completely replace the fruit column with the Newcol. Eventually, I will do the same process with large amounts of proxy log data. date fruit quantity NewCol 4/5/2014 13:34 Apples 73 fruit0 4/5/2014 3:41 Cherries 85 fruit1 4/6/2014 12:46 Pears 14 fruit2 4/8/2014 8:59 Oranges 52 fruit3 4/10/2014 2:07 Apples 152 fruit0 4/10/2014 18:10 Bananas 23 fruit4 4/10/2014 2:40 Strawberries 98 fruit5 Commented Jun 9, 2017 at 19:30

2 Answers

New Answer

Use factorize

import numpy as np

df.assign(
    NewCol=np.core.defchararray.add('Fruit', df.fruit.factorize()[0].astype(str))
)

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5

Not One line, but better

f, u = pd.factorize(df.fruit.values)
n = np.core.defchararray.add('Fruit', f.astype(str))
df.assign(NewCol=n)

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5

Same Answer but updating df

f, u = pd.factorize(df.fruit.values)
n = np.core.defchararray.add('Fruit', f.astype(str))
df = df.assign(NewCol=n)
# Equivalent to
# df['NewCol'] = n
df

              date         fruit  quantity  NewCol
0   4/5/2014 13:34        Apples        73  Fruit0
1    4/5/2014 3:41      Cherries        85  Fruit1
2   4/6/2014 12:46         Pears        14  Fruit2
3    4/8/2014 8:59       Oranges        52  Fruit3
4   4/10/2014 2:07        Apples       152  Fruit0
5  4/10/2014 18:10       Bananas        23  Fruit4
6   4/10/2014 2:40  Strawberries        98  Fruit5
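
For reference, here is a minimal end-to-end sketch of the same factorize idea applied to the question's sample data. It assumes a comma-separated copy of the data read from an in-memory string, uses a lowercase 'fruit' prefix and a column named fruitid to match the expected output, and skips the sort step so that ids follow the original row order. The output file names in the comments are hypothetical.

```python
import io
import pandas as pd

# Sample rows copied from the question; a real script would call pd.read_csv on a file.
csv_text = """date,fruit,quantity
4/5/2014 13:34,Apples,73
4/5/2014 3:41,Cherries,85
4/6/2014 12:46,Pears,14
4/8/2014 8:59,Oranges,52
4/10/2014 2:07,Apples,152
4/10/2014 18:10,Bananas,23
4/10/2014 2:40,Strawberries,98
"""
df = pd.read_csv(io.StringIO(csv_text))

# factorize returns one integer code per row, numbered by order of first appearance.
codes, uniques = pd.factorize(df["fruit"])

# Backup frame: the original columns plus the masked id.
backup = df.assign(fruitid=["fruit" + str(code) for code in codes])

# Final frame: the real fruit name is dropped and only the id remains.
final = backup[["date", "fruitid", "quantity"]]

# Hypothetical output file names:
# backup.to_csv("backup.csv", index=False)
# final.to_csv("final.csv", index=False)
```

Note that assign returns a new DataFrame, so the result has to be bound to a name (as with backup above) before printing or saving it.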

Old Answer

@SeaMonkey nailed the reason why you were seeing the error.

However, I'm guessing at what you were trying to do.
I added cumcount to fruit

df.assign(NewCol=df.fruit + df.groupby('fruit').cumcount().astype(str))

              date         fruit  quantity         NewCol
0   4/5/2014 13:34        Apples        73        Apples0
1    4/5/2014 3:41      Cherries        85      Cherries0
2   4/6/2014 12:46         Pears        14         Pears0
3    4/8/2014 8:59       Oranges        52       Oranges0
4   4/10/2014 2:07        Apples       152        Apples1
5  4/10/2014 18:10       Bananas        23       Bananas0
6   4/10/2014 2:40  Strawberries        98  Strawberries0
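
To make the difference between the two answers concrete, the sketch below compares cumcount with a per-group number on a small frame built from the question's fruit column. It assumes that groupby(..., sort=False).ngroup() is an acceptable stand-in for factorize here; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["Apples", "Cherries", "Pears", "Oranges",
              "Apples", "Bananas", "Strawberries"],
})

# cumcount numbers the rows inside each group: the second Apples row gets 1.
row_within_group = df.groupby("fruit").cumcount()

# ngroup numbers the groups themselves; with sort=False the numbering
# follows the order in which each fruit first appears.
group_number = df.groupby("fruit", sort=False).ngroup()

print(row_within_group.tolist())  # [0, 0, 0, 0, 1, 0, 0]
print(group_number.tolist())      # [0, 1, 2, 3, 0, 4, 5]
```

The first list is what the old answer appends to the fruit name; the second matches the ids in the expected output.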

8 Comments

This looks amazing. However, I am trying to mask the actual fruit name. So in my proxy log, it will be something like bobsmith and i want it to be something like user1, and johnwayne would be user2. Pseudo encryption I guess?
@TravisCowart this is where it becomes very useful to include your expected results in your question. Edit your question and include that and we'll get you exactly what you need.
Thank you. I have updated my question with expected results. Hopefully it is a bit more clear on the 2 additional CSVs I hope to output.
The new factorize answer seems like it would replace any need for a for loop. So would that just go after df.reset... and remove all of my existing for loop? I gave it a shot, but printing the resulting df only gives me the sorted df.
Hat off to you @piRSquared, that is a very elegant solution!

I think your for loop is going one index too far.

try:

for x in range(0, len(df)-1):

instead

Edit: it makes sense that:

new_fruit = old_fruit[x+1]

does not give the expected result: old_fruit is a string, not a list, so indexing it returns a single character. I think what you want is this:

new_fruit = df.fruit[x+1]
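
As a small illustration of the point about strings, the sketch below walks the sorted fruit names from the question as a plain list and reports the first position where indexing into the string (rather than into the column) runs out of characters. The helper name first_failing_index is made up for this example.

```python
# Fruit column after sorting, using the values from the question.
fruits = ["Apples", "Apples", "Bananas", "Cherries",
          "Oranges", "Pears", "Strawberries"]

def first_failing_index(values):
    """Return the first x where values[x][x + 1] raises IndexError, else None."""
    for x, old_fruit in enumerate(values):
        try:
            old_fruit[x + 1]  # indexes into the string, not into the column
        except IndexError:
            return x
    return None

print(fruits[0][1])               # p  (a character of "Apples", not the next fruit)
failing = first_failing_index(fruits)
print(failing, fruits[failing])   # 5 Pears
```

"Pears" has only five characters, so asking for index 6 raises "string index out of range", which matches the traceback reported for the original line.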

Edit (2):

You should also add the following line to the branch where the two fruits match: df.NewCol[x+1] = 'fruit' + str(x)

My working script is:

    import pandas as pd
    import numpy
    df = pd.read_csv('data.csv', header=0, dtype='unicode')
    df_count = df['fruit'].value_counts()
    df.sort_values(['fruit'], ascending=True, inplace=True) #sorting the column 
    #fruit
    df.reset_index(drop=True, inplace=True)
    #print(df)
    x = 0 #starting my counter values or position in the column
    #old_fruit = df.fruit[x]
    #new_fruit = df.fruit[x+1]
    df.loc[:,'NewCol'] = 0 # to create the new column
    print(df)
    for x in range(0, len(df)-1):
            old_fruit = df.fruit[x] #Starting fruit
            new_fruit = df.fruit[x+1] #next fruit to compare with
            if old_fruit == new_fruit:
                    #print(x)
                    #print(old_fruit, new_fruit)
                    df.NewCol[x] = 'fruit' + str(x)
                    df.NewCol[x+1] = 'fruit' + str(x)#if they are the same, put 
                    #fruit[x] or fruit0 in the current row

            else:
                    print("Not the Same")
                    #print(x)
                    #print(old_fruit, new_fruit)
                    df.NewCol[x+1] = 'fruit' +str(x+1) #if they are different, 
                    #put fruit[x+1] or fruit1 in the next row
    print(df)
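
If a plain loop is still preferred, one possible variant that does not skip numbers is to remember each fruit in a dictionary instead of comparing neighbouring rows. This is only a sketch, not the script above: the function name label_in_order and its prefix parameter are hypothetical, and it works on the unsorted column.

```python
def label_in_order(values, prefix="fruit"):
    """Give each distinct value an id numbered by order of first appearance."""
    seen = {}    # value -> id already handed out
    labels = []
    for value in values:
        if value not in seen:
            # The next free number is simply how many distinct values we have seen.
            seen[value] = prefix + str(len(seen))
        labels.append(seen[value])
    return labels

fruits = ["Apples", "Cherries", "Pears", "Oranges",
          "Apples", "Bananas", "Strawberries"]
print(label_in_order(fruits))
# ['fruit0', 'fruit1', 'fruit2', 'fruit3', 'fruit0', 'fruit4', 'fruit5']
```

With a DataFrame, the result could be attached with something like df['NewCol'] = label_in_order(df['fruit']), which also avoids the chained assignment that triggers SettingWithCopyWarning.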

4 Comments

I think that is getting closer, but now I still have these errors: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame & Traceback (most recent call last): File "C:/Python36/csvtester3.py", line 18, in <module> new_fruit = old_fruit[x+1] #next fruit to compare with IndexError: string index out of range
That got me very close to what I need by changing new_fruit =df.fruit[x+1]. The only issue now is that the 2nd Apple row shows a 0 for the NewCol value, instead of fruit0 which I wanted.
I added the edits into my code and it is almost perfect. It skips fruit1 for whatever reason. It does capture the 2nd instance of Apple and adds fruit0 to the row.
I suggest you follow the instructions of @piRSquared, the method I propose is not generic enough, you will need an additional loop to remove all the instances where it skipped a number
