Here is a dataframe with sample data:

import pandas as pd

df = pd.DataFrame({'KEY': ['1','2','3'], 'RECORD': ['1','1','1'], 'SERIAL': ['1470','2321','300'], 'REMARKS': ['FRUIT[APPLES,ORANGES,PEARS] IS HEALTHY FOR YOU','I LIKE FRUIT[BANANAS,CHERRIES,GRAPES], BUT I DON\'T LIKE FRUIT[CANTALOPE,HONEYDEW]', 'THERE IS FRUIT[LEMONS,ORANGES,GRAPEFRUIT] @ 1234']})

I need to extract each fruit into a new dataframe, keeping it associated with its KEY, RECORD, and SERIAL. It should look like this when finished:

df = pd.DataFrame({'KEY': ['1','1','1','2','2','2','2','2','3','3','3'], 'RECORD': ['1','1','1','1','1','1','1','1','1','1','1'], 'SERIAL': ['1470','1470','1470','2321','2321','2321','2321','2321','300','300','300'], 'FRUIT': ['APPLES','ORANGES','PEARS','BANANAS','CHERRIES','GRAPES','CANTALOPE','HONEYDEW','LEMONS','ORANGES','GRAPEFRUIT'], 'CODE': ['null','null','null','null','null','null','null','null','1234','1234','1234']})

From the research I've done, it looks like I could use the str.split and/or str.extract, but I'm not sure how to match up each fruit with the KEY, RECORD, and SERIAL. On top of that, the last record has "@ 1234". That information needs to also be extracted and matched up with the 3 fruits listed before it.

I'm guessing the first step in this process is to extract out the fruit, which should be easy because they are all in a series in the string.

Any recommendations on how to tackle this?

Thanks!

1 Answer

Try this:

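# Capture the text inside the first [...] of each row and split it into a list
df['FruitList'] = df['REMARKS'].str.extract(r'\[(.+?)\]').squeeze().str.split(',')
# Capture the digits after "@ " (NaN where a row has no code)
df['CODE'] = df['REMARKS'].str.extract(r'@\s(\d+)')
# explode returns a new DataFrame with one row per list element
df.explode('FruitList')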

Output:

  KEY RECORD SERIAL                                            REMARKS   FruitList  CODE
0   1      1   1470     FRUIT[APPLES,ORANGES,PEARS] IS HEALTHY FOR YOU      APPLES   NaN
0   1      1   1470     FRUIT[APPLES,ORANGES,PEARS] IS HEALTHY FOR YOU     ORANGES   NaN
0   1      1   1470     FRUIT[APPLES,ORANGES,PEARS] IS HEALTHY FOR YOU       PEARS   NaN
1   2      1   2321  I LIKE FRUIT[BANANAS,CHERRIES,GRAPES], BUT I D...     BANANAS   NaN
1   2      1   2321  I LIKE FRUIT[BANANAS,CHERRIES,GRAPES], BUT I D...    CHERRIES   NaN
1   2      1   2321  I LIKE FRUIT[BANANAS,CHERRIES,GRAPES], BUT I D...      GRAPES   NaN
2   3      1    300   THERE IS FRUIT[LEMONS,ORANGES,GRAPEFRUIT] @ 1234      LEMONS  1234
2   3      1    300   THERE IS FRUIT[LEMONS,ORANGES,GRAPEFRUIT] @ 1234     ORANGES  1234
2   3      1    300   THERE IS FRUIT[LEMONS,ORANGES,GRAPEFRUIT] @ 1234  GRAPEFRUIT  1234

And you can drop REMARKS if you would like:

df.explode('FruitList').drop('REMARKS', axis=1)

Output:

  KEY RECORD SERIAL   FruitList  CODE
0   1      1   1470      APPLES   NaN
0   1      1   1470     ORANGES   NaN
0   1      1   1470       PEARS   NaN
1   2      1   2321     BANANAS   NaN
1   2      1   2321    CHERRIES   NaN
1   2      1   2321      GRAPES   NaN
2   3      1    300      LEMONS  1234
2   3      1    300     ORANGES  1234
2   3      1    300  GRAPEFRUIT  1234
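
Note that the output above has no CANTALOPE or HONEYDEW rows: str.extract only captures the first bracketed list in each row, and KEY 2 contains two of them. A minimal sketch of one way to pick up every list, assuming str.findall is acceptable here and that every row has at least one FRUIT[...] group (the names groups and result, and the FRUIT column name taken from the question's expected output, are my own choices, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['1', '2', '3'],
    'RECORD': ['1', '1', '1'],
    'SERIAL': ['1470', '2321', '300'],
    'REMARKS': [
        'FRUIT[APPLES,ORANGES,PEARS] IS HEALTHY FOR YOU',
        "I LIKE FRUIT[BANANAS,CHERRIES,GRAPES], BUT I DON'T LIKE FRUIT[CANTALOPE,HONEYDEW]",
        'THERE IS FRUIT[LEMONS,ORANGES,GRAPEFRUIT] @ 1234',
    ],
})

# findall returns, for each row, a list with the contents of every [...] group
groups = df['REMARKS'].str.findall(r'\[(.+?)\]')

# Join the groups of a row with commas, then split into individual fruits
df['FRUIT'] = groups.str.join(',').str.split(',')

# Digits after "@ " become the code; rows without one get NaN
df['CODE'] = df['REMARKS'].str.extract(r'@\s(\d+)', expand=False)

# One row per fruit, without the original REMARKS text
result = (
    df.drop(columns='REMARKS')
      .explode('FRUIT')
      .reset_index(drop=True)
)
print(result)
```

With the sample data this should give 11 rows, including the two fruits from the second bracketed list for KEY 2. Missing codes come back as NaN rather than the string 'null' shown in the question.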

3 Comments

Thank you for this. It gave me a starting point. My only issue is that the brackets are coming across to the new column, which is keeping the exploding from working. However, at least I now know about squeezing/exploding.
From my testing, the split and squeezing are working fine. It is the exploding that is not working. All it returns is the fruit list split apart like this: [APPLES, ORANGES, PEARS] still all on the same line.
I finally got it. I had to assign the result of explode back to df, and then it worked (df = df.explode('FruitList')).
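
The behaviour described in the last comment comes from DataFrame.explode returning a new DataFrame instead of modifying df in place. A small sketch with made-up two-row data (the variable names unchanged_rows and exploded_rows are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'KEY': ['1', '2'],
    'FruitList': [['APPLES', 'PEARS'], ['GRAPES']],
})

# The exploded frame is returned and then discarded; df itself is untouched
df.explode('FruitList')
unchanged_rows = len(df)

# Assigning the result back keeps the exploded rows
df = df.explode('FruitList')
exploded_rows = len(df)

print(unchanged_rows, exploded_rows)
```

In a notebook the first call still displays the exploded rows, which can make it look as if df had changed.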
