I am trying to extract the location/product codes from a SQL table using pandas. The field is an array type, i.e. each row holds multiple values as a list. I need to extract the code values from the strings in that list.

Here is a sample of the table

df.head()
Target_Type Constraints
45          ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1   
45          ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1   
45          ti_8894,trad_8894_0.2

Now I want to extract the numeric values of the codes. I also want to ignore the trailing float values after the second underscore in the entries, i.e. ignore the _1, _0.2, etc.

Here is a sample of the output I want to achieve. It should be a unique list/df column of all the extracted values:

 Target_Type_45_df.head()
 Constraints
 8188
 9258
 22420
 8894

I have never worked with a nested/array-type column before. Any help would be appreciated.

2 Answers

1

You can use explode to put each value in its own row, under one column:

df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])

3 Comments

Thanks @yashar and @adrian, I was able to solve it. A direct explode didn't work, so I used the following instead:
df = df.assign(constraints=df['constraints'].str.split(',')).explode('constraints')
df['constraints'] = df['constraints'].apply(lambda x: str(x).split('_')[1])
You're welcome :) I am just curious why the explode didn't work. I replicated the sample data and it worked just fine. Did you get an error or an unusable result?
It didn't throw any error; it simply didn't do anything. My dataframe was the same after running the script.
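
A likely explanation for the silent no-op, offered as a guess since the real column type isn't shown: explode only expands list-like cells, so if Constraints arrives from SQL as one comma-separated string per row, every row is left as it is. The sketch below rebuilds the question's sample with plain strings (an assumption), shows that the row count does not change, then splits first and pulls out the digits after the first underscore. The names rows_before, rows_after, exploded and unique_codes are made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the question; Constraints holds plain strings.
df = pd.DataFrame({
    "Target_Type": [45, 45, 45],
    "Constraints": [
        "ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1",
        "ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1",
        "ti_8894,trad_8894_0.2",
    ],
})

# explode only expands list-like cells; strings are scalars, so no rows are added.
rows_before = len(df)
rows_after = len(df.explode("Constraints"))

# Splitting first turns each cell into a list, which explode can then expand.
exploded = df.assign(Constraints=df["Constraints"].str.split(",")).explode("Constraints")

# Capture the digits right after the first underscore; a trailing "_1" or "_0.2" is ignored.
codes = exploded["Constraints"].str.extract(r"_(\d+)", expand=False).astype(int)
unique_codes = codes.drop_duplicates().tolist()

print(rows_before, rows_after)
print(unique_codes)
```

Checking type(df['Constraints'].iloc[0]) on the real data should tell you whether the cells are strings or lists.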
1

I would think the following overall strategy would work well (you'll need to debug):

  1. Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
  2. In this function, set my_list = row['Constraints'].
  3. Then do my_list = my_list.split(','). Now you have a list, with no commas.
  4. Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
  5. Finally, convert to a set: return set(numbers)

The output for each row will be a set - just union all these sets together to get the final result.
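
The steps above can be sketched roughly as follows. This assumes each Constraints cell is a single comma-separated string, as in the question's sample; the function name extract_codes and the sample frame are hypothetical, and real data may need extra handling for missing values or entries without an underscore.

```python
import pandas as pd

def extract_codes(row):
    """Return the set of integer codes found in one row's Constraints string."""
    my_list = row["Constraints"].split(",")
    # Index 1 is the code itself; anything after a second underscore is dropped.
    return {int(element.split("_")[1]) for element in my_list}

# Hypothetical sample mirroring the question.
df = pd.DataFrame({
    "Target_Type": [45, 45, 45],
    "Constraints": [
        "ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1",
        "ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1",
        "ti_8894,trad_8894_0.2",
    ],
})

# One set per row, then a union across all rows for the final unique codes.
per_row = df.apply(extract_codes, axis=1)
all_codes = set().union(*per_row)

print(sorted(all_codes))
```

Sets are unordered, so sort the result (as in the print call) if a stable order matters.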
