Create binary columns out of data nested in another df's column

Question

This one is weird --

let's say I have a df like this:

user_id     city      state     network
123         austin    tx        att
113         houston   tx        tmobile
343         miami     fl        att
356         seattle   wa        verizon

and I have another df1 like this (these 2 dfs won't be the same shape):

col1
'network': 'att'
'city': 'austin'
'state': 'tx'
'city': 'seattle'

I'm trying to build a final_df like this:

user_id     is_network_att      is_city_austin      is_state_tx    is_city_seattle
123         1                   1                   1              0
113         0                   0                   1              0
343         1                   0                   0              0
356         0                   0                   0              1

Easier to just show it - but a sentence to describe it: I'm trying to create conditional/true-false columns out of df1.col1 in a new final_df, based on the data in df's columns.

Strategies I'm trying:

-throw the df1 values in a list or dictionary, loop through each element, and then somehow loop through each row and incorporate an if statement for each row

-maybe make a makeshift column in df1 holding the exact code that would create the column in final_df, and somehow use the text in this column as code

Here's a handful of the rows I'm trying to put in the dictionary:
912      'organization': 'atlantic metro communications'
913          'isp_name': 'Atlantic Metro Communications'
915                       'location_name': 'martinez ca'
917                       'location_name': 'martinez ca'
918                       'location_name': 'martinez ca'
919                       'location_name': 'martinez ca'
920                     'isp_name': 'Hurricane Electric'
922                 'organization': 'hurricane electric'
923                 'organization': 'hurricane electric'
924                     'isp_name': 'Hurricane Electric'
925                           'count_users_per_ip': 28.0
926      'organization': 'atlantic metro communications'
927          'isp_name': 'Atlantic Metro Communications'
928                     'isp_name': 'Hurricane Electric'
929                 'organization': 'hurricane electric'
930                     'isp_name': 'Hurricane Electric'
931                 'organization': 'hurricane electric'
932                    'location_name': 'hermosillo son'
933      'organization': 'atlantic metro communications'
934          'isp_name': 'Atlantic Metro Communications'
935                             'location_state': ' son'
966                           'count_users_per_ip': 28.0
1057                       'count_users_per_device': 4.0
1218                           'count_ips_per_user': 3.0
1408                    'moderated_action': 'SOFT_BLOCK'
1418                    'moderated_action': 'SOFT_BLOCK'
1430                    'moderated_action': 'SOFT_BLOCK'
1438                    'moderated_action': 'SOFT_BLOCK'
1517                            'app_build': '405000004'
1605                            'app_build': '405000004'
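Rows like these mix quoted strings with bare numbers (28.0) and repeat the same key, so one possible first step is to parse each row into a (key, value) pair instead of a single dictionary. The sketch below assumes every row is a string of the form 'key': value with a valid Python literal as the value; the sample df1 and the helper name parse_pair are made up for illustration.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample in the same shape as the rows above.
df1 = pd.DataFrame({'col1': [
    "'organization': 'hurricane electric'",
    "'isp_name': 'Hurricane Electric'",
    "'count_users_per_ip': 28.0",
    "'organization': 'hurricane electric'",   # exact duplicate row
    "'location_state': ' son'",
]})


def parse_pair(text):
    """Parse one "'key': value" string into a (key, value) tuple."""
    parsed = literal_eval('{' + text + '}')   # evaluates to a one-item dict
    return next(iter(parsed.items()))


# dict.fromkeys drops exact duplicate pairs while keeping first-seen order.
pairs = list(dict.fromkeys(parse_pair(t) for t in df1['col1']))
print(pairs)
```

A list of pairs keeps every distinct value per key, which a plain dictionary cannot do.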

Update - here's as far as I've got:

def transpose_features(df1,col1,main_df,attr1,attr2):
    from ast import literal_eval

    # dic = literal_eval(f"{{{', '.join(df1[col1])}}}")
    
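    # Map each attribute name to the list of all values listed for it
    # (uses the df1 argument rather than the global df_features)
    dic = {}
    for i in df1[attr1].tolist():
        dic[i] = df1[df1[attr1] == i][attr2].tolist()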

    df_final = (main_df.drop(columns=list(dic))
             .join(main_df[list(dic)].eq(dic).astype(int)
                   .rename(columns=lambda x: f'is_{x}_{dic[x]}')
                  )
          )

    print(df_final.shape)
    return df_final
    
df_final = transpose_features(
    df1 = df_features
    ,col1 = 'attr'
    ,main_df = df
    ,attr1 = 'attr1'
    ,attr2 = 'attr2'
)

df_final.head()

-This code pulls all the values for each key into a list and attaches that list to the key in the dictionary. But the issue now is that I basically need an OR condition in the method @mozway provided, one that says "does the user have ANY of the values in the list for each dict key".

Hard to even type that.
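One way to read "ANY of the values in the list" is a membership test per key, which pandas exposes as Series.isin. The following is only a sketch under that reading: the dictionary of lists is hand-written to mirror what the loop above would build, and the has_listed_ column prefix is an invented naming choice.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343, 356],
                   'city': ['austin', 'houston', 'miami', 'seattle'],
                   'state': ['tx', 'tx', 'fl', 'wa'],
                   'network': ['att', 'tmobile', 'att', 'verizon']})

# Hypothetical dictionary of lists: key -> every value listed for that key.
dic = {'network': ['att'], 'city': ['austin', 'seattle'], 'state': ['tx']}

# One flag per key: 1 if the row's value is any of the listed values, else 0.
flags = pd.DataFrame({f'has_listed_{key}': df[key].isin(values).astype(int)
                      for key, values in dic.items()})

df_any = df.drop(columns=list(dic)).join(flags)
print(df_any)
```

This yields one column per key rather than one per key/value pair, so it answers a slightly different question than the final_df shown earlier.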

Can you provide the constructor for df1? Do you have strings? Dictionaries?
– mozway, Commented Jan 20, 2023 at 15:27

mozway · Accepted Answer · 2023-01-23 07:37:08Z

1

Assuming that df1 contains strings, you can first merge them and convert to dictionary, then use it as a reference for comparison with eq:

from ast import literal_eval

# or use a different method to create the dictionary
dic = literal_eval(f"{{{', '.join(df1['col1'])}}}")
# {'network': 'att', 'city': 'austin', 'state': 'tx'}

out = (df.drop(columns=list(dic))
         .join(df[list(dic)].eq(dic).astype(int)
               .rename(columns=lambda x: f'is_{x}_{dic[x]}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx
0      123               1               1            1
1      113               0               0            1
2      343               1               0            0

Reproducible input:

import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343],
                   'city': ['austin', 'houston', 'miami'],
                   'state': ['tx', 'tx', 'fl'],
                   'network': ['att', 'tmobile', 'att']})

df1 = pd.DataFrame({'col1': ['"network": "att"', '"city": "austin"', '"state": "tx"']})
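As one example of "a different method to create the dictionary", the pairs could be pulled out with a regular expression instead of literal_eval. This is a rough sketch that assumes each row is a quoted key followed by a quoted value (single or double quotes); unquoted values such as numbers would not match. It also shows what happens when a key repeats: a dictionary keeps only the last value.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['"network": "att"', "'city': 'austin'",
                             "'state': 'tx'", "'city': 'seattle'"]})

# Capture key and value, accepting either quote style around each of them.
pattern = r"""^\s*['"](.*?)['"]\s*:\s*['"](.*?)['"]\s*$"""
pairs = df1['col1'].str.extract(pattern)   # column 0 = key, column 1 = value

dic = dict(zip(pairs[0], pairs[1]))
print(dic)   # 'city' appears twice in df1, but only one entry survives
```

With four input rows the dictionary ends up with three entries, because the second 'city' row overwrites the first.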

update to work with duplicated keys

Use a Series instead to handle duplicated keys:

s = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$").set_index(0)[1]
it = iter(s)

out = (df.drop(columns=s.index)
         .join(df[s.index].eq(s.tolist()).astype(int)
               .rename(columns=lambda x: f'is_{x}_{next(it)}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx  is_city_seattle
0      123               1               1            1                0
1      113               0               0            1                0
2      343               1               0            0                0
3      356               0               0            0                1

Reproducible input for the new df1 (the output above also assumes df includes the user 356 row from the question):

df1 = pd.DataFrame({'col1': ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'"]})
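If the stateful iterator inside rename feels fragile, the same columns can be built with an explicit loop over the extracted pairs. This is an alternative sketch, not the method from the answer above; it assumes the four-row df from the question and the same single-quoted row format.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343, 356],
                   'city': ['austin', 'houston', 'miami', 'seattle'],
                   'state': ['tx', 'tx', 'fl', 'wa'],
                   'network': ['att', 'tmobile', 'att', 'verizon']})

df1 = pd.DataFrame({'col1': ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'"]})

# Column 0 holds the key (a column name in df), column 1 the value to test.
pairs = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$")

# Start from the columns that are not used as keys, then add one flag per pair.
out = df.drop(columns=pairs[0].unique())
for key, value in pairs.itertuples(index=False, name=None):
    out[f'is_{key}_{value}'] = df[key].eq(value).astype(int)

print(out)
```

Each column name is built from the same pair that produced its values, so names and data cannot drift apart when keys repeat.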

edited Jan 23, 2023 at 7:37

answered Jan 20, 2023 at 15:33

mozway


6 Comments

max Over a year ago

that dictionary method you used works - but only slightly. it misses most of the rows i have in the column. But I tried another method with dict(zip which did the same thing. Not sure what the issue could be - something weird in the data perhaps.

mozway Over a year ago

Well, the provided method to generate the dictionary is hacky, that's why I had asked for a reproducible input, ideally with more examples. There are many ways to generate the dictionary but I'd need to know more about the exact data.

max Over a year ago

added a few rows i'm trying to put in the dictionary above. If you have any thoughts, i'm all ears! thanks again

mozway Over a year ago

@max in the updated example, you have multiple keys with more than one value, however dictionary keys must be unique. How do you want to handle it? Can you add one row "city": "houston" to your initial short example and provide the expected output?

max Over a year ago

you're right - didn't think about that. there's going to be a lot of duplicate keys - but the values for each key will be unique. I don't think a dictionary will work for this.
