This one is weird --

let's say I have a df like this:

user_id     city      state     network
123         austin    tx        att
113         houston   tx        tmobile
343         miami     fl        att
356         seattle   wa        verizon

and I have another df1 like this (these 2 dfs won't be the same shape):

col1
'network': 'att'
'city': 'austin'
'state': 'tx'
'city': 'seattle'

I'm trying to build a final_df like this:

user_id     is_network_att      is_city_austin      is_state_tx    is_city_seattle
123         1                   1                   1              0
113         0                   0                   1              0
343         1                   0                   0              0
356         0                   0                   0              1

It's easier to show than to describe, but in one sentence: for each key/value pair in df1.col1, I'm trying to create a true/false (1/0) column in a new final_df that flags whether the matching column in df holds that value.

Strategies I'm trying:

-throw the df1 values in a list or dictionary, loop through each element, and then somehow loop through each row and apply an if statement to each row

-maybe make a makeshift column in df1 holding the exact code that would create the column in final_df, and somehow use the text in this column as code

Here's a handful of the rows that I'm trying to put in a dictionary:
912      'organization': 'atlantic metro communications'
913          'isp_name': 'Atlantic Metro Communications'
915                       'location_name': 'martinez ca'
917                       'location_name': 'martinez ca'
918                       'location_name': 'martinez ca'
919                       'location_name': 'martinez ca'
920                     'isp_name': 'Hurricane Electric'
922                 'organization': 'hurricane electric'
923                 'organization': 'hurricane electric'
924                     'isp_name': 'Hurricane Electric'
925                           'count_users_per_ip': 28.0
926      'organization': 'atlantic metro communications'
927          'isp_name': 'Atlantic Metro Communications'
928                     'isp_name': 'Hurricane Electric'
929                 'organization': 'hurricane electric'
930                     'isp_name': 'Hurricane Electric'
931                 'organization': 'hurricane electric'
932                    'location_name': 'hermosillo son'
933      'organization': 'atlantic metro communications'
934          'isp_name': 'Atlantic Metro Communications'
935                             'location_state': ' son'
966                           'count_users_per_ip': 28.0
1057                       'count_users_per_device': 4.0
1218                           'count_ips_per_user': 3.0
1408                    'moderated_action': 'SOFT_BLOCK'
1418                    'moderated_action': 'SOFT_BLOCK'
1430                    'moderated_action': 'SOFT_BLOCK'
1438                    'moderated_action': 'SOFT_BLOCK'
1517                            'app_build': '405000004'
1605                            'app_build': '405000004'

Update - here's as far as I've got:

from ast import literal_eval

def transpose_features(df1, col1, main_df, attr1, attr2):
    # dic = literal_eval(f"{{{', '.join(df1[col1])}}}")

    # Map each key in attr1 to the list of all its values in attr2
    dic = {}
    for i in df1[attr1].tolist():
        dic[i] = df1[df1[attr1] == i][attr2].tolist()

    # Still compares with eq, which does not handle a list of values per key
    df_final = (main_df.drop(columns=list(dic))
             .join(main_df[list(dic)].eq(dic).astype(int)
                   .rename(columns=lambda x: f'is_{x}_{dic[x]}')
                  )
          )

    print(df_final.shape)
    return df_final

df_final = transpose_features(
    df1=df_features,
    col1='attr',
    main_df=df,
    attr1='attr1',
    attr2='attr2',
)

df_final.head()

-This code pulls all the values into a list and attaches that list to each key in the dictionary. The issue now is that I basically need an "or" condition in the method @mozway provided, one that says "does the user have ANY of the values in the list for each dict key".

Hard to even type that.
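
One possible way to express that "ANY of the values" check is `Series.isin`, which tests each row against a whole list at once. The sketch below is only an illustration under assumptions: it takes `df_features` to be a two-column table with `attr1` (the key) and `attr2` (the value), as in the function call above, and the `is_<key>_any` column names and the `df_any` variable are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [123, 113, 343, 356],
    "city": ["austin", "houston", "miami", "seattle"],
    "state": ["tx", "tx", "fl", "wa"],
    "network": ["att", "tmobile", "att", "verizon"],
})

# Assumed layout of df_features: one (key, value) pair per row.
df_features = pd.DataFrame({
    "attr1": ["network", "city", "state", "city"],
    "attr2": ["att", "austin", "tx", "seattle"],
})

# Collect every value seen for each key,
# e.g. {'city': ['austin', 'seattle'], 'network': ['att'], 'state': ['tx']}
dic = df_features.groupby("attr1")["attr2"].agg(list).to_dict()

# One flag column per key: 1 if the row's value is in that key's list, else 0.
flags = pd.DataFrame(
    {f"is_{key}_any": df[key].isin(values).astype(int)
     for key, values in dic.items()},
    index=df.index,
)

df_any = df[["user_id"]].join(flags)
print(df_any)
```

This gives one column per key rather than one per key/value pair, so it answers "does the user match any listed city" but no longer says which city matched.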

Can you provide the constructor for df1? Do you have strings? Dictionaries? Commented Jan 20, 2023 at 15:27

1 Answer

Assuming that df1 contains strings, you can first merge them and convert to dictionary, then use it as a reference for comparison with eq:

from ast import literal_eval

# or use a different method to create the dictionary
dic = literal_eval(f"{{{', '.join(df1['col1'])}}}")
# {'network': 'att', 'city': 'austin', 'state': 'tx'}

out = (df.drop(columns=list(dic))
         .join(df[list(dic)].eq(dic).astype(int)
               .rename(columns=lambda x: f'is_{x}_{dic[x]}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx
0      123               1               1            1
1      113               0               0            1
2      343               1               0            0

Reproducible input:

df = pd.DataFrame({'user_id': [123, 113, 343],
                   'city': ['austin', 'houston', 'miami'],
                   'state': ['tx', 'tx', 'fl'],
                   'network': ['att', 'tmobile', 'att']})

df1 = pd.DataFrame({'col1': ['"network": "att"', '"city": "austin"', '"state": "tx"']})

Update to work with duplicated keys

Use a Series instead to handle duplicated keys:

s = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$").set_index(0)[1]
it = iter(s)

out = (df.drop(columns=s.index)
         .join(df[s.index].eq(s.tolist()).astype(int)
               .rename(columns=lambda x: f'is_{x}_{next(it)}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx  is_city_seattle
0      123               1               1            1                0
1      113               0               0            1                0
2      343               1               0            0                0
3      356               0               0            0                1

Reproducible input for the new df1:

df1 = pd.DataFrame({'col1': ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'"]})

6 Comments

That dictionary method you used works, but only partially: it misses most of the rows I have in the column. I tried another method with dict(zip(...)), which did the same thing. Not sure what the issue could be, something weird in the data perhaps.
Well, the provided method to generate the dictionary is hacky, that's why I asked for a reproducible input, ideally with more examples. There are many ways to generate the dictionary, but I'd need to know more about the exact data.
Added a few rows I'm trying to put in the dictionary above. If you have any thoughts, I'm all ears! Thanks again.
@max in the updated example, you have multiple keys with more than one value, however dictionary keys must be unique. How do you want to handle it? Can you add one row "city": "houston" to your initial short example and provide the expected output?
You're right, I didn't think about that. There's going to be a lot of duplicate keys, but the values for each key will be unique. I don't think a dictionary will work for this.
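
Since repeated keys rule out a single dictionary, one alternative is to keep the rows as an ordered list of (key, value) pairs. The following is a minimal sketch, not the answer's method: it assumes every row of `df1['col1']` is a valid Python `'key': value` literal (as in the sample rows, including numeric values such as 28.0), and the helper `parse_pair` and the variable `pairs` are hypothetical names introduced for illustration.

```python
from ast import literal_eval

import pandas as pd

df = pd.DataFrame({
    "user_id": [123, 113, 343, 356],
    "city": ["austin", "houston", "miami", "seattle"],
    "state": ["tx", "tx", "fl", "wa"],
    "network": ["att", "tmobile", "att", "verizon"],
})

df1 = pd.DataFrame({"col1": ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'",
                             "'city': 'austin'",            # exact duplicate row
                             "'count_users_per_ip': 28.0"]})  # key not in df


def parse_pair(text):
    """Parse one "'key': value" string into a (key, value) tuple."""
    (key, value), = literal_eval("{" + text + "}").items()
    return key, value


# Parse each row on its own, then drop exact duplicates while keeping order.
pairs = list(dict.fromkeys(parse_pair(t) for t in df1["col1"]))

# One 0/1 column per pair; pairs whose key is not a column of df are skipped.
cols = {f"is_{k}_{v}": df[k].eq(v).astype(int)
        for k, v in pairs if k in df.columns}

out = df[["user_id"]].join(pd.DataFrame(cols))
print(out)
```

Parsing row by row avoids the key collisions that happen when all rows are merged into one dictionary literal, and `literal_eval` keeps numeric values as numbers instead of strings.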