create hierarchy using two columns in pandas

Question

Data I am working with is below:

Name RefSecondary     RefMain
test  2               3   
bet   3               4   
get   1               2   
set   null            1   
net   3               5

I wrote a very simple loop that looks up matching values in the dataframe and builds the hierarchy:

import pandas as pd

sys_role = 'sample.xlsx'
df = pd.read_excel(sys_role, na_filter=False).apply(lambda x: x.astype(str).str.strip())
count = len(df)  # assumed: the number of rows in the dataframe
for i in range(count):
    for j in range(count):
        # row j references row i as its parent: prefix the parent's name
        if df.iloc[i]['RefMain'] == df.iloc[j]['RefSecondary']:
            df.iloc[j, df.columns.get_loc('Name')] = "/".join([df.iloc[i]['Name'], df.iloc[j]['Name']])

The results I am getting are below:

   Result          RefMain
0  get/test           3
1  test/bet           4
2  set/get            2
3  set                1
4  test/net           5

This is really slow, and the logic doesn't work correctly either. Is there a way I can get this done faster?

Logic needs to be as below:

 1) Take a value from column RefMain and find its corresponding RefSecondary value.
 2) Look up that RefSecondary value in RefMain.
 3) If found, go back to step 1 and repeat.
 4) This continues recursively until no value (or null) is found in the RefSecondary column.

The resulting dataframe should look like this:

   Result            RefMain
0  set/get/test          3
1  set/get/test/bet      4
2  set/get               2
3  set                   1
4  set/get/test/net      5

Your logic is still not clear. Can you explain how you got the second row, for RefMain=4?
– Dev Khadka, Commented Sep 23, 2019 at 5:02
RefMain 4 has a corresponding RefSecondary value of 3. 3 can be found in the RefMain column, and its RefSecondary is 2. 2 can be found in the RefMain column, and its RefSecondary is 1. 1 can be found in the RefMain column, and its RefSecondary is null (no match). Since there is no match, the flow stops and all the names are joined.
– misguided, Commented Sep 23, 2019 at 5:08
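
The walk described in the comment above can be sketched in plain Python. This is only an illustration, assuming each RefMain value is unique, the links contain no cycle, and the string 'null' marks a row without a parent. The names rows, name, parent, and build_path are made up for the sketch.

```python
# Sample rows from the question: (Name, RefSecondary, RefMain).
rows = [
    ("test", 2, 3),
    ("bet", 3, 4),
    ("get", 1, 2),
    ("set", "null", 1),
    ("net", 3, 5),
]

# Lookup tables keyed by RefMain.
name = {ref_main: n for n, _, ref_main in rows}
parent = {ref_main: ref_sec for _, ref_sec, ref_main in rows}


def build_path(ref_main):
    """Follow RefSecondary links upward until no matching RefMain exists."""
    parts = []
    current = ref_main
    while current in name:  # stops at 'null' or any unmatched value
        parts.append(name[current])
        current = parent[current]
    return "/".join(reversed(parts))  # root first


paths = {ref_main: build_path(ref_main) for _, _, ref_main in rows}
print(paths[4])  # set/get/test/bet
```

Each step is a dictionary lookup, so the cost per row grows with the depth of its chain rather than with the size of the whole table.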

Quang Hoang · Accepted Answer · 2019-09-25 00:46:42Z

4

+50

This sounds like a graph problem. You can try networkx as follows:

import networkx as nx
import pandas as pd

df = df.fillna(-1)

# create a graph
G = nx.DiGraph()

# add reference as edges
G.add_edges_from(zip(df['RefMain'],df['RefSecondary'] ))

# rename the nodes accordingly
G = nx.relabel_nodes(G, mapping=df.set_index('RefMain')['Name'].to_dict())


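# merge the path list to the dataframe
# (dict() is a no-op on a dict; as far as I know it also keeps this working on
#  newer networkx releases where shortest_path may return an iterator of pairs)
df = df.merge(pd.DataFrame(dict(nx.shortest_path(G))).T['null'],
              left_on='Name',
              right_index=True)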

# new column:
df['Path'] = df['null'].apply(lambda x: '/'.join(x[-2::-1]) )

Output:

   Name RefSecondary RefMain                         null              Path
0  test            2       3       [test, get, set, null]      set/get/test
1   bet            3       4  [bet, test, get, set, null]  set/get/test/bet
2   get            1       2             [get, set, null]           set/get
3   set         null       1                  [set, null]               set
4   net            3       5  [net, test, get, set, null]  set/get/test/net

edited Sep 25, 2019 at 0:46

answered Sep 24, 2019 at 16:22

Quang Hoang


8 Comments

misguided Over a year ago

The line df = df.merge(pd.DataFrame(nx.shortest_path(G)).T[-1] shows KeyError: -1

misguided Over a year ago

I did. For clarity I have written the code around how I load the Dataframe as well, just in case that be of some relevance.

Quang Hoang Over a year ago

Your original data had nan instead of null at row 4. Is null the string 'null'? Also, are the numbers strings or ints?

misguided Over a year ago

The numbers are ints, and yes, null is the string 'null'.

Quang Hoang Over a year ago

In that case, replace the [-1] by ['null']. See edit.
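
The KeyError discussed in these comments comes down to how the missing parent is encoded. Here is a small sketch, assuming the column mixes integers with the literal string 'null' as described above: fillna() leaves such a string untouched, while pd.to_numeric(..., errors='coerce') turns it into NaN. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["test", "bet", "get", "set", "net"],
        "RefSecondary": [2, 3, 1, "null", 3],
        "RefMain": [3, 4, 2, 1, 5],
    }
)

# 'null' is an ordinary string, so pandas does not treat it as missing.
missing_before = int(df["RefSecondary"].isna().sum())

# fillna() therefore has nothing to replace; the string survives.
filled = df["RefSecondary"].fillna(-1)
still_has_null_string = "null" in filled.tolist()

# Coercing to numeric turns anything unparseable into NaN.
coerced = pd.to_numeric(df["RefSecondary"], errors="coerce")
missing_after = int(coerced.isna().sum())

print(missing_before, still_has_null_string, missing_after)
```

Which label ends up in the graph (-1, NaN, or 'null') decides which key has to be used when selecting the path column afterwards.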


Dev Khadka · Accepted Answer · 2019-09-20 08:52:39Z

2

The following code looks up a ref (1 in this case) and keeps following the references until no row is found:

def lookup(df, ref):
    arr_result=[]
    result = []
    row = df[df.RefMain==ref]
    while len(row)>0:
        arr_result.append(row.Name.iloc[0])
        result.append(("/".join(arr_result), row.RefMain.iloc[0]))
        row = df[df.RefSecondary == row.RefMain.iloc[0] ]

    return pd.DataFrame(result, columns=["Result", "RefMain"])

lookup(df,1)

Output

             Result  RefMain
0               set        1
1           set/get        2
2      set/get/test        3
3  set/get/test/bet        4

In the question above, how do you get the row "set/get/test/net 5"? Did I miss something, or is it a mistake?
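
On the row for RefMain 5: lookup() follows only the first child it finds at each level, so net (a sibling of bet under test) never shows up. A possible variant, sketched under the assumption that RefMain values are unique and the links contain no cycle, visits every child breadth-first. The name lookup_all is hypothetical.

```python
from collections import deque

import pandas as pd

df = pd.DataFrame(
    [["test", 2, 3], ["bet", 3, 4], ["get", 1, 2], ["set", "null", 1], ["net", 3, 5]],
    columns=["Name", "RefSecondary", "RefMain"],
)


def lookup_all(df, ref):
    """Breadth-first walk from `ref` down through every child, not just the first."""
    names = dict(zip(df["RefMain"], df["Name"]))
    # Map each parent reference to the list of rows that point at it.
    children = {}
    for ref_sec, ref_main in zip(df["RefSecondary"], df["RefMain"]):
        children.setdefault(ref_sec, []).append(ref_main)

    result = []
    queue = deque([(ref, names[ref])])
    while queue:
        node, path = queue.popleft()
        result.append((path, node))
        for child in children.get(node, []):
            queue.append((child, path + "/" + names[child]))
    return pd.DataFrame(result, columns=["Result", "RefMain"])


out = lookup_all(df, 1)
print(out)
```

With the sample data this returns five rows, including one for RefMain 5.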

answered Sep 20, 2019 at 8:52

Dev Khadka


1 Comment

misguided Over a year ago

I have updated the question with what I have done till now.

Mykola Zotko · Accepted Answer · 2019-09-27 09:15:46Z

You can set the column RefMain as the index and look up the parent names with the reindex() method:

import pandas as pd

# Convert 'RefSecondary' to numeric and set 'RefMain' as index
df['RefSecondary'] = pd.to_numeric(df.RefSecondary, errors='coerce')
df.set_index('RefMain', drop=False, inplace=True)

lst = [df['Name'].values]
new_df = df.copy()

# Iterate until all values in 'Name' are NaN 
while new_df['Name'].notna().any():
    new_df = df.reindex(new_df['RefSecondary'])
    lst.append(new_df['Name'].values)

You get the following list of arrays lst:

[array(['test', 'bet', 'get', 'set', 'net'], dtype=object),
 array(['get', 'test', 'set', nan, 'test'], dtype=object),
 array(['set', 'get', nan, nan, 'get'], dtype=object),
 array([nan, 'set', nan, nan, 'set'], dtype=object),
 array([nan, nan, nan, nan, nan], dtype=object)]

Now you can join strings and create a new df.

# Drop the NaN placeholders, then join; zip(*lst[::-1]) yields one root-first tuple per row
result = ['/'.join(filter(pd.notna, i)) for i in zip(*lst[::-1])]
result = pd.DataFrame({'Result': result, 'RefMain': df['RefMain'].values})

Final result:

             Result  RefMain
0      set/get/test        3
1  set/get/test/bet        4
2           set/get        2
3               set        1
4  set/get/test/net        5
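
The join step may be easier to follow on plain lists. The sketch below mirrors the shape of lst, with None standing in for NaN (an assumption made only to keep the example free of dependencies): reversing the levels puts the most distant ancestor first, and zip(*...) regroups the values per row.

```python
# Each inner list is one "level": the row's own name, then its parent,
# grandparent, and so on. None marks "no ancestor at this level".
levels = [
    ["test", "bet", "get", "set", "net"],
    ["get", "test", "set", None, "test"],
    ["set", "get", None, None, "get"],
    [None, "set", None, None, "set"],
]

# Reverse so the farthest ancestor comes first, then transpose with zip
# to get one tuple per row, e.g. (None, 'set', 'get', 'test') for 'test'.
per_row = list(zip(*levels[::-1]))

# Drop the placeholders and join what is left.
paths = ["/".join(p for p in row if p is not None) for row in per_row]
print(paths)
```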

HerrIvan · Accepted Answer · 2019-09-25 07:19:08Z

1

This code does the work with merges. It is a bit convoluted, but it should run fast, probably because there are no row iterations.

In short, it keeps merging until all new RefSecondary values are null.

I guess it could be optimized further by masking the merge operation as well.

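import numpy as np
import pandas as pd

# treat 'null' (or any non-numeric marker) as missing so the merge keys stay numeric
df['RefSecondary'] = pd.to_numeric(df['RefSecondary'], errors='coerce')

df_ref = df.copy()

df.rename(columns={'Name':'Result'},inplace=True)

while not np.all(pd.isnull(df['RefSecondary'])):
    # look up each row's current parent; rows without a match get NaN
    df = df.merge(df_ref,how='left',
                  left_on='RefSecondary',right_on='RefMain',
                  suffixes=('_old',''))
    mask_=pd.notnull(df['RefMain'])
    # prepend the parent's name so the path reads root-first (set/get/test)
    df.loc[mask_,'Result'] = df.loc[mask_,'Name']+'/'+df.loc[mask_,'Result']
    df.drop(['RefSecondary_old','RefMain_old','Name'],axis='columns',inplace=True)


df = df[['Result']].join(df_ref['RefMain'])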

Source data:

df = pd.DataFrame(data=[['test',2,3],
                    ['bet',3,4],
                    ['get',1,2],
                    ['set','null',1],
                    ['net',3,5]], 
              columns=['Name','RefSecondary','RefMain'])

By the way, this code assumes that the original data is consistent. For instance, if there were a cycle in the links, it would be trapped in an infinite loop.
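
One way to guard against that, sketched here in plain Python rather than as a change to the merge loop, is to walk each chain while remembering the values already visited and stop with an error when one repeats. The function name check_no_cycles and the dict-based input (RefMain mapped to RefSecondary) are assumptions for illustration.

```python
def check_no_cycles(parent):
    """Raise ValueError if following parent links ever revisits a node.

    `parent` maps each RefMain value to its RefSecondary value.
    """
    for start in parent:
        seen = set()
        current = start
        while current in parent:  # a value with no entry ends the chain
            if current in seen:
                raise ValueError(f"cycle detected starting from {start!r}")
            seen.add(current)
            current = parent[current]


# Consistent data from the question: every chain ends at 'null'.
check_no_cycles({3: 2, 4: 3, 2: 1, 1: "null", 5: 3})

# Inconsistent data: 1 -> 2 -> 1 would never terminate without the guard.
try:
    check_no_cycles({1: 2, 2: 1})
    cycle_found = False
except ValueError:
    cycle_found = True
print(cycle_found)  # True
```

Running such a check once before the loop, for example on dict(zip(df['RefMain'], df['RefSecondary'])), would turn the endless loop into an immediate error.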

edited Sep 25, 2019 at 7:19

answered Sep 24, 2019 at 7:45

HerrIvan


3 Comments

misguided Over a year ago

df.loc[mask_,'Result'] = df.loc[mask_,'Result']+'/'+df.loc[mask_,'index'] this line shows KeyError: 'Result'

misguided Over a year ago

There is no column index which is being renamed in the code above. Also Result is a column I have created in the output dataframe.

HerrIvan Over a year ago

I adapted my response to your comments.
