
The data I am working with is below:

Name RefSecondary     RefMain
test  2               3   
bet   3               4   
get   1               2   
set   null            1   
net   3               5

I wrote a very simple loop that checks whether each RefMain value appears in RefSecondary and builds the hierarchy:

import pandas as pd

sys_role = 'sample.xlsx'
df = pd.read_excel(sys_role, na_filter=False).apply(lambda x: x.astype(str).str.strip())
count = len(df)  # number of rows
for i in range(count):
    for j in range(count):
        # prefix row j's name with row i's name when row j references row i
        if df.iloc[i]['RefMain'] == df.iloc[j]['RefSecondary']:
            df.iloc[j, df.columns.get_loc('Name')] = "/".join([df.iloc[i]['Name'], df.iloc[j]['Name']])

The result I am getting is below:

   Result          RefMain
0  get/test           3
1  test/bet           4
2  set/get            2
3  set                1
4  test/net           5

This is really slow, and the logic only resolves one level of the hierarchy. Is there a way I can get this done faster?

Logic needs to be as below:

 1) Take a value from the RefMain column and find its corresponding RefSecondary value.
 2) Look up that RefSecondary value in the RefMain column.
 3) If it is found, go back to step 1 and repeat.
 4) Continue recursively until RefSecondary is null or has no match in RefMain.
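
The steps above can be sketched in plain Python as a dictionary walk. This is only an illustration under stated assumptions: the table is held as a hypothetical list of (Name, RefSecondary, RefMain) tuples with None standing in for null, and the names rows, name_of, parent_of, and build_path are made up for the example.

```python
# Hypothetical stand-in for the table: (Name, RefSecondary, RefMain),
# with None where RefSecondary is null.
rows = [
    ("test", 2, 3),
    ("bet", 3, 4),
    ("get", 1, 2),
    ("set", None, 1),
    ("net", 3, 5),
]

# Lookup tables keyed by RefMain.
name_of = {main: name for name, _, main in rows}
parent_of = {main: secondary for _, secondary, main in rows}


def build_path(ref_main):
    """Follow RefSecondary links upward and return the names root-first."""
    names = []
    seen = set()
    current = ref_main
    # Stop when the reference is null or does not appear in RefMain.
    while current in name_of:
        if current in seen:
            raise ValueError(f"cycle detected at RefMain={current}")
        seen.add(current)
        names.append(name_of[current])
        current = parent_of[current]
    return "/".join(reversed(names))


result = [(build_path(main), main) for _, _, main in rows]
for path, main in result:
    print(path, main)
```

Each lookup is a dictionary access, so the cost per row is proportional to the depth of its chain rather than to the number of rows in the table.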

The resulting dataframe should look like this:

   Result            RefMain
0  set/get/test          3
1  set/get/test/bet      4
2  set/get               2
3  set                   1
4  set/get/test/net      5
  • The question is missing? Commented Sep 20, 2019 at 6:07
  • Updated to clarify Commented Sep 20, 2019 at 6:20
  • your logic is still not clear, can you try to explain how you got second row for RefMain=4 Commented Sep 23, 2019 at 5:02
  • RefMain 4 has a corresponding RefSecondary value of 3. Now 3 can be found in the RefMain column, and its corresponding RefSecondary is 2. Now 2 can be found in the RefMain column, and its RefSecondary is 1. Now 1 can be found in the RefMain column, and its RefSecondary is null or has no match. Since there is no match, the flow stops and all the names are joined. Commented Sep 23, 2019 at 5:08

4 Answers


This sounds like a graph problem. You can try networkx as follows:

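import networkx as nx
import pandas as pd

# only matters if the missing reference is NaN; a 'null' string is left untouched
df = df.fillna(-1)

# create a graph
G = nx.DiGraph()

# add references as edges (RefMain -> RefSecondary)
G.add_edges_from(zip(df['RefMain'], df['RefSecondary']))

# rename the nodes accordingly
G = nx.relabel_nodes(G, mapping=df.set_index('RefMain')['Name'].to_dict())

# all shortest paths as {source: {target: path}}; dict() also accepts
# an iterator of (source, paths) pairs
paths = dict(nx.shortest_path(G))

# merge the paths that end at 'null' into the dataframe
df = df.merge(pd.DataFrame(paths).T['null'],
              left_on='Name',
              right_index=True)

# new column: drop the trailing 'null' and reverse so the root comes first
df['Path'] = df['null'].apply(lambda x: '/'.join(x[-2::-1]))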

Output:

   Name RefSecondary RefMain                         null              Path
0  test            2       3       [test, get, set, null]      set/get/test
1   bet            3       4  [bet, test, get, set, null]  set/get/test/bet
2   get            1       2             [get, set, null]           set/get
3   set         null       1                  [set, null]               set
4   net            3       5  [net, test, get, set, null]  set/get/test/net

8 Comments

This line raises KeyError: -1: df = df.merge(pd.DataFrame(nx.shortest_path(G)).T[-1]
I did. For clarity I have written the code around how I load the Dataframe as well, just in case that be of some relevance.
Your original data had nan instead of null at row 4. Is null the string 'null'? Also, are the numbers strings or ints?
The numbers are ints, and yes, null is the string 'null'.
In that case, replace the [-1] by ['null']. See edit.

The following code looks up a ref (1 in this case) and keeps following references until no matching row is found:

def lookup(df, ref):
    arr_result=[]
    result = []
    row = df[df.RefMain==ref]
    while len(row)>0:
        arr_result.append(row.Name.iloc[0])
        result.append(("/".join(arr_result), row.RefMain.iloc[0]))
        row = df[df.RefSecondary == row.RefMain.iloc[0] ]

    return pd.DataFrame(result, columns=["Result", "RefMain"])

lookup(df,1)

Output

             Result  RefMain
0               set        1
1           set/get        2
2      set/get/test        3
3  set/get/test/bet        4

In the question above, how do you get the row "set/get/test/net 5"? Did I miss something, or is it a mistake?
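
For context, in the question's data net also has RefSecondary 3, so RefMain 3 has two children (bet and net), while lookup follows only the first match at each step. A minimal sketch of a top-down traversal that visits every child is below. It is not the code from this answer, and it assumes the table is held as plain tuples with None for the null reference; the names rows, children, and paths are illustrative.

```python
from collections import defaultdict, deque

# Hypothetical stand-in for the table: (Name, RefSecondary, RefMain).
rows = [
    ("test", 2, 3),
    ("bet", 3, 4),
    ("get", 1, 2),
    ("set", None, 1),
    ("net", 3, 5),
]

# Group rows by the RefMain value they reference; rows without a
# reference are the roots of the hierarchy.
children = defaultdict(list)
roots = []
for name, secondary, main in rows:
    if secondary is None:
        roots.append((name, main))
    else:
        children[secondary].append((name, main))

# Breadth-first walk from the roots, extending the path for every child.
paths = {}
queue = deque(roots)
while queue:
    path, main = queue.popleft()
    paths[main] = path
    for child_name, child_main in children[main]:
        queue.append((f"{path}/{child_name}", child_main))

for main, path in paths.items():
    print(path, main)
```

Every row is visited once, and rows that share a parent (bet and net here) each get their own path.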

1 Comment

I have updated the question with what I have done till now.

You can set the column RefMain as index and access strings using the method reindex():

import pandas as pd

# Convert 'RefSecondary' to numeric ('null' becomes NaN) and set 'RefMain' as index
df['RefSecondary'] = pd.to_numeric(df.RefSecondary, errors='coerce')
df.set_index('RefMain', drop=False, inplace=True)

lst = [df['Name'].values]
new_df = df.copy()

# Iterate until all values in 'Name' are NaN 
while new_df['Name'].notna().any():
    new_df = df.reindex(new_df['RefSecondary'])
    lst.append(new_df['Name'].values)

You get the following list of arrays lst:

[array(['test', 'bet', 'get', 'set', 'net'], dtype=object),
 array(['get', 'test', 'set', nan, 'test'], dtype=object),
 array(['set', 'get', nan, nan, 'get'], dtype=object),
 array([nan, 'set', nan, nan, 'set'], dtype=object),
 array([nan, nan, nan, nan, nan], dtype=object)]

Now you can join strings and create a new df.

# Reverse so the root comes first, and drop the NaN placeholders before joining
result = ['/'.join(filter(pd.notna, i)) for i in zip(*lst[::-1])]
result = pd.DataFrame({'Result': result, 'RefMain': df['RefMain'].values})

Final result:

             Result  RefMain
0      set/get/test        3
1  set/get/test/bet        4
2           set/get        2
3               set        1
4  set/get/test/net        5


This code does the work with merges. It is a bit twisted, but it should run fast because there are no row iterations.

In short, it keeps merging until all new RefSecondary values are null.

I guess it could be further optimized by masking the merge operation as well.

import numpy as np
import pandas as pd

df_ref = df.copy()

df.rename(columns={'Name':'Result'},inplace=True)

# keep merging each row with its parent until no row has a parent left
while not np.all(pd.isnull(df['RefSecondary'])):
    df = df.merge(df_ref,how='left',
                  left_on='RefSecondary',right_on='RefMain',
                  suffixes=('_old',''))
    mask_=pd.notnull(df['RefMain'])
    # prepend the parent's name so the path reads root-first
    df.loc[mask_,'Result'] = df.loc[mask_,'Name']+'/'+df.loc[mask_,'Result']
    df.drop(['RefSecondary_old','RefMain_old','Name'],axis='columns',inplace=True)

df = df[['Result']].join(df_ref['RefMain'])

Source data:

df = pd.DataFrame(data=[['test',2,3],
                    ['bet',3,4],
                    ['get',1,2],
                    ['set','null',1],
                    ['net',3,5]], 
              columns=['Name','RefSecondary','RefMain'])

By the way, this code assumes that the original data is consistent. For instance, if there were a cycle in the links, it would be trapped in an infinite loop.
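
One way to guard against that is to check the links for cycles before running the loop. The sketch below is an assumption-laden illustration, not part of the answer's code: it takes a hypothetical dictionary mapping each RefMain to its RefSecondary (None for null), and has_cycle is a made-up helper name.

```python
def has_cycle(parent_of):
    """Return True if following the links from any key revisits a key."""
    for start in parent_of:
        seen = set()
        current = start
        # Walk upward until the reference is null or unknown.
        while current in parent_of:
            if current in seen:
                return True
            seen.add(current)
            current = parent_of[current]
    return False


# Links from the question's data: RefMain -> RefSecondary.
consistent_links = {3: 2, 4: 3, 2: 1, 1: None, 5: 3}
# A deliberately broken example where 1 and 2 reference each other.
cyclic_links = {1: 2, 2: 1}

print(has_cycle(consistent_links))  # False
print(has_cycle(cyclic_links))      # True
```

With a DataFrame, such a dictionary could be built from the RefMain and RefSecondary columns before entering the while loop. The check walks each chain separately, so it favors simplicity over speed.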

3 Comments

df.loc[mask_,'Result'] = df.loc[mask_,'Result']+'/'+df.loc[mask_,'index'] this line shows KeyError: 'Result'
There is no column index which is being renamed in the code above. Also Result is a column I have created in the output dataframe.
I adapted my response to your comments.
