I hope the following code will be helpful!
It does create the output as you need it.
import pandas as pd
data = {'agent':['Tom Kim', 'Tom Kim', 'Kim S Tom', 'Kim S Tom','Catherine W','Catherine W'], 'compare':['Tom Kim','Kim S Tom','Tommy Kim','Kim Sun Tom','Wood Catherine','Catherine W']}
names = pd.DataFrame(data)
names.agent = pd.Categorical(names.agent)
names['code'] = names.agent.cat.codes
# create unique names list, adding those names that are identical in both 'agent' and 'compare' columns
names_unique = []
for i in names.index:
if names['agent'][i] == names['compare'][i] and names['agent'][i] not in names_unique:
names_unique.append(names['agent'][i])
# create dictory to store output data, keys: unique names, values: list of all possible names
output_data = {}
for n in names_unique:
output_data[n] = [n]
# for each row of the input datafram
for i in names.index:
# for each key/values of the output data dictonary
for key, values in output_data.items():
# check if one between 'agent' or 'compare' is already among the possible names list of the unique name represented by the key selected by the for loop
# and check if the other column isn't already among the possible names
if names['agent'][i] in values and names['compare'][i] not in values:
# add the name to the possible names list
output_data[key].append(names['compare'][i])
if names['compare'][i] in values and names['agent'][i] not in values:
# add the name to the possible names list
output_data[key].append(names['agent'][i])
# organise your data to easily create your output dataframe
df_structure = {'agent': [], 'compare':[]}
for k in output_data.keys():
for v in output_data[k]:
df_structure['agent'].append(k)
df_structure['compare'].append(v)
# create output dataframe
output_df = pd.DataFrame(df_structure)
output_df
The core of its functioning is in the for loop used to iterate over all the rows of the input database.
At each row it checks if either the name in the column 'agent' or the one in 'compare' is already inside one of the lists of all possible names (the values of the dictionary output_data).
If one of them is already inside, it means that the name in the other column is another possible name. Thus, it checks if the name in the other column is already in the list of possible names, too. If not, it will be added.
In this way the lists of possible names get updated after each row of the input database called by the for loop.
EDIT:
Alternative code that fixes the problem of wrong unique names in the case proposed by the author of the question in the comments:
import pandas as pd
data = {'agent':['Tom Kim', 'Tom Kim', 'Kim S Tom', 'Kim S Tom','Catherine W','Catherine W','Tom Kim'], 'compare':['Tom Kim','Kim S Tom','Tommy Kim','Kim S Tom','Wood Catherine','Catherine W','Kim Sum Tom']}
names = pd.DataFrame(data)
names.agent = pd.Categorical(names.agent)
names['code'] = names.agent.cat.codes
#create a list that will contain the groups of possible names
list_name_groups = []
#iterate over each row of the datafram "names"
for i in names.index:
# create variables to make the code more readable
agent = names['agent'][i]
compare = names['compare'][i]
# variable to check if the names in this row are in any group, starts with 0 (== not in any group)
is_in_any_group = 0
# iterate over each group of names already added to the list
for group in list_name_groups:
# if agent in the group, compare is added to the list of alternative names
if (agent in group) and (compare not in group):
group.append(compare)
is_in_any_group = 1 # an existing group is found
break
# if compare in the group, agent is added to the list of alternative names
if (compare in group) and (agent not in group):
group.append(agent)
is_in_any_group = 1 # an existing group is found
break
# in case both agent and compare are already in the group
if (compare in group) and (agent in group):
is_in_any_group = 1 # an existing group is found
break
# if neither agent nor compare are in the group
if is_in_any_group == 0:
if agent == compare: # avoid duplicates
list_name_groups.append([agent])
else: # the two names are different and both not in the group
list_name_groups.append([agent, compare])
# define the structure of the output dataframe
df_structure = {'agent': [], 'compare':[]}
for group in list_name_groups:
for name in group:
df_structure['agent'].append(group[0]) # the repeated name in the "agent" column is the first of the group
df_structure['compare'].append(name)
# create output dataframe
output_df = pd.DataFrame(df_structure)
output_df
This one works differently, it uses a list of lists instead of a dictionary.
To pick the name that will be repeated in the column "agent" of the output dataframe, the code selects the first of the list of possible alternative names.
I'm not sure if you need to follow a particular criterion ... in that case you may want to modify the final part of the code, where the structure of the output dataframe is defined!