transform pandas dataframe index to be based on dataframe values, while changing values to booleans

Question

I have two datasets of company names with naming variations between the two (ex: Merck & Co. vs Merck & Co vs Merck & Co., Inc.)

I am trying to create a "map" where all the unique variants of "Merck" match to "Merck & Co."

My current approach is to create a dictionary, convert to a dataframe, and ideally perform this transformation described below:

    Company 1        Company 2         Company 3
0   Co1 variant 0    Co1 variant 0     Co3 variant 0
1   Co1 variant 1    Co2 variant 1     
2   Co1 variant 2

Should become:

                  Company 1       Company 2         Company 3
Co1 variant 0     True            True              False
Co1 variant 1     True            False             False
Co1 variant 2     True            False             False
Co2 variant 0     False           True              False
Co2 variant 1     False           True              False
Co3 variant 0     False           True              True

From here if each row only has 1 True value in the row, I would match this naming variant to the column header, otherwise I would match it to itself; allowing the final data set to look like

    Variants        Matched Name
0   Co1 variant 0   Co1 variant 0
1   Co1 variant 1   Company 1
2   Co1 variant 2   Company 1
3   Co2 variant 1   Company 2
4   Co3 variant 0   Company 3

Thank you in advance for any input, I'm also happy to clarify further if the question is not clear enough.

The last one provides the most value to me. If there is another approach to get to the final result I am also open to such recommendations. — wingsoficarus116
– wingsoficarus116, Commented Jan 13, 2017 at 19:10

akuiper · Accepted Answer · 2017-01-13 19:24:48Z

1

You can stack the original data frame and then group by the row id number and the variants, check how many rows you have for each group, if there are only one row, select the company otherwise select the variants itself:

(df.stack().rename("Variants").rename_axis(("RowId", "Company")).reset_index()
 .groupby(["RowId", "Variants"])
 .apply(lambda g: g.Company.iloc[0] if g.Company.size == 1 else g.Variants.iloc[0])
 .rename("Matched Names").reset_index().drop("RowId", axis = 1))

answered Jan 13, 2017 at 19:24

akuiper

216k33 gold badges362 silver badges379 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

transform pandas dataframe index to be based on dataframe values, while changing values to booleans

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related