Merging several string columns with possible duplicates in a pandas dataframe

Question

I'm trying to migrate data between two of our systems and one system has descriptions split across multiple columns while the destination system only has one column. So I need to merge these 5 columns into a single column while removing possible duplicates.

Here is what I have so far, which works, but is there a way to make it faster? Right now it takes a rather long time to iterate over the 13,000 records I'm working on. (Once I add more data from our other systems in the future, the data could easily reach 30,000 records, so every second counts.)

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "\n" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col

So I guess my question really boils down to, is there a better way to do this?

Edit: To clarify, I have to make sure the data is maintained in both systems; however, the order of the records is not important as long as the data for each record is kept together.

Also, the order of merging the description columns does not matter, since most records aren't going to have data in more than 3 of them at a time. (Most have data in exactly 1; however, quite a few have data in 2 or 3 of the columns.)

Edit 2: As requested here is some sample data:

columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)

Does the order of each row need to be preserved? i.e. Does Asset Description always have to come before the Item Description, etc. in the newly created column?
– Branden Ciranni, Commented Mar 25, 2021 at 16:03
Can you show us a small sample df, including before (split columns) and after (merged cols)? And please explain why you don't just do df[newcol] = df[col1] + "\n" + df[col2] + etc..., which would create a 'newcol' with all the strings concatenated and "\n" in between..? is that because you want to drop duplicates?
– Stryder, Commented Mar 25, 2021 at 16:20
@BrandenCiranni I edited my question to address your question.
– tansonnhut, Commented Mar 25, 2021 at 16:21
@Stryder Yes, the main reason I don't do that is because I need to remove duplicates. I'll see if I can work up a sample of the data.
– tansonnhut, Commented Mar 25, 2021 at 16:23

Branden Ciranni · Accepted Answer · 2021-03-25 17:25:30Z

You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
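
As a small illustration (not part of the original answer), pd.unique returns values in order of first appearance, which is why it can drop repeats without scrambling the text. The sample values below are made up for the example, and an object-dtype NumPy array is assumed as input because that is what a row of df.values looks like for string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical row of description strings with repeats and one blank.
values = np.array(["lamb", "little", "lamb", "", "little"], dtype=object)

# pd.unique keeps the first occurrence of each value, in order of appearance.
unique_values = list(pd.unique(values))
print(unique_values)  # ['lamb', 'little', '']
```

Note that the empty string counts as a value of its own, so blanks are collapsed to one blank rather than removed.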

Here is my implementation. I replaced your \n separator with a space so the table would print without large gaps; just swap it back in your own code.

import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]

rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]

df = pd.DataFrame(data=rows, columns=columns).fillna('')

def merge_row(row):
    return ' '.join(pd.unique(row)).strip()

df['Description'] = list(map(merge_row, df.loc[:,columns].values))

	Item.Asset Description	Item.Fixed Asset Sales Description	Item.Item Description	Item.Purchase Description	Item.Sales Description	Description
0	mary	had	a	little	lamb	mary had a little lamb
1	little	lamb	little	lamb		little lamb
2	mary	had	a	little	lamb	mary had a little lamb
3	whose	fleece	was	white	as	whose fleece was white as
4	snow					snow
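
One caveat, offered as an aside rather than as part of the answer: strip() only trims the ends of the joined string, so a blank cell sitting between two filled cells leaves a doubled separator (for example, two spaces or two newlines in a row). The sketch below assumes blanks should be dropped entirely and uses rows from the question's sample data. merge_unique is a hypothetical helper name, and dict.fromkeys is swapped in for pd.unique as an order-preserving way to drop duplicates.

```python
import pandas as pd

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
]

# Description columns from the question's sample data.
df = pd.DataFrame([
    ["There is some text here.", "", "Some more here.", "", ""],
    ["", "Some over here.", "Some here as well.", "", ""],
    ["Some here.", "", "", "Some here.", "And some here."],
    ["", "And here.", "", "", "And here."],
], columns=columns)


def merge_unique(row, sep="\n"):
    """Join the non-empty values of a row, keeping first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order;
    # the generator skips empty strings so no doubled separators appear.
    return sep.join(dict.fromkeys(value for value in row if value))


# fillna("") guards against NaN, which would otherwise pass the truthiness check.
df["Description"] = [merge_unique(row) for row in df[columns].fillna("").to_numpy()]
```

With this variant the first sample row becomes "There is some text here.\nSome more here." with a single newline between the two parts.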

edited Mar 25, 2021 at 17:25

answered Mar 25, 2021 at 16:24

7 Comments

tansonnhut Over a year ago

I could do that, however that would require me to remove the data from the original DataFrame so that I don't mess with any of the other data in the data set. And then I would have to merge the data back in after. Overall I have 40 - 50 columns that I have to filter from one system to another. Pandas also gives me an error when running your code, something about "A value is trying to be set on a copy of a slice from a DataFrame." It may have something to do with me extracting a subset of columns, running your code, and then trying to put it back into the main DataFrame.

Branden Ciranni Over a year ago

Hi @tansonnhut, I edited my implementation so it will only compute based on the subset of columns, and I may have fixed the warning you are getting about the copy of a slice - I do not get this error when running the code, but hopefully this works for you.

Branden Ciranni Over a year ago

I edited my code so you don't need to copy the data out ^^ Instead of applying the method to df, we just apply it to df[columns]. Glad it's running faster. As a rule, prefer map() over the underlying array, or vectorized operations, to iterrows() loops, especially with pandas DataFrames!

tansonnhut Over a year ago

The error was caused by my use of chained indexing; I switched to df.loc[:, columns] and it worked fine. I tried your new code and it doesn't work because np.nan values are introduced when the copy is done. I fixed it by doing the following:

df["Description"] = list(map(lambda row: "\n".join(pd.unique(row)).strip(), df.loc[:,columns].replace(np.nan, "").values))

Branden Ciranni Over a year ago

Okay, I'll edit my answer - your na's should be filled before that though?
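
To tie the comment thread together, here is a minimal sketch of merging on a column subset without disturbing the rest of the DataFrame. The frame below is a cut-down, made-up version of the question's sample with one NaN added, description_columns and subset are hypothetical names, and fillna("") is used here as an assumed stand-in for the replace(np.nan, "") call in the comment above.

```python
import numpy as np
import pandas as pd

description_columns = [
    "Item.Asset Description",
    "Item.Purchase Description",
    "Item.Sales Description",
]

df = pd.DataFrame({
    "Item.Name": ["A", "B"],
    "Item.Asset Description": ["Some here.", np.nan],
    "Item.Purchase Description": ["Some here.", "And here."],
    "Item.Sales Description": ["And some here.", "And here."],
    "Other Data": [1, 2],
})

# .loc selects the subset in one step (no chained indexing), and fillna
# returns a new frame, so the NaN in the original df is left as it was.
subset = df.loc[:, description_columns].fillna("")

# Assign the merged text straight onto the original frame as a new column.
df["Description"] = [
    "\n".join(pd.unique(row)).strip() for row in subset.to_numpy()
]
```

Because the new column is assigned on df itself and the blanks are filled only in the temporary subset, the other 40 to 50 columns never need to be split off and merged back.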
