I'm trying to migrate data between two of our systems. The source system has descriptions split across 5 columns, while the destination system has only one, so I need to merge these 5 columns into a single column while removing any duplicates.

Here is what I have so far, which works, but is there a way to make it faster? Right now it takes a rather long time to iterate over the 13,000 records I'm working on. (Once I add in more data from our other systems in the future, the data could easily reach 30,000 records, so every second counts.)

import numpy as np
import pandas as pd

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]
df[columns] = df[columns].replace(np.nan, "")
description_col = []
for i, r in df.iterrows():
    descriptions = []
    for col in columns:
        if r[col] not in descriptions:
            descriptions.append(r[col])
    description = ""
    for d in descriptions:
        description += "\n" + d
    description = description.strip()
    description_col.append(description)
df["Description"] = description_col

So I guess my question really boils down to: is there a better way to do this?

Edit: To clarify, I have to make sure the data is maintained in both systems; however, the order of the records is not important so long as the data for each record is kept together.

Also, the order in which the description columns are merged does not matter, since most records won't have data in more than 3 of them at a time. (Most have data in exactly 1; however, quite a few have data in 2 or 3 of the columns.)

Edit 2: As requested here is some sample data:

columns = [
    "Item.Name",
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
    "Other Data"
]
df = pd.DataFrame([
    ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
    ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
    ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
    ["Name", "", "And here.", "", "", "And here.", "Other Data"]
], columns=columns)
5 Comments
  • Does the order of each row need to be preserved? I.e., does Asset Description always have to come before Item Description, etc., in the newly created column? Commented Mar 25, 2021 at 16:03
  • Can you show us a small sample df, including before (split columns) and after (merged cols)? And please explain why you don't just do df[newcol] = df[col1] + "\n" + df[col2] + etc., which would create a 'newcol' with all the strings concatenated and "\n" in between? Is that because you want to drop duplicates? Commented Mar 25, 2021 at 16:20
  • @BrandenCiranni I edited my question to address your question. Commented Mar 25, 2021 at 16:21
  • @Stryder Yes, the main reason I don't do that is because I need to remove duplicates. I'll see if I can work up a sample of the data. Commented Mar 25, 2021 at 16:23
  • @tansonnhut thanks! Commented Mar 25, 2021 at 16:28

1 Answer

You can use pandas.unique(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

Here is my implementation. I replaced your \n character with a space so the table would print without large gaps; just swap it back in your code.

import pandas as pd
import numpy as np

columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description"
]

rows = [
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['little', 'lamb', 'little', 'lamb', np.nan],
    ['mary', 'had', 'a', 'little', 'lamb'],
    ['whose', 'fleece', 'was', 'white', 'as'],
    ['snow', np.nan, np.nan, np.nan, np.nan]
]

df = pd.DataFrame(data=rows, columns=columns).fillna('')

def merge_row(row):
    return ' '.join(pd.unique(row)).strip()

df['Description'] = list(map(merge_row, df.loc[:,columns].values))

Output:

   Item.Asset Description  Item.Fixed Asset Sales Description  Item.Item Description  Item.Purchase Description  Item.Sales Description  Description
0  mary                    had                                 a                      little                     lamb                    mary had a little lamb
1  little                  lamb                                little                 lamb                                               little lamb
2  mary                    had                                 a                      little                     lamb                    mary had a little lamb
3  whose                   fleece                              was                    white                      as                      whose fleece was white as
4  snow                                                                                                                                  snow
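
One thing to watch for when switching the separator back to "\n": pd.unique() keeps the empty string as a value of its own, so a row with a blank column between two filled ones ends up with a doubled separator in the middle (strip() only trims the ends). The loop in the question behaves the same way. Below is a minimal sketch of a variant that skips blanks, run on the sample data from the question. It swaps pd.unique() for dict.fromkeys() as the order-preserving de-duplication step. The names desc_columns and merge_descriptions are made up for illustration.

```python
import pandas as pd

desc_columns = [
    "Item.Asset Description",
    "Item.Fixed Asset Sales Description",
    "Item.Item Description",
    "Item.Purchase Description",
    "Item.Sales Description",
]

df = pd.DataFrame(
    [
        ["Name", "There is some text here.", "", "Some more here.", "", "", "Other Data"],
        ["Name", "", "Some over here.", "Some here as well.", "", "", "Other Data"],
        ["Name", "Some here.", "", "", "Some here.", "And some here.", "Other Data"],
        ["Name", "", "And here.", "", "", "And here.", "Other Data"],
    ],
    columns=["Item.Name", *desc_columns, "Other Data"],
)


def merge_descriptions(row, sep="\n"):
    # Hypothetical helper: dict.fromkeys() drops duplicates while keeping
    # first-seen order; empty strings are filtered out before joining.
    return sep.join(dict.fromkeys(value for value in row if value))


# Fill NaN on a copy of the description columns only, then work on plain lists.
values = df.loc[:, desc_columns].fillna("").to_numpy().tolist()
df["Description"] = [merge_descriptions(row) for row in values]
```

Because blanks are filtered before the join, no strip() is needed, and the other columns of df are left untouched.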

7 Comments

I could do that; however, that would require me to pull the data out of the original DataFrame so that I don't disturb any of the other data in the data set, and then merge it back in afterwards. Overall I have 40 - 50 columns that I have to filter from one system to another. Pandas also gives me a warning when running your code, something about "A value is trying to be set on a copy of a slice from a DataFrame." It may have something to do with me extracting a subset of columns, running your code, and then trying to put the result back into the main DataFrame.
Hi @tansonnhut, I edited my implementation so it only computes on the subset of columns, and I may have fixed the warning you are getting about the copy of a slice. I don't get it when running the code myself, but hopefully this works for you.
I edited my code so you don't need to copy the data out: instead of applying the method to df, we just apply it to df[columns]. Glad it's running faster. Prefer map() over the underlying values to iterrows() loops, especially with pandas DataFrames!
The warning was caused by my use of chained indexing; I switched to df.loc[:, columns] and it worked fine. I tried your new code, but it doesn't work because np.nan values are introduced when the copy is made. I fixed it by doing the following: df["Description"] = list(map(lambda row: "\n".join(pd.unique(row)).strip(), df.loc[:,columns].replace(np.nan, "").values))
Okay, I'll edit my answer. Shouldn't your NaNs already be filled before that, though?
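
For anyone hitting the same warning, here is a minimal sketch of the pattern discussed in these comments, on made-up data (the column names "A", "B" and "Other" are placeholders, not the real schema). It assumes the NaN values are filled on the selected subset only and the result is assigned as a new column on the original DataFrame, so nothing has to be extracted and merged back.

```python
import numpy as np
import pandas as pd

columns = ["A", "B"]  # placeholder names for the description columns
df = pd.DataFrame(
    {"A": ["x", np.nan, "y"], "B": ["x", "z", np.nan], "Other": [1, 2, 3]}
)

# Selecting with .loc and a list of columns, then calling fillna(), yields a
# separate object, so the NaN values in df itself are left as they were.
subset = df.loc[:, columns].fillna("")

# Assigning a brand-new column on df directly avoids chained indexing.
df["Description"] = [
    "\n".join(pd.unique(row)).strip() for row in subset.to_numpy()
]
```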