Python Pandas - merge rows if some values are blank

Question

I have a dataset that looks a little like this:

ID   Name            Address      Zip    Cost
1    Bob the Builder 123 Main St  12345  
1    Bob the Builder                     $99,999.99
2    Bob the Builder 123 Sub St   54321  $74,483.01
3    Nigerian Prince Area 51      33333  $999,999.99
3    Pinhead Larry   Las Vegas    31333  $11.00
4    Fox Mulder      Area 51             $0.99

where missing data is okay, unless it's obvious that they can be merged. What I mean by that is instead of the dataset above, I want to merge the rows where both the ID and Name are the same, and the other features can fill in each other's blanks. For example, the dataset above would become:

ID   Name            Address      Zip    Cost
1    Bob the Builder 123 Main St  12345  $99,999.99
2    Bob the Builder 123 Sub St   54321  $74,483.01
3    Nigerian Prince Area 51      33333  $999,999.99
3    Pinhead Larry   Las Vegas    31333  $11.00
4    Fox Mulder      Area 51             $0.99

I've thought about using df.groupby(["ID", "Name"]) and then concatenating the strings, since the missing values are empty strings, but had no luck with it.

The data has been scraped off websites, so they've had to go through a lot of cleaning to end up here. I can't think of an elegant way of figuring this out!

piRSquared · Accepted Answer · 2016-11-22 06:03:37Z

4

This only works if the rows we are potentially merging are next to each other, because the forward fill and back fill below are not aware of ID or Name.

setup

import numpy as np
import pandas as pd

# Sample data from the question; blanks are empty strings
df = pd.DataFrame(dict(
        ID=[1, 1, 2, 3, 3, 4],
        Name=['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
        Address=['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
        Zip=['12345', '', '54321', '33333', '31333', ''],
        Cost=['', '$99,999.99', '$74,483.01', '$999,999.99', '$11.00', '$0.99']
    ))[['ID', 'Name', 'Address', 'Zip', 'Cost']]

fill up missing
replace('', np.nan) then forward fill then back fill

df_ = df.replace('', np.nan).ffill().bfill()

concat
take the last row of the filled-up df_ if it's a duplicate row
take the original, unfilled row from df if it's not duplicated

pd.concat([
        df_[df_.duplicated()],
        df.loc[df_.drop_duplicates(keep=False).index]
    ])
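
If the rows to merge are not guaranteed to be adjacent, a group-aware variant may be more robust. The following is only a sketch and not part of the original answer: it assumes blanks are empty strings and that rows sharing both ID and Name never hold conflicting values (if they do, the first non-blank value silently wins). It relies on GroupBy.first(), which returns the first non-null value of each column within a group; the name merged is illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as in the setup above
df = pd.DataFrame({
    'ID': [1, 1, 2, 3, 3, 4],
    'Name': ['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
    'Address': ['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
    'Zip': ['12345', '', '54321', '33333', '31333', ''],
    'Cost': ['', '$99,999.99', '$74,483.01', '$999,999.99', '$11.00', '$0.99'],
})

merged = (
    df.replace('', np.nan)                                # treat blanks as missing
      .groupby(['ID', 'Name'], as_index=False, sort=False)
      .first()                                            # first non-null value per column
      .fillna('')                                         # restore blanks that no row could fill
)

print(merged)
```

Because the fill happens inside each (ID, Name) group, a blank such as Fox Mulder's Zip is never filled from a neighbouring row that belongs to someone else.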


John Zwinck · Accepted Answer · 2016-11-22 03:56:35Z

0

I'll describe an algorithm:

Put aside all the rows where all fields are populated. We don't need to touch these.
Create a boolean DataFrame like the input where empty fields are False and populated fields are True. If the blanks have first been converted to NaN, this is df.notnull().
For each name in df.Name.unique():
1. Take df[df.Name == name] as the working set.
2. Sum each pair (or tuple) of boolean rows, resulting in an integer vector the same width as the input columns, excluding those which are always populated. In the example this means [True, True, False] and [False, False, True], so the sum is [1, 1, 1].
3. If the sum is equal to 1 everywhere, that pair (or tuple) of rows can be merged.

But there are a ton of possible edge cases here, such as what to do if you have three rows A,B,C and you could merge either A+B or A+C. It will help if you can narrow down the constraints that exist in the data before implementing the merging algorithm.
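
One way the algorithm described above might be sketched, under the constraint that only pairs of rows need merging. This is an illustration rather than a definitive implementation: grouping by both ID and Name (instead of Name alone) follows the question, and the names key_cols, value_cols and merge_complementary_pairs are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 3, 3, 4],
    'Name': ['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
    'Address': ['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
    'Zip': ['12345', '', '54321', '33333', '31333', ''],
    'Cost': ['', '$99,999.99', '$74,483.01', '$999,999.99', '$11.00', '$0.99'],
})

key_cols = ['ID', 'Name']                # assumed to be always populated
value_cols = ['Address', 'Zip', 'Cost']  # columns that may contain blanks


def merge_complementary_pairs(frame):
    """Merge two-row groups whose populated fields do not overlap."""
    blanks_as_nan = frame.replace('', np.nan)
    pieces = []
    for _, group in blanks_as_nan.groupby(key_cols, sort=False):
        populated = group[value_cols].notnull()
        # Column-wise sum of the boolean rows; 1 everywhere means each
        # field is populated in exactly one of the two rows.
        if len(group) == 2 and (populated.sum() == 1).all():
            pieces.append(group.bfill().iloc[[0]])  # first row, gaps filled from the second
        else:
            pieces.append(group)                    # leave everything else untouched
    return pd.concat(pieces, ignore_index=True).fillna('')


result = merge_complementary_pairs(df)
print(result)
```

Groups of three or more rows, and pairs whose populated fields overlap, are passed through unchanged here, so the ambiguous cases mentioned above still need a rule of their own.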


1 Comment

room_temperature Over a year ago

Thanks a lot! I noticed this pattern as well when I did a groupby() on the dataframe, but didn't really know what to do with it. And I should've clarified on the constraints - the edge cases were already taken care of so it was only sets of 2 rows with duplicates like this.
