
I'm struggling to group a dataframe based on matches between its values. Say:

print(crosstabsdf1)
Index  Area     Area_2
0      188        181
1      190        188
2      192        190
3      115        110
4      138        121
...    ...        ...
2510   173        174
2511   177        178
2512   174        175
2513   176        177
2604   181        182

[361 rows x 2 columns] 

When I look for the matches of a value, for instance:

crosstabsdf1[crosstabsdf1['Area']==181]

Index  Area  Area_2
9     181       175
260   181       182

crosstabsdf1[crosstabsdf1['Area_2']==181]

Index   Area   Area_2
0       188     181
157     180     181

So, I would like to group all of the matches between each pair (by a match I mean that, when I have a row:

Area Area_2
181  175
181  182
188  181
180  181

it means that the areas 181 and 175, 181 and 182, and so on, are adjacent).

So, is there a pandas way (or maybe a built-in function) to group each Area and display one row per occurrence of adjacency with another area, like this:

Index    Area       Area_2
0        181          175
1        181          180
2        181          182
3        181          188

Thank you

1 Answer


Based on the example you provide, you could try this:

import pandas as pd


def match(df, col, other_col, value):
    """Find rows matching a given value.

    Args:
        df (pd.DataFrame): target dataframe
        col (str): label of the first column
        other_col (str): label of the second column
        value (int): target value

    Returns:
        pd.DataFrame: rows matching value
    """
    # Find value in both columns
    area = df.loc[(df[col] == value), other_col]
    area_2 = df.loc[(df[other_col] == value), col]
    
    # Concat rows, add new column and return new df with sorted columns
    new_df = pd.DataFrame(pd.concat([area, area_2]), columns=[other_col])
    new_df.loc[:, col] = value

    return new_df.reindex(sorted(new_df.columns), axis=1)


df = pd.DataFrame(
    {
        "Area": [181, 181, 188, 180, 173, 176, 138],
        "Area_2": [175, 182, 181, 181, 174, 177, 121],
    }
)

print(match(df, "Area", "Area_2", 181))
# Outputs
   Area  Area_2
0   181     175
1   181     182
2   181     188
3   181     180

Now, to apply this on the whole dataframe, you could go on like this:

# Put all intermediate dataframes in a list by applying "match"
# to existing and unique values of "Area" column
dfs = [match(df, "Area", "Area_2", e) for e in df["Area"].unique()]

# Concatenate all intermediate dataframes
new_df = pd.concat(dfs)

# Clean up
new_df = new_df.sort_values(by=["Area", "Area_2"]).reset_index(drop=True)

print(new_df)
# Outputs
   Area  Area_2
0   138     121
1   173     174
2   176     177
3   180     181
4   181     175
5   181     180
6   181     182
7   181     188
8   188     181
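As an alternative to looping over unique values, the same kind of result can be built without the `match` function by concatenating the dataframe with a column-swapped copy of itself. Note this is a sketch of a different variant: it also lists areas that only ever appear in `Area_2` (e.g. 175 or 182), so for the sample data it yields 14 rows instead of 9.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Area": [181, 181, 188, 180, 173, 176, 138],
        "Area_2": [175, 182, 181, 181, 174, 177, 121],
    }
)

# Swap the two columns so every adjacency is listed from both sides,
# then stack the original and swapped pairs and sort.
swapped = df.rename(columns={"Area": "Area_2", "Area_2": "Area"})
new_df = (
    pd.concat([df, swapped])[["Area", "Area_2"]]
    .sort_values(by=["Area", "Area_2"])
    .reset_index(drop=True)
)
print(new_df)
```

Since every pair appears in both orientations, each area's neighbors can then be read off with a plain `new_df[new_df["Area"] == 181]`.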

3 Comments

Thank you. If I wanted a df with all values, then I would apply a loop of that function and then concatenate those results, right?
Edit: I've tried that and it can't be done (for e in range(1,200): match(crosszonesdf, "Zone", "ZoneMatch", e).drop_duplicates()). It seems that the dataframe doesn't support the indexing based on the 'e' in the loop.
Works fine, as long as you iterate on existing values, see my updated answer.
