
I'm struggling to group a dataframe based on matches between its values. Say:

print(crosstabsdf1)
Index  Area     Area_2
0      188        181
1      190        188
2      192        190
3      115        110
4      138        121
...    ...        ...
2510   173        174
2511   177        178
2512   174        175
2513   176        177
2604   181        182

[361 rows x 2 columns] 

When I look for the matches of a value, for instance:

crosstabsdf1[crosstabsdf1['Area']==181]

Index  Area  Area_2
9     181       175
260   181       182

crosstabsdf1[crosstabsdf1['Area_2']==181]

Index   Area   Area_2
0       188     181
157     180     181

So, I would like to group all of the matches between each pair (by a match I mean that, when I have a row:

Area Area_2
181  175
181  182
188  181
180  181

it means that the areas 181 and 175, 181 and 182, and so on, are adjacent).

So, is there a pandas way (or maybe a built-in function) to group each Area and display one row per occurrence of adjacency with another area, like this:

Index    Area       Area_2
0        181          175
1        181          180
2        181          182
3        181          188

Thank you

1 Answer


Based on the example you provide, you could try this:

import pandas as pd


def match(df, col, other_col, value):
    """Find rows matching a given value.

    Args:
        df (pd.DataFrame): target dataframe
        col (str): label of the first column
        other_col (str): label of the second column
        value (int): target value

    Returns:
        pd.DataFrame: rows matching value
    """
    # Find value in both columns
    area = df.loc[(df[col] == value), other_col]
    area_2 = df.loc[(df[other_col] == value), col]
    
    # Concat rows, add new column and return new df with sorted columns
    new_df = pd.DataFrame(pd.concat([area, area_2]), columns=[other_col])
    new_df.loc[:, col] = value

    return new_df.reindex(sorted(new_df.columns), axis=1)


df = pd.DataFrame(
    {
        "Area": [181, 181, 188, 180, 173, 176, 138],
        "Area_2": [175, 182, 181, 181, 174, 177, 121],
    }
)

print(match(df, "Area", "Area_2", 181))
# Outputs
   Area  Area_2
0   181     175
1   181     182
2   181     188
3   181     180

Now, to apply this on the whole dataframe, you could go on like this:

# Put all intermediate dataframes in a list by applying "match"
# to existing and unique values of "Area" column
dfs = [match(df, "Area", "Area_2", e) for e in df["Area"].unique()]

# Concatenate all intermediate dataframes
new_df = pd.concat(dfs)

# Clean up
new_df = new_df.sort_values(by=["Area", "Area_2"]).reset_index(drop=True)

print(new_df)
# Outputs
   Area  Area_2
0   138     121
1   173     174
2   176     177
3   180     181
4   181     175
5   181     180
6   181     182
7   181     188
8   188     181
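As an alternative to looping over unique values, the same kind of result can be built without the `match` function by concatenating the dataframe with a column-swapped copy of itself. Note this is a sketch of a different variant: it also lists areas that only ever appear in `Area_2` (e.g. 175 or 182), so for the sample data it yields 14 rows instead of 9.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Area": [181, 181, 188, 180, 173, 176, 138],
        "Area_2": [175, 182, 181, 181, 174, 177, 121],
    }
)

# Swap the two columns so every adjacency is listed from both sides,
# then stack the original and swapped pairs and sort.
swapped = df.rename(columns={"Area": "Area_2", "Area_2": "Area"})
new_df = (
    pd.concat([df, swapped])[["Area", "Area_2"]]
    .sort_values(by=["Area", "Area_2"])
    .reset_index(drop=True)
)
print(new_df)
```

Since every pair appears in both orientations, each area's neighbors can then be read off with a plain `new_df[new_df["Area"] == 181]`.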

3 Comments

Thank you. If I wanted a df with all values, then I would apply a loop of that function and then concatenate those results, right?
Edit: I've tried that and it can't be done (for e in range(1,200): match(crosszonesdf, "Zone", "ZoneMatch", e).drop_duplicates()). It seems that the dataframe doesn't support the indexing based on the 'e' in the loop.
Works fine, as long as you iterate on existing values, see my updated answer.
