1

Some background: MB column only consists of 1 of 2 values (M or B) while the Indent column contains int. The numbers don't necessarily follow a pattern but if it does increment, it will increment by one. The numbers can decrement by any amount. The rows are sorted in a specific order.

The goal here is to drop rows with Indent values higher than the indent value of a row that contains a "B" value in the MB column. This should only stop once the indent value is equal to or less than the row that contains the "B" value. Below is a chart demonstrating what rows should be dropped.

Example

Sample data:

import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

Code:

My current code has issues where I cant drop the rows of the inner for loop since it isn't using iterrows. I am aware of dropping based on a conditional expression but I am unsure how to nest this correctly.

for index, row in df.iterrows():
    for row in range(index-1,0,-1):
        if df.loc[row].at["INDENT"] <= df.loc[index].at["INDENT"]-1:
            if df.loc[row].at["MB"]=="B":
                df.drop(df.index[index], inplace=True)
                break
            else:
                break

Edit 1: This problem can be represented graphically. This is effectively scanning a hierarchy for an attribute and deleting anything below it. The example I provided is bad since all rows that need to be dropped are simply indent 3 or higher but this can happen at any indent level. Hierarchy

Edit 2: We are going to cheat on this problem a bit. I won't have to generate an edge graph from scratch since I have the prerequisite data to do this. I have an updated table and sample data. Updated Table

Updated Sample Data

import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
    }
df = pd.DataFrame(d)

Updated Code

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
    }
df = pd.DataFrame(d)

G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'a', 'c', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source=-1).reverse()
print([x for x in T])
nx.draw(G, with_labels=True)
plt.show()

I am unsure how to use the edges from here to identify the rows that need to be dropped from the dataframe

6
  • why index 4 is dropped but not 3? and please provide copy pastable sample data, e.g. df.to_dict() Commented Sep 7, 2021 at 22:23
  • The iteration would be once you get to index 1, you get a B. Going to index 2, there is an indent of 1 which matches the indent value of the "B" on index 1. Thus there isn't anything we would drop as a result of index 1. When We get to index 3, we get to another "B". This time, there is an indent value that is less than the indent at index 3. {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}} Commented Sep 7, 2021 at 23:03
  • thanks for the dictionary, but the B at index 3 is greater than the B at index 1 as well, why isn't that dropped? Because there is an M in between? Commented Sep 7, 2021 at 23:21
  • The indent at index 2 is less than or equal to the indent at index 1. This "resets" the conditional statement looking for indent values that are greater than the row with an MB of "B". This is why I need break statements in the inner for loop. Commented Sep 7, 2021 at 23:58
  • Thank you for elaborating and adding a diagramm, I still think there is a problem, logic wise. Its unforunately not clear why index 4 has to be removed. Because index 3 increases by one over index 1, so that is fine, but index 4 also increases by 1 over index 3 but this is not fine? Commented Sep 8, 2021 at 14:45

2 Answers 2

1

Not a answer, but to long for a comment:

import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

df['i'] = df['INDENT']+1
df = df.reset_index()
df = df.merge(df[['INDENT', 'index', 'MB']].rename(columns={'INDENT':'target', 'index':'ix', 'MB': 'MBt'}), left_on=['i'], right_on=['target'], how='left')

import networkx as nx
G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'index', 'ix', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source='0').reverse()
print([x for x in T])
nx.draw(G, with_labels=True)

This demonstrates the problem. You actually want to apply graph theory, the library networkx can help you with that. A first step would be to first construct the connection between each node, like I did in the example above. From there you can try to apply a logic to filter edges you don't want.

Sign up to request clarification or add additional context in comments.

1 Comment

This is good information. I need to do a bunch of research on this since it appears that I could generate an edge list with the given data. In the mean time, I lucked out and was provided with an edge list. Even given an edge list, I don't see a way to identify the rows that need to be dropped.
0

Not sure I fully understand the question you're asking however this is my attempt. It selects only if index > indent and mb ==B. In this case I'm just selecting the subset we want instead of dropping the subset we don't

import numpy as np
import pandas as pd
x=np.transpose(np.array([[0,1,2,3,4,5,6,7,8],[0,1,1,2,3,3,4,2,3],['M','B','M','B','B','M','M','B','M']]))
df=pd.DataFrame(x,columns=['Index','indent','MB'])

df1=df[(df['Index']>=df['indent']) & (df['MB']=='B')]

print(df1)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.