I am trying to implement the 'Bottom-Up Computation' algorithm in data mining (https://www.aaai.org/Papers/FLAIRS/2003/Flairs03-050.pdf).

I need to use the 'pandas' library to create a dataframe and pass it to a recursive function, which should also return a dataframe as output. So far I can only output the aggregated values of the final column, because I cannot figure out how to build the dataframe dynamically.

Here is the Python program:

import pandas as pd

def project_data(df, d):
    return df.iloc[:, d]

def select_data(df, d, val):
    col_name = df.columns[d]
    return df[df[col_name] == val]

def remove_first_dim(df):
    return df.iloc[:, 1:]

def slice_data_dim0(df, v):
    df_temp = select_data(df, 0, v)
    return remove_first_dim(df_temp)

def buc(df):
    dims = df.shape[1]
    if dims == 1:
        input_sum = sum(project_data(df, 0) )
        print(input_sum)
    else:
        dim_vals = set(project_data(df, 0).values)

        for dim_val in dim_vals:
            sub_data = slice_data_dim0(df, dim_val)
            buc(sub_data)
        sub_data = remove_first_dim(df)
        buc(sub_data)


data = {'A':[1,1,1,1,2],
        'B':[1,1,2,3,1],
        'M':[10,20,30,40,50]
        }
    
df = pd.DataFrame(data, columns = ['A','B','M'])
buc(df)

I get the following output:

30
30
40
100
50
50
80
30
40

But what I need is a dataframe, like this (not necessarily formatted, but a data frame):

    A   B   M
0   1   1   30
1   1   2   30
2   1   3   40
3   1   ALL 100
4   2   1   50
5   2   ALL 50
6   ALL 1   80
7   ALL 2   30
8   ALL 3   40
9   ALL ALL 150

How do I achieve this?

1 Answer

Unfortunately, pandas has no built-in function for subtotals, so the trick is to compute them separately and concatenate them with the original dataframe.

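from itertools import combinations

import numpy as np
import pandas as pd

dim = ['A', 'B']   # dimension columns
vals = ['M']       # measure column(s) to sum

# df is the dataframe built in the question
df = pd.concat(
    [df]
    # subtotals: group by every non-empty proper subset of the dimensions
    + [df.groupby(list(gr), as_index=False)[vals].sum()
       for r in range(len(dim) - 1) for gr in combinations(dim, r + 1)]
    # grand total: a constant key puts every row into a single group
    + [df.groupby(np.zeros(len(df)))[vals].sum()]
    )\
    .sort_values(dim)\
    .reset_index(drop=True)\
    .fillna("ALL")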

Output:

      A    B    M
0     1    1   10
1     1    1   20
2     1    2   30
3     1    3   40
4     1  ALL  100
5     2    1   50
6     2  ALL   50
7   ALL    1   80
8   ALL    2   30
9   ALL    3   40
10  ALL  ALL  150
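
Note that the output above keeps the two original (A=1, B=1) rows separate (M = 10 and 20), whereas the question asks for them to be summed into a single row with M = 30. If the recursive structure from the question has to be preserved, one possible sketch is to let each call return a small dataframe and have the caller put the current dimension value in front of it. This is only an illustration under the assumption that the last column is the measure and all other columns are dimensions; it is not the code from the paper.

```python
import pandas as pd

def buc(df):
    """Recursively build the cube; the last column is assumed to be the measure."""
    first = df.columns[0]
    if df.shape[1] == 1:
        # Only the measure is left: return a one-row frame holding its sum.
        return pd.DataFrame({first: [df[first].sum()]})
    parts = []
    # One branch per distinct value of the first dimension.
    for val in sorted(df[first].unique()):
        part = buc(df[df[first] == val].iloc[:, 1:])
        part.insert(0, first, val)      # put the dimension value back in front
        parts.append(part)
    # The "ALL" branch drops the first dimension without filtering any rows.
    part = buc(df.iloc[:, 1:])
    part.insert(0, first, "ALL")
    parts.append(part)
    return pd.concat(parts, ignore_index=True)

data = {'A': [1, 1, 1, 1, 2],
        'B': [1, 1, 2, 3, 1],
        'M': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data, columns=['A', 'B', 'M'])
result = buc(df)
print(result)
```

Because the dimension columns end up holding both numbers and the string "ALL", they become object columns, while M stays an integer column.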

5 Comments

Thanks a million! Is there a way to do this without using itertools? I am not allowed to import anything other than pandas and numpy (yes, this is a school task).
Yes, but you would have to write your own combinations without repetition function. You can start at the original: docs.python.org/3/library/itertools.html#itertools.combinations
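
As a rough illustration of such a hand-written replacement, here is a small recursive sketch. The name simple_combinations is hypothetical, and the function returns a list rather than a lazy iterator, which is an assumption made for brevity.

```python
def simple_combinations(items, r):
    """Return all r-length combinations of items (no repetition) as tuples."""
    items = list(items)
    if r == 0:
        return [()]          # exactly one way to choose nothing
    result = []
    # Pick each possible first element, then combine it with the
    # (r - 1)-length combinations of the elements that follow it.
    for i in range(len(items) - r + 1):
        for tail in simple_combinations(items[i + 1:], r - 1):
            result.append((items[i],) + tail)
    return result

pairs = simple_combinations(['A', 'B', 'C'], 2)
singles = simple_combinations(['A', 'B'], 1)
```

In the answer's code, combinations(dim, r+1) could then be swapped for simple_combinations(dim, r+1).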
Great, thanks! Will check that out. One last thing: the values under columns A and B in the output appear with a .0, for example 1 appears as 1.0 and 2 appears as 2.0. This happens only for A and B; the output is fine for M. How can I fix this?
I tried using df.astype(int), but it didn't make a difference.
Hm, the problem is that the default integer dtype cannot hold missing values, so when you concat and the subtotal rows have NaN in those columns, the columns are upcast to float. Probably the easiest fix is to map all dim columns to str before you concat: for col in dim: df[col] = df[col].map(str)
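
A minimal sketch of that suggestion, using the data from the question and only a single subtotal level (grouping by A) to keep it short; the variable name cube is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2],
                   'B': [1, 1, 2, 3, 1],
                   'M': [10, 20, 30, 40, 50]})
dim = ['A', 'B']

# Convert the dimension columns to strings BEFORE concatenating, so the
# missing values introduced by the subtotal rows do not force a float dtype.
for col in dim:
    df[col] = df[col].map(str)

# The subtotal frame has no 'B' column, so concat fills it with NaN,
# which is then replaced by "ALL".
cube = pd.concat([df, df.groupby('A', as_index=False)[['M']].sum()]).fillna("ALL")
print(cube)
```

One side effect to keep in mind is that the dimension values are now strings, so sorting them is lexicographic rather than numeric.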
