Manipulating DataFrame with custom dataclass methods

Question

I have upwards of 4000 lines of code that analyze, manipulate, compare and plot two huge .csv files. For readability and future publication, I'd like to convert the code to object-oriented classes. I convert the files to pd.DataFrames:

my_data1 = pd.DataFrame(np.random.randn(100, 9), columns=list('123456789'))
my_data2 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

I have functions that compare various aspects of each of the datasets and functions that only use the datasets individually. I want to convert this structure into a dataclass with methods for each dataframe.

I can't manipulate these dataframes through my class functions. I keep getting NameError: name 'self' is not defined. Here's my dataclass structure:

@dataclass
class Data:
    ser = pd.DataFrame 

    # def __post_init__(self):
    #     self.ser = self.clean()

    def clean(self, ser):
        acceptcols = np.where(ser.loc[0, :] == '2')[0]
        data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
        data = ser.drop(0)
        data = ser.rename(columns={'': 'Time(s)'})
        data = ser.astype(float)
        data = ser.reset_index(drop=True)
        data.columns = [column.replace('1', '')
                        for column in ser.columns]

        return data


my_data1 = pd.DataFrame(np.random.randn(100, 9), columns=list('123456789'))
my_data2 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Attempt 1
new_data1 = Data.clean(my_data1) # Parameter "ser" unfilled 
# Attempt 2
new_data1 = Data.clean(ser=my_data1) # Parameter "self" unfilled 
# Attempt 3
new_data1 = Data.clean(self, my_data1) # Unresolved reference "self"

I have tried various forms of defining def clean(self and other stuff), but I think I just don't understand classes or class structure well enough. Documentation on classes and dataclasses always uses very rudimentary examples, and I've tried cutting and pasting a template to no avail. What am I missing?

Raymond Kwok · Accepted Answer · 2022-01-22 03:27:26Z

You can first create an instance x of the class Data and then call clean on that instance:

x = Data()

# Attempt 1, now called on the instance
new_data1 = x.clean(my_data1)  # x is passed as self, my_data1 as ser
# Attempt 2, now called on the instance
new_data1 = x.clean(ser=my_data1)  # same call, with ser passed by keyword
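As a complementary sketch, here is one way a dataclass could hold the DataFrame itself, so that methods read it from self instead of taking it as an argument. This is an assumption about what the question is aiming for, not part of the original answer: the methods n_rows and column_means are hypothetical, and the field is declared with an annotation (ser: pd.DataFrame) because an unannotated ser = pd.DataFrame only creates a class attribute.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class Data:
    # The annotation is what turns `ser` into a dataclass field that
    # the generated __init__ accepts as an argument.
    ser: pd.DataFrame

    def n_rows(self) -> int:
        # Hypothetical method: `self.ser` is the frame stored on this instance.
        return len(self.ser)

    def column_means(self) -> pd.Series:
        # Hypothetical method: works on the stored frame, so no argument is needed.
        return self.ser.mean()


my_data1 = pd.DataFrame(np.random.randn(100, 9), columns=list('123456789'))

x = Data(my_data1)            # create the instance; the frame becomes x.ser
rows = x.n_rows()             # Python passes `x` as `self` automatically
same_rows = Data.n_rows(x)    # the same call with `self` written out explicitly
means = x.column_means()
```

Calling Data.n_rows() without an instance fails for the same reason the three attempts in the question do: there is no object to bind to self.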

If I were you, I would not use a class this way. I would instead just define the following function

def clean(ser):
    # Logic copied unchanged from the question, only moved out of the class
    acceptcols = np.where(ser.loc[0, :] == '2')[0]
    data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
    data = ser.drop(0)
    data = ser.rename(columns={'': 'Time(s)'})
    data = ser.astype(float)
    data = ser.reset_index(drop=True)
    data.columns = [column.replace('1', '')
                    for column in ser.columns]

    return data

and call it directly.

Also, in your clean(), each modification is applied to ser, the original input, rather than to the result of the previous modification, so the earlier steps are discarded. Is that what you intended?
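To make that point concrete, here is a sketch of clean in which each step operates on the result of the previous one. The small input frame is invented for illustration. It assumes the real data has a flag row at index 0, an unnamed first column, and column names containing '1'; none of that is confirmed by the question.

```python
import numpy as np
import pandas as pd


def clean(ser: pd.DataFrame) -> pd.DataFrame:
    # Positions of the columns whose first row holds the flag '2'.
    acceptcols = np.where(ser.loc[0, :] == '2')[0]
    # Keep the first column plus the flagged columns.
    data = ser.iloc[:, np.insert(acceptcols, 0, 0)]
    # From here on, every step starts from `data`, not from `ser`.
    data = data.drop(0)
    data = data.rename(columns={'': 'Time(s)'})
    data = data.astype(float)
    data = data.reset_index(drop=True)
    data.columns = [column.replace('1', '') for column in data.columns]
    return data


# Made-up input: row 0 is the flag row, the first column has an empty name.
raw = pd.DataFrame(
    [['x', '2', '0', '2'],
     ['0.0', '1.5', '9', '2.5'],
     ['1.0', '3.5', '9', '4.5']],
    columns=['', 'a1', 'b1', 'c1'],
)

cleaned = clean(raw)
```

With this input, cleaned keeps the columns Time(s), a and c as floats, and the flag row is gone. The input frame raw is left unchanged because each step returns a new frame.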


7 Comments

Flynn O'Connell Over a year ago

The last modification is (or should be) a modification on the input, ser; its goal is to replace the string '1' in every column name with an empty string. At first, it was just a function like the one you included above, but the dataset got very complex and I wanted to test functions in classes to try to make everything more readable. Why do you think I shouldn't use classes like this?

Raymond Kwok Over a year ago

#1 Even if you just define a function clean, you can call it as many times as you want, so the code is still readable. A class is better only if it represents an object that has many methods and many attributes. In your case, a function is sufficient.

Raymond Kwok Over a year ago

#2 You have 6 modifications in clean, but only the last 2 will appear in your final outcome. If you want all of the modifications to take effect, change every ser to data EXCEPT for the 1st and the 2nd one.

Raymond Kwok Over a year ago

If I were you, I would still not use a class. A class represents an object, whereas in your case you only want to group your functions; that is the difference. I would choose one of the following 2 ways. #1 Define all functions as just functions, and call them wherever needed. #2 Put the functions specific to my_data1 in one .py file, do the same for my_data2, and put the shared functions into a 3rd .py file.

Raymond Kwok Over a year ago

And finally, I can define a wrapper function for my_data1 that calls all 20 of its functions, and a wrapper function for my_data2 that calls all 15 of its functions.
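The wrapper idea from the comments above could look roughly like the sketch below. The three step functions and the name process_data1 are hypothetical placeholders for the real analysis functions; DataFrame.pipe is used only to pass each result on to the next step.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical step 1: remove rows that contain missing values.
    return df.dropna()


def rescale(df: pd.DataFrame, factor: float) -> pd.DataFrame:
    # Hypothetical step 2: multiply every value by a constant.
    return df * factor


def add_row_sum(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical step 3: append a column holding each row's total.
    return df.assign(total=df.sum(axis=1))


def process_data1(df: pd.DataFrame) -> pd.DataFrame:
    # Wrapper: runs the steps in order; each step receives the previous result.
    return df.pipe(drop_missing).pipe(rescale, factor=2.0).pipe(add_row_sum)


raw = pd.DataFrame({'A': [1.0, None, 3.0], 'B': [4.0, 5.0, 6.0]})
result = process_data1(raw)
```

A second wrapper for my_data2 would follow the same pattern with its own list of steps, and functions shared by both can simply be imported from a common module.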
