12

Often I find myself running into the same question. A common pattern is that I create a class that performs some operations. Eg. Loads data, transforms/cleans data, saves data. The question then arises how to pass/save intermediate data. Look at the following 2 options:

import read_csv_as_string, store_data_to_database

class DataManipulator:
    ''' Intermediate data states are saved in self.results'''

    def __init__(self):
        self.results = None

    def load_data(self):
        '''do stuff to load data, set self.results'''
        self.results = read_csv_as_string('some_file.csv')

    def transform(self):
        ''' transforms data, eg get first 10 chars'''
        transformed = self.results[:10]
        self.results = transformed

    def save_data(self):
        ''' stores string to database'''
        store_data_to_database(self.results)

    def run(self):
        self.load_data()
        self.transform()
        self.save_data()

DataManipulator().run()

class DataManipulator2:
    ''' Intermediate data states are not saved but passed along'''


    def load_data(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string('some_file.csv')

    def transform(self, results):
        ''' transforms data, eg get first 10 chars'''
        return results[:10]

    def save_data(self, data):
        ''' stores string to database'''
        store_data_to_database(data)

    def run(self):
        results = self.load_data()
        trasformed_results = self.transform(results)
        self.save_data(trasformed_results)

DataManipulator2().run()

Now for writing tests, I find DataManipulator2 better since functions can be tested more easily in isolation. At the same time I also like the clean run function of DataManipulator. What is the most pythonic way?

3
  • With the strictly sequential processing that you show, then DataManipulator2 is cleaner. If, however, you ever need any of the intermediate results, then the first way will have to be used. Commented Apr 16, 2019 at 10:42
  • 3
    If you get a DataManipulator object, you don't necessarily know if self.results is the raw loaded data or the transformed data. That seems like it could lead to reasonable-looking code that has unexpected results. Commented Apr 29, 2019 at 20:46
  • 1
    Assuming this is under a <something>.py module, do you have other types of objects with methods load_data, transform, save_data and run? If not, I'd strongly suggest just using functions. IMHO, OOP doesn't make sense if there is no need for polymorphism. Commented May 1, 2019 at 13:35

3 Answers 3

11
+50

Unlike what was said in the other answers, I don't think this is a matter of personal taste.

As you wrote, DataManipulator2 seems, at first sight, easier to test. (But as @AliFaizan stated, it's not so easy to unit test a function that needs a database connection.) And it seems easier to test because it's stateless. A stateless class is not automatically easier to test, but it is easier to understand: for one input, you always get the same output.

But that's not the only point: with DataManipulator2, the order of the actions in run can't be wrong, because each function passes some data to the next one, and the next one can't proceed without this data. That would be more obvious with a statically (and strongly) typed language, because you can't even compile a erroneous run function.

On the contrary, DataManipulator is not easily testable, stateful and doesn't ensure the order of the actions. That's why the method DataManipulator.run is so clean. It's event too clean because its implementation hides something very important: function calls are ordered.

Hence, my answer: prefer the DataManipulator2 implementation to the DataManipulator implementation.

But is the DataManipulator2 perfect? Yes and no. For a quick and dirty implementation, that's the way to go. But let's try to go further.

You need the function run to be public, but load_data, save_data and transform have no reason to be public (by "public" I mean: not marked as implementation detail with an underscore). If you mark them with an underscore, they are not part of the contract anymore and you are not comfortable with testing them. Why? Because the implementation may change without breaking the class contract although there may be tests failures. That's a cruel dilemma: either your class DataManipulator2 has the correct API or it is not fully testable.

Nevertheless, these functions should be testable, but as part of the API of another class. Think of a 3-tier architecture:

  • load_data and save_data are in the data layer
  • transform is in the business layer.
  • the run call is in the presentation layer

Let's try to implement this:

class DataManipulator3:
    def __init__(self, data_store, transformer):
        self._data_store = data_store
        self._transformer = transformer

    def run(self):
        results = self._data_store.load()
        trasformed_results = self._transformer.transform(results)
        self._data_store.save(transformed_results)

class DataStore:
    def load(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string('some_file.csv')

    def save(self, data):
        ''' stores string to database'''
        store_data_to_database(data)

class Transformer:
    def transform(self, results):
        ''' transforms data, eg get first 10 chars'''
        return results[:10]

DataManipulator3(DataStore(), Transformer()).run()

That's not bad, and the Transformer is easy to test. But:

  • the DataStore is not handy: the file to read is buried in the code and the database too.
  • the DataManipulator should be able to run a Transformer on multiple data samples.

Hence another version that adresses these issues:

class DataManipulator4:
    def __init__(self, transformer):
        self._transformer = transformer

    def run(self, data_sample):
        data = data_sample.load()
        results = self._transformer.transform(data)
        self.data_sample.save(results)

class DataSample:
    def __init__(self, filename, connection)
        self._filename = filename
        self._connection = connection

    def load(self):
        ''' do stuff to load data, return results'''
        return read_csv_as_string(self._filename)

    def save(self, data):
        ''' stores string to database'''
        store_data_to_database(self._connection, data)

with get_db_connection() as conn:
    DataManipulator4(Transformer()).run(DataSample('some_file.csv', conn))

There's one more point: the filename. Try to prefer file-like object to filenames as arguments, because you can test your code with the io module:

class DataSample2:
    def __init__(self, file, connection)
        self._file = file
        self._connection = connection

    ...

dm = DataManipulator4(Transformer())
with get_db_connection() as conn, open('some_file.csv') as f:
    dm.run(DataSample2(f, conn))

With mock objects, it's now very easy to test the behaviour of the classes.

Let's summarize the advantages of this code:

  • the order of actions is ensured (as in DataManipulator2)
  • the run method is as clean as it should be (as in DataManipulator2)
  • the code is modular: you can create a new Transformer, or a new DataSample (load from a DB and save to a csv file for instance)
  • the code is testable: every method is public (in Python sense) but the APIs remain simple.

Of course, this is really (old style) Java-like. In python, you can simply pass the function transform instead of an instance of the Transformer class. But as soon as your transform begins to be complex, a class is a good solution.

Sign up to request clarification or add additional context in comments.

1 Comment

Very thorough answer. To me the most beneficial one. Thanks a lot!
3

What is the most pythonic way?

Python supports multiple paradigms. Your second form is closer to functional, the first one is more imperative. It is strictly a matter of preference, without context.

I have a 3rd suggestion, though, because I like when objects aren't stateful when it can be avoided. This is easily testable and avoids problems of all sorts in a complex faulty run() method (e.g. transform before load, transform called twice, save without transform etc.).


class DataTransformer:
    @classmethod
    def from_csv(cls, some_file):
        '''Because I don't like __init__ to do logic, it's harmful for testability, 
        but at the same time this is needed data for proper initialization
        '''
        return cls(read_csv_as_string(some_file))

    def __init__(self, raw_data):
        ''' Feel free to init with bogus test data '''
        self.raw_data = raw_data

    def transform(self):
        ''' Returning the data instead of a ContentSaver is a less coupled design (suppose you add more exporters)'''
        return self.raw_data[:10]

class ContentSaver:
    '''Having a different class makes sense now the data is transformed:
    it's a different type of data, from a logical standpoint.'''
    def __init__(self, some_content):
        self.content = some_content

    def save_data(self):
        store_data_to_database(self.content)

def run():
    '''Note this code part isn't easily testable, so it's better if possible mistakes are made fewer.'''
    transformer = DataTransformer.from_csv('some_file')
    writer = ContentSaver(transformer.transform())
    # Possible further uses of transformer and writer without care of order
    writer.save_data()

At any point during my objects' lifetime, they hold initialized data of consistent type. This makes them testable, less error prone and more likely to be useful in different implementations (not just run()).

Because of all the listed benefits, I'd say it's worth writing a class at each structuring step of your pipeline (DataCleaner and so on) as it will be easier to maintain as your code grows.

2 Comments

+1 since the classes do what they say (e.g. DataTransformer only transforms data and the ContentSaver saves it) and they are very well unit testable. These are important requirements when writing software, IMHO.
Agree with @wovano. It's the S in the SOLID principle.
0

I won't go into functional vs imperative style. If a language provides you a feature which makes your life easier, use it, regardless of philosophy behind.
You said you find DataManipulator2 easier to test. I don't agree with that. For example in function save_data, you will pass data as input in DataManipulator2. In DataManipulator you will have to use it as fixture. Have a look at two most famous python testing libraries pytest and unittest to explore different styles of writing tests.
Now I see there are two things you need to consider. First ease of your own use. You mentioned you find DataManipulator more cleaner. It shows this way is more natural to you and possibly you team. No matter how many times I say DataManipulator2 is easier and cleaner, It will be you how will modify, maintain and explain code to other people. So go with the approach most clear to you. Second important thing you should consider is how close you code is coupled with your data (I don't believe any approach is wrong).
In first approach whenever you will perform any action, it will* change your state (not ALWAYS ture. Your functions can perform action on self.result and give output without state change). You can look at it like you are editing a file with auto save enabled. Only difference is you can't undo (at least not with ctr/cmd + z). In 2nd option you or user of your class will decide if they want to save. A bit more work may be but freedom for both creator and user of class.
Verdict: Define purpose of your class, it's responsibilities and overall structure of your code. If it is data oriented class e.g. data python 3.7 data classes with Frozen=False, go with first approach. If it is service style class (think of it as REST api for other parts of code), go with second approach.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.