Unlike what was said in the other answers, I don't think this is a matter of personal taste.
As you wrote, DataManipulator2 seems, at first sight, easier to test. (But as @AliFaizan stated, it's not so easy to unit test a function that needs a database connection.) And it seems easier to test because it's stateless. A stateless class is not automatically easier to test, but it is easier to understand: for one input, you always get the same output.
But that's not the only point: with DataManipulator2, the order of the actions in run can't be wrong, because each function passes some data to the next one, and the next one can't proceed without this data. That would be more obvious with a statically (and strongly) typed language, because you can't even compile a erroneous run function.
On the contrary, DataManipulator is not easily testable, stateful and doesn't ensure the order of the actions. That's why the method DataManipulator.run is so clean. It's event too clean because its implementation hides something very important: function calls are ordered.
Hence, my answer: prefer the DataManipulator2 implementation to the DataManipulator implementation.
But is the DataManipulator2 perfect? Yes and no. For a quick and dirty implementation, that's the way to go. But let's try to go further.
You need the function run to be public, but load_data, save_data and transform have no reason to be public (by "public" I mean: not marked as implementation detail with an underscore). If you mark them with an underscore, they are not part of the contract anymore and you are not comfortable with testing them. Why? Because the implementation may change without breaking the class contract although there may be tests failures. That's a cruel dilemma: either your class DataManipulator2 has the correct API or it is not fully testable.
Nevertheless, these functions should be testable, but as part of the API of another class. Think of a 3-tier architecture:
load_data and save_data are in the data layer
transform is in the business layer.
- the
run call is in the presentation layer
Let's try to implement this:
class DataManipulator3:
def __init__(self, data_store, transformer):
self._data_store = data_store
self._transformer = transformer
def run(self):
results = self._data_store.load()
trasformed_results = self._transformer.transform(results)
self._data_store.save(transformed_results)
class DataStore:
def load(self):
''' do stuff to load data, return results'''
return read_csv_as_string('some_file.csv')
def save(self, data):
''' stores string to database'''
store_data_to_database(data)
class Transformer:
def transform(self, results):
''' transforms data, eg get first 10 chars'''
return results[:10]
DataManipulator3(DataStore(), Transformer()).run()
That's not bad, and the Transformer is easy to test. But:
- the
DataStore is not handy: the file to read is buried in the code and the database too.
- the
DataManipulator should be able to run a Transformer on multiple data samples.
Hence another version that adresses these issues:
class DataManipulator4:
def __init__(self, transformer):
self._transformer = transformer
def run(self, data_sample):
data = data_sample.load()
results = self._transformer.transform(data)
self.data_sample.save(results)
class DataSample:
def __init__(self, filename, connection)
self._filename = filename
self._connection = connection
def load(self):
''' do stuff to load data, return results'''
return read_csv_as_string(self._filename)
def save(self, data):
''' stores string to database'''
store_data_to_database(self._connection, data)
with get_db_connection() as conn:
DataManipulator4(Transformer()).run(DataSample('some_file.csv', conn))
There's one more point: the filename. Try to prefer file-like object to filenames as arguments, because you can test your code with the io module:
class DataSample2:
def __init__(self, file, connection)
self._file = file
self._connection = connection
...
dm = DataManipulator4(Transformer())
with get_db_connection() as conn, open('some_file.csv') as f:
dm.run(DataSample2(f, conn))
With mock objects, it's now very easy to test the behaviour of the classes.
Let's summarize the advantages of this code:
- the order of actions is ensured (as in
DataManipulator2)
- the
run method is as clean as it should be (as in DataManipulator2)
- the code is modular: you can create a new
Transformer, or a new DataSample (load from a DB and save to a csv file for instance)
- the code is testable: every method is public (in Python sense) but the APIs remain simple.
Of course, this is really (old style) Java-like. In python, you can simply pass the function transform instead of an instance of the Transformer class. But as soon as your transform begins to be complex, a class is a good solution.
DataManipulatorobject, you don't necessarily know ifself.resultsis the raw loaded data or the transformed data. That seems like it could lead to reasonable-looking code that has unexpected results.<something>.pymodule, do you have other types of objects with methodsload_data,transform,save_dataandrun? If not, I'd strongly suggest just using functions. IMHO, OOP doesn't make sense if there is no need for polymorphism.