I have an application running in Python 3.9.4 where I store class objects in sets (along with many other kinds of objects). I'm getting non-deterministic behavior even when PYTHONHASHSEED=0 because class objects get non-deterministic hash codes. I assume that's because class objects' hash codes come from their addresses in memory.

For example, here are two runs of a little test program, where Before and Equation are classes:

print(hash(Before), hash(Equation), hash(int))
304555224 304593057 271715397

print(hash(Before), hash(Equation), hash(int))
326601328 293027788 273337413

How can I get Python to generate deterministic hash values for class objects?

Is there a metaclass or something that I could monkey-patch so that all class objects, even int, get a hash function that I specify?

  • implement a deterministic __hash__() method in your classes Commented Nov 17, 2021 at 8:34
  • @JanWilamowski Wouldn't that affect only the hash values of the class instances? Commented Nov 17, 2021 at 14:50
  • ah, I mistook what you call "class objects" as "instances". You have to implement __hash__() on your classes' metaclass, but I haven't gotten it to work yet Commented Nov 19, 2021 at 10:01
  • @JanWilamowski Ah, that gives me hope that there might be a way to do it. I've been wondering if it's a bug in Python that I should report. Commented Nov 19, 2021 at 14:20

1 Answer

The hash of a class is deterministic within the same process. Yes, in CPython it is based on the memory address, but you can't simply "move" a class object to another memory address from Python code.

If you happen to use some serialization/de-serialization transforms with the classes, the de-serialized objects will ordinarily be new objects, distinct from the original ones, and therefore will hash differently.

For the record: I could not reproduce the behavior you describe in the question. Within the same process, the hashes of the class objects stay the same.

If you are calculating the hashes in different processes, though, they will generally differ. So, although you don't mention multiprocessing, I assume that is your use case.
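
To make the distinction concrete, here is a minimal sketch that compares hashes inside one process with the hash computed by a freshly started interpreter. The helper name hash_of_int_in_fresh_process and the local class Before are only illustrative, and whether the two printed numbers actually differ depends on the platform and its memory layout.

```python
import subprocess
import sys


class Before:
    pass


# Within one process, the default hash of a class does not change.
print(hash(Before) == hash(Before), hash(int) == hash(int))


def hash_of_int_in_fresh_process():
    """Return hash(int) as computed by a newly started interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", "print(hash(int))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# The two numbers may or may not match, depending on where each
# process happens to place the int type object in memory.
print(hash(int), hash_of_int_in_fresh_process())
```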

Then, indeed, implementing proper __hash__ and __eq__ methods on the metaclass can give you stable hashing across processes. You can't do that for built-in classes such as int, though: they are implemented in native code and can't be changed from the Python side. On the other hand, even though the hash values shown for these built-in classes differ between processes, whatever you use to serialize/deserialize your classes (which is what Python does to pass data between processes, even if you don't serialize anything explicitly) should normally resolve them to the one built-in class object of the receiving process.
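
As an illustration, assuming the serializer is the standard pickle module (which multiprocessing uses by default): pickle stores an importable class by reference, that is, by module and qualified name, so loading it gives back the class object that already exists in the current process. The sketch uses only built-in and standard-library classes.

```python
import fractions
import pickle

# Classes are pickled by reference (module + qualified name), not by value,
# so unpickling looks the class up again instead of creating a new one.
for cls in (int, dict, fractions.Fraction):
    restored = pickle.loads(pickle.dumps(cls))
    print(cls.__name__, restored is cls)
```

This round trip happens inside a single process; across processes each side resolves the name to its own copy of the class, which is why the default hashes can differ while the class is still "the same" by name.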

That brings us to the next point: while it is straightforward to add __eq__ and __hash__ methods to a metaclass for your classes, it would be better to ensure that deserializing always yields the same object (with the same ID). Hash stability, as you put it, could help ensure you always have the same class, but that depends on how you write your code. It is a bit tricky to retrieve the instance that is already inside a set when a containment check succeeds for another instance that compares equal to it. The most straightforward way is to build an identity dictionary out of the set and then use the value:

my_registry_dict = {element: element for element in my_registry_set}
my_class = my_registry_dict[incoming_class]
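
A self-contained sketch of that lookup, using a hypothetical Tag class (not from the question) whose instances compare equal by name:

```python
class Tag:
    """Hypothetical value-like class: equality and hash depend on `name` only."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Tag) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


original = Tag("Before")
my_registry_set = {original}

incoming = Tag("Before")  # equal to `original`, but a distinct object
print(incoming in my_registry_set, incoming is original)

# Map every element to itself, then look up by the equal-but-distinct key:
# the value returned is the object that was stored in the set.
my_registry_dict = {element: element for element in my_registry_set}
canonical = my_registry_dict[incoming]
print(canonical is original)
```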

With this in mind, we can have a custom metaclass that not only adds __eq__ and __hash__ but also customizes the __new__ method, so that de-serializing the same class a second time always reuses the first class object defined in the current process (i.e. it ensures the "singleton" behavior Python classes enjoy outside corner cases like yours seems to be). You have to pick which elements of the classes to compare for equality; class.__qualname__ can be a simple and functional attribute to use.

class Meta(type):
    registry = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        if cls not in mcls.registry:
            mcls.registry[cls] = cls
        else:
            # reuse the previously created class
            cls = mcls.registry[cls]
        return cls

    def __hash__(cls):
        # when working with metaclasses, using the name `cls` instead of `self`
        # helps remind us that we are dealing with instances that are
        # actually classes.
        # Note: str hashes are salted per process unless PYTHONHASHSEED is fixed.
        return hash(cls.__qualname__)

    def __eq__(cls, other):
        # Only compare against other classes; let Python handle anything else.
        if not isinstance(other, type):
            return NotImplemented
        return cls.__qualname__ == other.__qualname__
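
A short usage sketch of this idea, repeating a compact variant of the metaclass so the block runs on its own. Defining Before twice stands in for "the same class arriving a second time"; the variable name first is only illustrative.

```python
class Meta(type):
    registry = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        # setdefault returns the class already registered under an equal key,
        # or registers and returns the new one.
        return mcls.registry.setdefault(cls, cls)

    def __hash__(cls):
        return hash(cls.__qualname__)

    def __eq__(cls, other):
        if not isinstance(other, type):
            return NotImplemented
        return cls.__qualname__ == other.__qualname__


class Before(metaclass=Meta):
    pass


first = Before


class Before(metaclass=Meta):  # same qualified name as the first definition
    pass


# The second definition resolves to the class object created first.
print(Before is first)
print(hash(Before) == hash(Before.__qualname__))
```

Keep in mind that the hash of a str is itself randomized per process unless PYTHONHASHSEED is set to a fixed value, so this only yields the same numbers across runs under a setting like the PYTHONHASHSEED=0 mentioned in the question.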


6 Comments

Thanks for the thoughtful, extensive answer. I need to think about what you've said here to determine if it can solve my problem. I need the program to produce the same results each time I run it, not only on one machine but on all machines. For example, if I put some objects into a set, then when I iterate through the set, I need the objects to come out in the same sequence each time I run the program.
If you need the same sequence across runs, then "set" is not your data structure. Set is not designed to be deterministic in Python, and trying to work around that by forcing the hash seed is not the way to go. If you need a deterministic order, pick a criterion for sorting, and convert your set to a list with a sorted() call before any output. It is also possible to create a "sorted set" class that maintains an internal, parallel list or dict and reproduces the input order when fetching elements. If you need further help, please state your requirements and post a follow-up question.
"Hash for classes is deterministic within the same process ." -- is this documented somewhere?
It follows from the general rule that once objects are hashable, their hash value can't change during the object's lifetime. So, yes, it is "documented" in the explanation of how hashing works in Python. (Actually "shouldn't change", as one is free to write broken code creating a class whose instances have a variable hash, but the language's core objects follow this rule.)
I meant class objects rather than instance objects. Like I said, is this documented somewhere? I don't see anything about how the default __hash__ implementation behaves with respect to class objects -- docs.python.org/3/reference/datamodel.html#object.__hash__
Yes, I know what you mean. The answer is still the same: there is no exception to the hashing behavior due to an object being a class: the default hash is the object ID and that can't change for a class.
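
The two suggestions from the comments (sorting before output, or keeping insertion order) might look like the sketch below. The key function class_sort_key and the example classes are assumptions for illustration; the second half relies on dicts preserving insertion order, which the language guarantees from Python 3.7 on.

```python
class Before:
    pass


class Equation:
    pass


def class_sort_key(cls):
    """Order classes by where they are defined, not by their hash."""
    return (cls.__module__, cls.__qualname__)


# Option 1: keep the set, but sort before iterating or printing.
classes = {Equation, int, Before}
ordered = sorted(classes, key=class_sort_key)
print([c.__qualname__ for c in ordered])

# Option 2: use a dict as an insertion-ordered set.
# Iteration follows insertion order regardless of the hash values.
ordered_set = dict.fromkeys([Equation, int, Before, int])
print([c.__qualname__ for c in ordered_set])
```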
