How to safely use the file system as a sort of shared memory in Python?

Question

TLDR: Script A creates a directory and writes files in it. Script B periodically checks that directory. How does script B know when script A is done writing so that it can access the files?

I have a Python script (call it the render server) that receives requests to generate images and associated data. I need to run a separate Python application (call it the consumer) that makes use of this data. The consumer does not know when new data will be available. Ideally it should not have to know of the presence of the render server, just that data somehow becomes available.

My quick and dirty solution is to have an outputs directory known to both Python scripts. In that directory, the render server creates timestamped directories and saves several files within those directories.

The render server does something like:

os.makedirs('outputs/' + timestamped_subdir)
# Write files into that directory.

The consumer checks that directory kind of like:

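import time
from glob import glob

dirs = set()
while True:
    new_dirs = set(glob('outputs/*')).difference(dirs)
    if not new_dirs:
        time.sleep(0.5)  # Avoid a busy loop while waiting.
        continue
    dirs.update(new_dirs)  # Remember what has already been seen.
    # Do stuff with the contents of the latest new directory.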

The problem is that the consumer checks the contents of the directory before the render server finishes writing (and this is evident in a FileNotFoundError). I tried to fix this by making the render server do:

os.makedirs('temp')
# Write files into that directory.
shutil.copytree('temp', 'outputs/' + timestamped_subdir)

But the consumer is still able to know of the presence of the timestamped_subdir before the files within are done being copied (again there's a FileNotFoundError). What's one "right" way to do what I'm trying to achieve?

Note: While writing this I realised I should do shutil.move instead of shutil.copytree and that seems to have fixed it. But I'm still not sure enough of the underlying mechanisms of that operation to know for sure that it works correctly.

Do you have control over the "render server"? If so, change its code to write to 'outputs/' + timestamped_subdir + '_temp'. When the "render server" is finished with that directory, have it do an os.rename('outputs/' + timestamped_subdir + '_temp', 'outputs/' + timestamped_subdir). That rename will be atomic as long as everything resides on the same filesystem. Now your other process just has to ignore the directories ending in _temp, and when it sees another folder, it'll know that folder is finished and complete. If you can't change the "render server", it's a whole different issue.
– nos, Commented Jan 3, 2023 at 14:24
@nos Yes, I can do that. Is this not what I effectively described at the end of my post, though? I use shutil.move, which I believe is the same as os.rename. And if the answer is "yes, it is the same", cool. I just want to know that others believe this is a solid solution.
– Alexander Soare, Commented Jan 3, 2023 at 14:26
It's a solid solution if no communication can happen except through the filesystem. If some communication is allowed, there should be a "supervisor" process that worker processes report to when they finish tasks, and that notifies the "render" process (all through pipes/queues) that a change has been made so it can start processing. That would be the case if you need to act on the files as soon as they are created, but in your case changing names is probably the best fix, as timing doesn't seem critical.
– Ahmed AEK, Commented Jan 3, 2023 at 14:32
@AlexanderSoare Yes, that would be the same: shutil.move() will just do an os.rename() in your case. However, shutil.move() does a bit of magic and can fall back to non-atomic file operations depending on whether the destination directory already exists, or whether the source and destination directories are on different filesystems, whereas calling os.rename() directly lets you handle those as error cases.
– nos, Commented Jan 3, 2023 at 14:38
Instead of using the filesystem, use a database; that's what they are designed for (concurrent access). It also works with images, which you cannot really store in some databases: you store only the URI (folder and file name) in the database once the file is created, and you mark in the database when the file has been used.
– Coding thermodynamist, Commented Jan 3, 2023 at 15:23

nos · Accepted Answer · 2023-01-03 17:55:19Z

1

One common way to handle communication through the file system is to rely on atomic renames or linking of files or folders.

Change your "render server" to write to a folder named e.g.

'outputs/' + timestamped_subdir + '_temp/'

When the "render server" is finished with that directory, change it to do an

os.rename('outputs/' + timestamped_subdir + '_temp', 'outputs/' + timestamped_subdir)

That rename will be atomic as long as everything resides on the same filesystem.
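To make the same-filesystem condition explicit, here is a minimal sketch of a wrapper around os.rename. The helper name publish is hypothetical (it is not part of the original answer), and the sketch assumes a POSIX-style platform where a cross-filesystem rename fails with errno.EXDEV. The point is to fail loudly rather than silently fall back to a copy, which would not be atomic.

```python
import errno
import os


def publish(temp_dir, final_dir):
    """Rename temp_dir to final_dir (hypothetical helper, not from the answer).

    Raises instead of falling back to a non-atomic copy.
    """
    try:
        os.rename(temp_dir, final_dir)
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # Source and destination are on different filesystems, so a
            # plain rename is not possible here.
            raise RuntimeError(
                f"{temp_dir!r} and {final_dir!r} are on different filesystems"
            ) from exc
        raise  # e.g. the destination already exists and cannot be replaced
```

Creating the _temp directory inside outputs/ itself, as suggested above, is what keeps both paths on the same filesystem in the usual case.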

Now your other process just has to ignore the directories ending in _temp, and when it sees another folder, it'll know that folder is finished and complete.
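Putting both sides together, the pattern could look roughly like the sketch below. The function names (write_output, find_new_dirs, consume_forever), the files dictionary, and the polling interval are all assumptions made for illustration; only the _temp suffix and the os.rename step come from the answer itself.

```python
import os
import time

TEMP_SUFFIX = "_temp"  # naming convention taken from the answer


def write_output(outputs_root, subdir_name, files):
    """Render-server side: write under a *_temp name, then rename once done.

    `files` is assumed to map file names to bytes payloads.
    """
    temp_path = os.path.join(outputs_root, subdir_name + TEMP_SUFFIX)
    final_path = os.path.join(outputs_root, subdir_name)
    os.makedirs(temp_path)
    for filename, payload in files.items():
        with open(os.path.join(temp_path, filename), "wb") as fh:
            fh.write(payload)
    # One rename within the same parent directory: the final name only
    # appears after every file has been written.
    os.rename(temp_path, final_path)
    return final_path


def find_new_dirs(outputs_root, seen):
    """Consumer side: finished directories not in `seen`, sorted by path."""
    new_paths = []
    for entry in os.scandir(outputs_root):
        if not entry.is_dir() or entry.name.endswith(TEMP_SUFFIX):
            continue  # skip plain files and directories still being written
        if entry.path not in seen:
            new_paths.append(entry.path)
    return sorted(new_paths)


def consume_forever(outputs_root, handle, poll_seconds=0.5):
    """Poll for finished directories and pass each one to `handle` once."""
    seen = set()
    while True:
        for path in find_new_dirs(outputs_root, seen):
            handle(path)
            seen.add(path)
        time.sleep(poll_seconds)
```

The consumer never needs to know about the render server; it only relies on the convention that a directory without the _temp suffix is complete.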

