Coordinating distributed Python processes using queuing or REST web service

Question

Server A has a process that exports n database tables as flat files. Server B contains a utility that loads the flat files into a DW appliance database.

A process runs on server A that exports and compresses about 50-75 tables. Each time a table is exported and a file produced, a .flag file is also generated.

Server B has a bash process that repeatedly checks for each .flag file produced by server A. It does this by connecting to A and checking for the existence of a file. If the flag file exists, Server B will scp the file from Server A, uncompress it, and load it into an analytics database. If the file doesn't yet exist, it will sleep for n seconds and try again. This process is repeated for each table/file that Server B expects to be found on Server A. The process executes serially, processing a single file at a time.
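For reference, the check-wait-repeat cycle described above can be sketched in Python roughly as follows. This is only an illustration of the pattern, not the actual bash script: `flag_exists_for` and `fetch_and_load` are hypothetical callables standing in for the remote file check and the scp/uncompress/load step.

```python
import time


def wait_for_flag(flag_exists, poll_seconds=30, max_attempts=None, sleep=time.sleep):
    """Poll until flag_exists() returns True; return the number of checks made."""
    attempts = 0
    while True:
        attempts += 1
        if flag_exists():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError("flag file did not appear")
        sleep(poll_seconds)  # the "sleep for n seconds and try again" step


def process_serially(tables, flag_exists_for, fetch_and_load, **poll_options):
    """Handle one table at a time, blocking on each flag in turn."""
    for table in tables:
        wait_for_flag(lambda: flag_exists_for(table), **poll_options)
        fetch_and_load(table)
```

Because each table blocks on its own flag, a slow export early in the list delays every table behind it, even if those files are already available.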

Additionally: The process that runs on Server A cannot 'push' the file to Server B. Because of file-size and geographic concerns, Server A cannot load the flat file into the DW Appliance.

I find this process cumbersome, and it happens to be up for a rewrite/revamp. I'm proposing a messaging-based solution. I initially thought this would be a good candidate for RabbitMQ (or the like), where:

Server A would write a file, compress it and then produce a message for a queue.
Server B would subscribe to the queue and would process files named in the message body.
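A minimal sketch of those two roles, assuming the third-party `pika` client for RabbitMQ. The queue name, the JSON message layout, and the `handle` callback are assumptions made for illustration; they are not taken from the proof-of-concept.

```python
import json

QUEUE = "exported_files"  # hypothetical queue name


def build_message(table, file_path):
    """Encode the message body that Server A publishes."""
    return json.dumps({"table": table, "path": file_path}).encode("utf-8")


def parse_message(body):
    """Decode a message body on Server B into (table, path)."""
    msg = json.loads(body.decode("utf-8"))
    return msg["table"], msg["path"]


def publish_export(table, file_path, host="localhost"):
    """Server A: announce that a compressed export is ready."""
    import pika  # third-party; imported lazily so the helpers above work without it

    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = conn.channel()
        channel.queue_declare(queue=QUEUE, durable=True)
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE,
            body=build_message(table, file_path),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
    finally:
        conn.close()


def consume_exports(handle, host="localhost"):
    """Server B: call handle(table, path) for each message, acking on success."""
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)  # one unacked message per worker

    def on_message(ch, method, properties, body):
        table, path = parse_message(body)
        handle(table, path)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()
```

Running several `consume_exports` workers against the same queue is one way to get the parallel loads, since the broker hands each message to only one of them.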

I feel that a messaging-based approach would not only save time as it would eliminate the check-wait-repeat cycle for each table, but also permit us to run processes in parallel (as there are no dependencies).

I showed my team a proof-of-concept using RabbitMQ and they were all receptive to using messaging. A number of them quickly identified other opportunities where we would benefit from message-based processing. One such opportunity would be populating our DW dimensions in real time rather than through batch.

It then occurred to me that an MQ-based solution might be overkill given the low volume (50-75 tasks). Our operations team would have to install RabbitMQ (and its dependencies, including Erlang), and it would introduce new administration headaches.

I then realized this could be made simpler with a REST-based solution. Server A could produce a file and then make an HTTP call to a simple (web.py) web service on Server B. Server B could then initiate the transfer-and-load process based on the URL that is called. Given the time that it takes to transfer, uncompress, and load each file, I would likely use Python's multiprocessing to create a subprocess that loads each file.
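A rough sketch of that service, using the standard library's `http.server` in place of web.py so that it runs without extra dependencies. The URL layout (`POST /tables/<name>/ready`), the port, and `transfer_and_load` are assumptions made for illustration.

```python
import json
import multiprocessing
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def table_from_path(path):
    """Return the table name from '/tables/<name>/ready', or None if no match."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "tables" and parts[2] == "ready" and parts[1]:
        return parts[1]
    return None


def transfer_and_load(table):
    """Placeholder for: scp from Server A, uncompress, load into the DW."""
    print(f"loading {table}")


def start_load(table):
    """Run the slow load in its own process so the HTTP call returns quickly."""
    proc = multiprocessing.Process(target=transfer_and_load, args=(table,))
    proc.start()
    return proc


def make_handler(on_ready=start_load):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            table = table_from_path(self.path)
            if table is None:
                self._reply(404, {"error": "unknown path"})
                return
            on_ready(table)
            self._reply(202, {"accepted": table})  # 202: work continues in background

        def _reply(self, status, payload):
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def main(port=8080):
    """Serve until interrupted; call this from the script's entry point."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler()).serve_forever()
```

Server A would then issue something like `POST http://server-b:8080/tables/customers/ready` after writing each file. Note that if Server B is down when the call is made, the notification is simply lost unless Server A retries, and a production service would also need to track and join the worker processes.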

I'm thinking that the REST-based solution is ideal given the fact that it's simpler. In my opinion, using an MQ would be more appropriate for higher-volume tasks, but we're only talking (for now) 50-75 operations with potentially more to come.

Would REST-based be a good solution given my requirements and volume? Are there other frameworks or OSS products that already do this? I'm looking to add messaging without creating other administration and development headaches.

Update: thanks, everyone. I've sided with ZooKeeper for this project!
– Neil Kodner, Commented Jan 11, 2012 at 0:12

wberry · Accepted Answer · 2011-11-28 19:33:31Z

Message brokers such as Rabbit contain practical solutions for a number of problems:

multiple producers and consumers are supported without risk of duplication of messages
atomicity and unit-of-work logic provide transactional integrity, preventing duplication and loss of messages in the event of failure
horizontal scaling--most mature brokers can be clustered so that a single queue exists on multiple machines
no-rendezvous messaging--it is not necessary for sender and receiver to be running at the same time, so one can be brought down for maintenance without affecting the other
preservation of FIFO order

Depending on the particular web service platform you are considering, you may find that you need some of these features and must implement them yourself if not using a broker. The web service protocols and formats such as HTTP, SOAP, JSON, etc. do not solve these problems for you.
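To give a sense of what "implement them yourself" can involve, here is a minimal sketch of two of those features, duplicate suppression and retry after failure, built on the standard library's `sqlite3`. The table layout and function names are hypothetical, and this is far from a complete replacement for a broker.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    file_name TEXT PRIMARY KEY,
    state     TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_queue(path):
    # isolation_level=None gives autocommit; transactions are managed explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, file_name):
    """Record a file once; return False if this notification is a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO jobs (file_name) VALUES (?)", (file_name,)
    )
    return cur.rowcount == 1


def claim(conn):
    """Atomically take the oldest pending job, or return None if there is none."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT file_name FROM jobs WHERE state = 'pending' "
            "ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET state = 'running' WHERE file_name = ?", row
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None


def finish(conn, file_name, ok):
    """Mark a job done, or put it back in the queue if the load failed."""
    state = "done" if ok else "pending"
    conn.execute("UPDATE jobs SET state = ? WHERE file_name = ?", (state, file_name))
```

Even this small amount of code leaves open questions that a broker already answers, such as what happens to a job whose worker crashes while it is in the `running` state.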

In my previous job the project management passed on using message brokers early on, but later the team ended up implementing quick-and-dirty logic meant to solve some of the same issues as above in our web service architecture. We had less time to provide business value because we were fixing so many concurrency and error-recovery issues.

So while a message broker may seem on its face like a heavyweight solution, and may actually be more than you need right now, it does have a lot of benefits that you may need later without yet realizing it.

Rich Kroll · Accepted Answer · 2011-11-29 21:39:33Z

As wberry alluded to, a REST or web-hook based solution can be functional but will not be very tolerant of failure. Paying the operations price up front for messaging will pay long-term dividends, as you will find additional problems that are a natural fit for the messaging model.

Regarding other OSS options: if you are considering stream-based processing in addition to this specific use case, I would recommend taking a look at Apache Kafka. Kafka provides similar messaging semantics to RabbitMQ, but is tightly focused on processing message streams (not to mention that it has been battle-tested in production at LinkedIn).
