2

I'm writing a simple site spider and I've decided to take this opportunity to learn something new in concurrent programming in Python. Instead of using threads and a queue, I decided to try something else, but I don't know what would suit me.

I have heard about Stackless, Celery, Twisted, Tornado, and other things. I would rather not set up a database and all of Celery's other dependencies, but I would if it's a good fit for my purpose.

My question is: what offers a good balance between suitability for my app and usefulness in general? I have taken a look at the tasklets in Stackless, but I'm not sure whether the urlopen() call will block or whether the tasklets will execute in parallel. I haven't seen either point mentioned anywhere.

Can someone give me a few details on my options and what would be best to use?

Thanks.

1
  • 1
    Celery doesn't require a database, but it does require either RabbitMQ, a database supported by the Django ORM, or Redis, with RabbitMQ being the preferred option. While Celery is definitely useful for distributed computing, its main purpose is queueing asynchronous tasks for websites. You could offer Celery as an option in your program that adds parallelization when enabled. Or, if you have lots of data to analyze, maybe you should look into Hadoop? Commented Feb 15, 2010 at 14:37

3 Answers 3

4

Tornado is a web server, so it wouldn't help you much in writing a spider. Twisted is much more general (and, inevitably, complex), good for all kinds of networking tasks (and with good integration with the event loop of several GUI frameworks). Indeed, there used to be a twisted.web.spider (but it was removed years ago, since it was unmaintained -- so you'll have to roll your own on top of the facilities Twisted does provide).


2

I must say that Twisted gets my vote.

Performing event-driven tasks is fairly straightforward in Twisted. Integration with other important system components such as GTK+ and DBus is very easy.

The HTTP client support is basic for now but improving (>9.0.0): see related question.

The added bonus is that Twisted is available in the Ubuntu default repository ;-)

2 Comments

Hmm, I saw that and it looks very interesting. Apparently I can fire off N requests and have each callback fire one more, which would (hopefully) keep the number of requests in flight constant. Thanks for this!
Here are my "search wiki notes" on Twisted: google.ca/…
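
To make the "fire off N requests and have each completion start one more" idea concrete, here is a minimal sketch. It uses the standard-library asyncio module rather than Twisted, so it is not Twisted's API, only the same pattern. The names `fetch`, `crawl`, and `max_in_flight` are hypothetical, and `fetch` merely simulates a non-blocking request with a short sleep instead of touching the network.

```python
import asyncio


async def fetch(url):
    # Hypothetical stand-in for a real non-blocking HTTP request:
    # it only sleeps briefly and returns a fake page body.
    await asyncio.sleep(0.01)
    return "<html>%s</html>" % url


async def crawl(urls, max_in_flight=3):
    pending = list(urls)   # URLs that have not been requested yet
    results = {}           # url -> fetched body
    in_flight = 0          # requests currently running
    peak = 0               # highest value in_flight ever reached

    async def worker():
        nonlocal in_flight, peak
        # Each worker starts the next request as soon as its previous
        # one finishes, so the number in flight stays at most max_in_flight.
        while pending:
            url = pending.pop()
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results[url] = await fetch(url)
            finally:
                in_flight -= 1

    # Start exactly max_in_flight workers and wait for all of them.
    await asyncio.gather(*(worker() for _ in range(max_in_flight)))
    return results, peak
```

Something like `results, peak = asyncio.run(crawl(list_of_urls, max_in_flight=3))` would run it. A real spider would also push newly discovered links onto `pending` and track visited URLs, which this sketch leaves out.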
1

For a quick look at package sizes, see ohloh.net/p/compare .
Of course, source size is only a rough metric (what I'd really like is the number of pages of documentation, the number of pages of examples, and the dependencies), but it can help.
