
EDITED:

I have a crawler.py that crawls certain sites every 10 minutes and sends me emails about those sites. The crawler is ready and working locally.

How can I adjust it so that the following two things happen:

  1. It will run in an endless loop on the hosting that I'll upload it to.
  2. I will sometimes be able to stop it (e.g. for debugging).

At first, I thought of doing endless loop e.g.

crawler.py:

import time

while True:
    doCrawling()         # the crawler's entry point
    time.sleep(10 * 60)  # 10 minutes

However, according to the answers I got below, this would be impossible, since hosting providers kill processes after a while (for the sake of the question, let's assume processes are killed every 30 minutes). Therefore, my endless-loop process would be killed at some point.

Therefore, I have thought of a different solution. Let's assume that my crawler is located at "www.example.com/crawler.py" and each time it is accessed, it executes the function run():

import time
import urllib.request

def run():
    doCrawling()
    time.sleep(10 * 60)  # 10 minutes
    # request this same script so that it runs again
    urllib.request.urlopen("http://www.example.com/crawler.py")

Thus, there is no endless loop: every time my crawler runs, it also accesses the URL, which executes the same crawler again. There is no process with a long running time, and my crawler will continue operating forever.

Will my idea work? Are there any hidden drawbacks I haven't thought of?

Thanks!


  • Sharing your existing code might help, as those questions are pretty vague. Commented May 30, 2015 at 16:27
  • Well, I have a file crawler.py, and inside there is one function run() that does everything. Now I want it to run every 10 minutes, and I also want to be able to stop it in case I need to. Thanks for your help! Commented May 30, 2015 at 16:57
  • You could try what you describe, but it's definitely not going to be reliable. Also, your crawling will need to be fast, because the default timeout is usually around 90 seconds. Actually, I would not do a loop; I would just call the page recursively and check whether the current time in minutes is % 10 == 0, or something like that. Otherwise it is most likely to be killed while in sleep() and never call the next page. Please keep in mind that what we're describing is REALLY UGLY, INEFFICIENT and UNRELIABLE. It doesn't take much time to learn how to do a cron job... Commented May 31, 2015 at 15:50
  • Thanks @Maresh. Actually, I already read about cron jobs. I also spoke with the hosting company, and they said that I cannot run cron jobs on shared hosting, only on a Virtual Private Server, which would cost me approximately 4 times more (~$40/month instead of $10/month). Since my crawler is not going to make me any profit, I am not too enthusiastic about investing money in its hosting. This is why I am looking for other alternatives. Commented May 31, 2015 at 16:01
  • By the way, can I use scheduled tasks on my Windows PC to access the crawler URL constantly? Will it work? Commented May 31, 2015 at 16:16

3 Answers


As you stated in the comments, you are running on a public shared server like GoDaddy and so on. Therefore cron is not available there, and long-running scripts are usually forbidden: your process would be killed even if you were using sleep.

Therefore, the only solution I see is to use an external server that you control to connect to your public server and run the script every 10 minutes. One solution could be using cron on your local machine to connect with wget or curl to a specific page on your host. **

Maybe you can find online services that allow running a script periodically and use those, but I don't know of any.

** Bonus: you can get the results directly as the response, without having to send yourself an email.
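
Concretely, a crontab entry on the local machine for this could look like the following sketch (the URL is the placeholder from the question, and the curl path may differ on your system):

```
# fetch the crawler page every 10 minutes, discarding the output
*/10 * * * * curl -s http://www.example.com/crawler.py > /dev/null 2>&1
```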

Update

So, in your updated question you propose to have your script call itself with an HTTP request. I thought of this before, but I didn't consider it in my previous answer because I believe it won't work (in general).

My concern is: will the server kill a script if the HTTP connection requesting it is closed before the script terminates?

In other words: if you open yoursite.com/script.py and it takes 60 seconds to run, and you close the connection with the server after 10 seconds, will the script run till its regular end?

I thought the answer was obviously "no, the script will be killed", and therefore that the method would be useless, because you would have to guarantee that a script calling itself via an HTTP request stays alive longer than the called script. I did a little experiment using Flask, and it proved me wrong:

import time

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    print('Script started...')
    time.sleep(5)
    print('5 seconds passed...')
    time.sleep(5)
    print('Script finished')
    return 'Script finished'

if __name__ == '__main__':
    app.run()

If I run this script, make an HTTP request to localhost:5000, and close the connection after 2 seconds, the script continues to run until the end and the messages are still printed.

Therefore, with Flask, if you can make an asynchronous request to yourself, you should be able to have an "infinite loop" script.

I don't know the behavior on other servers, though. You should make a test.
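
As a sketch of that "asynchronous request to yourself" idea: the next call can be fired from a background thread, so the current response returns immediately instead of blocking for the whole sleep. The HTTP call is stubbed out here so the snippet is self-contained; in the real script it would be urllib.request.urlopen on your own URL (the URL below is the placeholder from the question):

```python
import threading
import time

SELF_URL = 'http://www.example.com/crawler.py'  # placeholder for your own URL

requested = []  # records what the stub "requested", for illustration only

def request_url(url):
    # Real version: urllib.request.urlopen(url). Stubbed out here so the
    # sketch runs without a network connection.
    requested.append(url)

def schedule_next_call(delay_seconds=600):
    """Sleep, then request this script's own URL to trigger the next run."""
    def worker():
        time.sleep(delay_seconds)
        request_url(SELF_URL)
    # A daemon thread lets the current HTTP response finish immediately.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

def run():
    # doCrawling() would go here
    return schedule_next_call()
```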

Control

Assuming your server allows you to make a GET request and keeps the script running even after the connection is closed, there are a few things to take care of. Your script still has to run fast enough to complete within the server's maximum time allowance. And to make your script run every 10 minutes with, say, a maximum allowance of 1 minute, you have to count 10 calls each time.

In addition, this mechanism has to be controlled, because you cannot interrupt it for debugging, as you requested. At least, not directly.

Therefore, I suggest you use files: use a file to split your crawling into smaller steps, each able to finish in less than one minute, and then continue from where you left off when the script is called again.

Use a file to count how many times the script is called, before actually doing the crawling. This is necessary if, for example, the script is allowed to live 90 seconds, but you want to crawl every 10 hours.

Use a file to control the script: store a boolean flag that you use to stop the recursion mechanism if you need to.
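
A minimal sketch of those last two files, with made-up file names, could be:

```python
import os

FLAG_FILE = 'stop.flag'          # hypothetical file names
COUNTER_FILE = 'call_count.txt'
CALLS_PER_CRAWL = 10             # e.g. hit every minute, crawl every 10th hit

def should_stop():
    """Stop the self-calling chain as soon as the flag file exists."""
    return os.path.exists(FLAG_FILE)

def is_crawl_turn():
    """Increment a persistent call counter; True once every CALLS_PER_CRAWL."""
    count = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(COUNTER_FILE, 'w') as f:
        f.write(str(count))
    return count % CALLS_PER_CRAWL == 0
```

To stop the crawler for debugging, you would simply create stop.flag on the server (e.g. via FTP) and delete it when you want the chain to resume.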


3 Comments

Thanks for your detailed answer. Well, such a solution would require my local machine to run forever and request a URL (to activate the script) every 10 minutes... This would not be a preferred solution, since it means I have to dedicate a machine to the task. Is there any other, more "friendly" solution? :(
Agreed, that is the only thing possible if he wants to stick to a web-application pattern. Although it is very silly to have two machines running something that could be done with one.
I updated my answer. But, as @Maresh said, this is not an efficient or reliable solution (I won't call it ugly, though: it's like recursion :)

If you're using Linux you should just create a cron job for your script. Info: http://code.tutsplus.com/tutorials/scheduling-tasks-with-cron-jobs--net-8800

6 Comments

It is not my own server I am using for hosting, just some shared hosting provider (e.g. GoDaddy or something...).
Watch out because the server may kill long running processes: usually there is a limit on the time a single program can run.
Yeah... What could I do?
Well, I have a GoDaddy shared host and it allows me to easily create cron jobs through a cPanel interface. That said, your solution (a script that calls itself after being run) seems workable to me... If your server is going to kill things based on run time, it'll probably kill by process ID, so your script will be OK. Given that this technique may behave differently depending on server setup, I'd make a test before running the real thing. You could, for example, make a script that only sends you an email, and check whether it works by calling itself.
@FBidu, thanks, I'll try your suggestion regarding the script. Good to know that GoDaddy allows this. Do you have shared hosting on GoDaddy? What sort of package do you have?

If you are running Linux I would set up an upstart script (http://upstart.ubuntu.com/getting-started.html) to turn it into a service. It offers a lot of advantages, like:

  - starting at system boot
  - automatic restart on crashes
  - manageability: service mycrawler restart
  - ...
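
As a rough sketch, an upstart job for this could look like the following (all paths are placeholders):

```
# /etc/init/mycrawler.conf
description "site crawler"

start on runlevel [2345]
stop on runlevel [016]

respawn
exec /usr/bin/python /path/to/crawler.py
```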

Or, if you would prefer to have it run every 10 minutes, forget about the endless loop and use a cron job: http://en.wikipedia.org/wiki/Cron
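
With cron, the endless loop and the sleep disappear entirely; a crontab entry like this (placeholder path) runs the crawler every 10 minutes:

```
*/10 * * * * /usr/bin/python /path/to/crawler.py
```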

9 Comments

Thanks @Maresh. It is not my own server I am using for hosting, therefore I can't run any software; it's just some shared hosting provider (e.g. GoDaddy or something...). I can only upload html/js/css/php/py/... files.
Well, it is tricky then, because web applications are not meant to run endless loops...
The problem is that the web server that runs your application has a timeout for its subprocesses, and you don't have control over that. You need a dedicated server to do what you want. Or you could call the URL of your script every 10 minutes from another computer, but that's kind of stupid because you then need two machines...
As I said, if it is behind an HTTP server (like Apache) there's nothing you can do, unless your hosting provider offers some sort of cron-job API. What's your provider, if I may ask?
Ah, well then you should definitely go for a Linux virtual private server, and not just web hosting. It is more administration, but you can do much more, including what you want.
