
I'm running a cURL script that does data mining; running through all the data takes approximately 600 seconds. So I figured that if I split the load across two, three, or more threads, I could divide that 600 seconds among them.

Any suggestions?

I know one way to do this is via the Windows scheduler: I can have it execute multiple files. But ideally I'd like to have the scheduler execute one single file (i.e. php-cgi thefilename.php) and have that one exec multiple others.
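
Roughly what I have in mind is something like this (an untested sketch; worker.php, the count of 3, and the offsets are all placeholders):

<?php
// Untested sketch of the launcher idea: one script, run by the
// scheduler, that spawns several independent workers on Windows.
// pclose(popen(...)) fires each worker off without waiting for it,
// and "start /B" detaches it from this process.
// worker.php and the offset values are placeholders.
for ($i = 0; $i < 3; $i++) {
    pclose(popen('start /B php-cgi worker.php a=' . ($i * 5), 'r'));
}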

Thanks!

  • Threading isn't a magic "make my computer faster" bullet. You can't just add another thread and reliably halve the time it takes to do something. Commented Jan 13, 2011 at 0:14
  • Yeah, but a lot of that time is waiting for requests to come back - no CPU usage, but the thread is blocked, so starting additional threads makes sense. Commented Jan 13, 2011 at 0:16
  • Any suggestions you'd recommend? The load time comes from cURL running scans on various pages, each of which takes time. So I was hoping I could run, e.g., three "bots", each in its own process (using php-cgi / php.exe). Commented Jan 13, 2011 at 0:16
  • Maybe some third-party program that downloads all the data before processing? On *nix I would recommend wget; there should be something similar for Windows. Commented Jan 13, 2011 at 0:20
  • German Rumm, that wouldn't work: the mining occurs every 5 minutes, and directly after running it invokes a function to scan and clean that data and insert it into a database. Commented Jan 13, 2011 at 0:23

2 Answers


If you're stuck on Windows, i.e. you don't have the pcntl extension, what I would recommend is using the curl_multi_* functions to execute multiple requests asynchronously. This is a good way to gain performance if your bottleneck is server delays.
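
A rough sketch of the pattern (untested; the URL list and the scan step are stand-ins for whatever your miner actually does):

<?php
// Fetch several pages concurrently with the curl_multi_* functions.
$urls = [
    'http://example.com/page1', // stand-ins for the real targets
    'http://example.com/page2',
    'http://example.com/page3',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until every one has completed.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait briefly instead of busy-looping
} while ($running > 0);

// Collect each response, then clean up.
foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... run the scan/clean step on $html here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The total fetch time is then bounded by the slowest page rather than the sum of all of them.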


3 Comments

Hmm, I forgot about curl_multi; I don't think it would work though. After loading the content via cURL, my script runs scans on it itself. This would only work if I could run the function multiple times. I think it makes more sense to run separate processes/threads; let me know what you think.
If you have a clearly defined job queue that you can easily split, multiple processes are probably easier to do, because curl_multi is not really fun to work with. On the other hand, curl_multi is more flexible if you're building something like a crawler that grows its job queue as it discovers content.
Thank you, appreciate it. You're definitely right: if it were just for directly storing/discovering content it would have worked, but because I've got cleaning functions invoked directly on the output, it won't. Thanks again though!

Not really answering your question, but a solution nevertheless. You could create a batch file like this:

start php-cgi thefilename.php
start php-cgi thefilename.php
start php-cgi thefilename.php

This will create three independent processes; start returns immediately rather than waiting for each one to finish.

14 Comments

So simply adding "start" at the beginning could do that?
Yes, the start command launches a new independent process from the command prompt.
This seems to have worked best so far. Now my next question would be: is it possible to pass a variable on any of those lines? E.g.:
start php-cgi thefilename.php a=0
start php-cgi thefilename.php a=5
What I'd like to do is tell that same PHP file to apply an offset to the array of URLs within, roughly as sketched below.
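
Inside the PHP file, something like this is what I mean (untested sketch; the chunk size is made up):

<?php
// Untested sketch: each process works on its own slice of the URL
// list, selected by an offset passed on the command line.
// php-cgi exposes a key=value argument through $_GET; if you switch
// to the CLI binary (php.exe), read $argv instead.
$offset = isset($_GET['a']) ? (int) $_GET['a'] : 0;

$urls  = [ /* ... the full list of pages to mine ... */ ];
$chunk = 5; // made-up chunk size; match it to the offsets you pass

foreach (array_slice($urls, $offset, $chunk) as $url) {
    // ... existing cURL fetch + scan/clean logic for $url ...
}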
