1

I just accepted a similar question (PHP + MySQL Queue), but I realized that it wasn't the correct question for my problem, but was the correct answer for my question :)

I have a MySQL (MyISAM type) table of sites to be scraped by workers.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_pop` int(13) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

What I need is to scrap one site per worker without repeat. So, if I have 3 sites and 2 workers the system needs to work like this:

      ID URL   LAST_POP
t4    1  site1 t1         <- worker1 scrap site1
t4    2  site2 t2         <- worker2 scrap site2
t5    3  site3 t3         <- worker1 scrap site3
t6    1  site2 t4         <- worker2 scrap site2
t6    2  site1 t4         <- worker1 scrap site1
t7    3  site3 t5         <- worker2 scrap site3
....

It's like a cycled queue orderer by last_pop ASC.

How can I do that?

2 Answers 2

1

You'll probably want to keep track of two pieces of information for each site: when it was last scraped and if it is currently being scraped.

Using the answer for your other question, set the scraping field to the id of the worker to lock it from other workers. When the worker is finished with the task set the scraping field back to null and the last_scrape date to the current time.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_scrape` TIMESTAMP,
  `scraping` tinyint(1) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

Lock and retrieve the next job (the site that was last scraped the longest time ago):

Update site
  set `scraping` = '$worker_id' 
  where `scraping` is null 
  order by `last_scrape` ASC limit 1;

$job = 
  Select * from site
  where `scraping` = '$worker_id'

Release job back to queue:

Update site
  set `scraping` = NULL,
  `last_scrape` = NOW()
  where `scraping` = '$worker_id';
Sign up to request clarification or add additional context in comments.

1 Comment

you could potentially combine the use of the timestamp and scraping status, as long as their values won't conflict. If you may have 100 workers, just select the job where last_scape > 101 and set last_scrape to the worker id up to 100 to specify it's being processed. Make sure you initialize the field to a larger value as well, or they could appear already locked to all workers.
0

Why not add an extra boolean column STATUS so that you can order by LAST_POP. So when a site is selected by a worker for scrapping run a second query with UPDATE site SET status = '1'. And when the next worker selects the next site run a query with SELECT * FROM site WHERE status = '0' ORDER BY last_pop ASC.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.