1

I'm developing a simple crawler for web pages. I've searched and found a lot of solutions for implementing multi-threaded crawlers. What is the best way to create a thread-safe queue to contain unique URLs?

EDIT: Is there a better solution in .Net 4.5?

4
  • possible duplicate of Classes in .Net 4.5 for writing a multi-thread C# crawler Commented Apr 10, 2012 at 10:49
  • 1
    OK! So I go there and post a question, few people vote for closing because it's not in ONE area. I come here and post it in ONE area, now you say it's a duplicate! I think whatever I do, some people want to try to close questions. It's easier than answering, right?! Commented Apr 10, 2012 at 10:53
  • 1
    You should consider deleting your old question that covers multiple areas. That way, this one won't be closed as a duplicate of the other question :) Commented Apr 10, 2012 at 11:01
  • Oh, I'm sorry. I didn't know I can delete a question :D Thanks. Funny, I've spent so much time on SO :D Commented Apr 10, 2012 at 11:03

5 Answers

2

Use the Task Parallel Library with the default scheduler, which runs tasks on the ThreadPool.


OK, this is a minimal implementation which runs at most 50 URLs at a time:

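    // Requires: using System; using System.Threading; using System.Threading.Tasks;
    public static void WebCrawl(Func<string> getNextUrlToCrawl, // returns a URL or null if no more URLs 
        Action<string> crawlUrl, // action to crawl the URL 
        int pauseInMilli // if all slots are busy, waits for n milliseconds before re-checking
        )
    {
        const int maxQueueLength = 50;
        string currentUrl = null;
        int queueLength = 0;

        while ((currentUrl = getNextUrlToCrawl()) != null)
        {
            string temp = currentUrl; // per-iteration copy captured by the lambda

            // Wait for a free slot, so the current URL is not dropped when the queue is full.
            // CompareExchange(ref x, 0, 0) is an atomic read of the counter.
            while (Interlocked.CompareExchange(ref queueLength, 0, 0) >= maxQueueLength)
            {
                Thread.Sleep(pauseInMilli);
            }

            // Count the task before it starts, so the limit cannot be overshot
            // while already-started tasks are still waiting for a thread.
            Interlocked.Increment(ref queueLength);
            Task.Factory.StartNew(() => crawlUrl(temp))
                .ContinueWith((t) =>
                {
                    if (t.IsFaulted)
                        Console.WriteLine(t.Exception.ToString());
                    else
                        Console.WriteLine("Successfully done!");
                    Interlocked.Decrement(ref queueLength);
                });
        }
    }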

Dummy usage:

    static void Main(string[] args)
    {
        Random r = new Random();
        int i = 0;
        WebCrawl(() => (i = r.Next()) % 100 == 0 ? null : ("Some URL: " + i.ToString()),
            (url) => Console.WriteLine(url),
            500);

        Console.Read();

    }

11 Comments

What about the new .Net 4.5? Is there a better solution in .Net 4.5? And could you please post a sample?
@AlirezaNoori 4.5 is not officially out yet, so how does that help you? I am not aware of any new classes that can help, although the async and await keywords will help.
I'm developing this app for my research. So it's not a problem. I have used async coding in Windows 8 but do you think using async is better than multithreading?
@AlirezaNoori For a web crawler, you should really use async rather than multithreading, as the majority of your time will be spent waiting for web pages. However, async code (certainly prior to 4.5) can be a bit complex to write, so whether the additional complexity is worth it depends on a lot of factors, including whether monopolising a lot of threads is a problem. It's a complex question and worth researching thoroughly.
I can easily use .Net 4.5 so I guess I can use async. Thanks. I'm going to give it a try.
2

ConcurrentQueue is indeed the framework's thread-safe queue implementation. But since you're likely to use it in a producer-consumer scenario, the class you're really after may be the infinitely useful BlockingCollection.
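
As a rough illustration of the producer-consumer pattern mentioned here, the sketch below feeds URLs through a bounded BlockingCollection<string> to a few worker tasks. The class name `BlockingCrawlSketch`, the `Run` method, the `crawlUrl` callback and the capacity of 100 are all assumptions made for this example and are not taken from the answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BlockingCrawlSketch
{
    // Hands every seed URL to one of `workerCount` consumer tasks and returns the URLs processed.
    // `crawlUrl` is a hypothetical callback standing in for the real download/parse step.
    public static List<string> Run(IEnumerable<string> seedUrls, Action<string> crawlUrl, int workerCount)
    {
        var processed = new ConcurrentBag<string>();

        // FIFO order via ConcurrentQueue; bounded to 100 pending items (assumed value),
        // so Add blocks the producer when the consumers fall behind.
        using (var pending = new BlockingCollection<string>(new ConcurrentQueue<string>(), 100))
        {
            var workers = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                workers[i] = Task.Factory.StartNew(() =>
                {
                    // Blocks while the collection is empty; the loop ends once
                    // CompleteAdding has been called and the collection is drained.
                    foreach (string url in pending.GetConsumingEnumerable())
                    {
                        crawlUrl(url);
                        processed.Add(url);
                    }
                }, TaskCreationOptions.LongRunning);
            }

            foreach (string url in seedUrls)
                pending.Add(url);

            pending.CompleteAdding(); // signal the consumers that no more URLs will arrive
            Task.WaitAll(workers);
        }

        return new List<string>(processed);
    }
}
```

In a real crawler the workers would usually add newly discovered links back into the collection, in which case deciding when to call CompleteAdding needs more care than this sketch shows.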

2 Comments

Could you please post a very quick sample? Thanks
Go to the link I gave for BlockingCollection. At the bottom you'll find a simple usage example.
1

Would System.Collections.Concurrent.ConcurrentQueue<T> fit the bill?

1 Comment

Thanks. Is there a better solution in .Net 4.5? And could you please post a simple sample?
1

I'd use System.Collections.Concurrent.ConcurrentQueue.

You can safely queue and dequeue from multiple threads.
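
ConcurrentQueue<T> on its own does not reject duplicates, so for the "unique URLs" part of the question one option is to pair it with a ConcurrentDictionary used as a set. This is only a minimal sketch: the `UniqueUrlQueue` class and its members are hypothetical names, and it assumes URLs are compared as exact strings (any normalization is left to the caller).

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper: a thread-safe FIFO queue that accepts each URL at most once.
public sealed class UniqueUrlQueue
{
    private readonly ConcurrentQueue<string> _queue = new ConcurrentQueue<string>();

    // ConcurrentDictionary used as a concurrent set; the byte value is ignored.
    // Assumption: exact (ordinal) string comparison decides whether two URLs are the same.
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>(StringComparer.Ordinal);

    // Returns true if the URL was new and has been queued, false if it was seen before.
    public bool TryEnqueue(string url)
    {
        // TryAdd is atomic, so only one thread can win for a given URL.
        if (!_seen.TryAdd(url, 0))
            return false;

        _queue.Enqueue(url);
        return true;
    }

    // Returns false when the queue is currently empty.
    public bool TryDequeue(out string url)
    {
        return _queue.TryDequeue(out url);
    }

    public int Count
    {
        get { return _queue.Count; }
    }
}
```

With this sketch, calling `TryEnqueue` twice with the same URL queues it once; the set of seen URLs is never trimmed, so a URL stays rejected even after it has been dequeued.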


1

Look at System.Collections.Concurrent.ConcurrentQueue. If you need to wait, you could use System.Collections.Concurrent.BlockingCollection

