
Status: solved

I put the code on pastebin because I need to refer to specific line numbers.

Note: I am not using ExecutorService or thread pools. I just want to understand what is wrong with starting and using threads this way. With a single thread the app works perfectly.

Related links:

http://www.postgresql.org/docs/9.1/static/transaction-iso.html
http://www.postgresql.org/docs/current/static/explicit-locking.html

Main app: http://pastebin.com/i9rVyari
Logs: http://pastebin.com/2c4pU1K8 , http://pastebin.com/2S3301gD

I start 10 threads in a for loop, each with a new instance of a Runnable class, but every thread seems to get the same result from the database. Each thread fetches a string from the database and then changes it, yet the next thread still gets the same string, even though a previous thread has already changed it. I am using JDBC with PostgreSQL. What are the usual causes of this?

See line 252 and line 223.

The link is marked as processed (processed = true) in the database. The other threads of the crawler class do the same. So when line 252 fetches a link, it should get one with processed = false, but I see all threads taking the same link.

When one of the threads has crawled a link, it sets processed = true. The other threads should then not fetch or crawl it, because it is marked processed = true.


getNonProcessedLinkFromDB() returns a non processed link

public String getNonProcessedLink(){        line 645
public boolean markLinkAsProcesed(String link){   line 705

getNonProcessedLinkFromDB looks for links with processed = false and returns one of them (LIMIT 1). Each thread starts 20 seconds after the previous one. Within one thread, crawling a page takes an estimated 1 or 2 seconds.

Line 98 keeps threads from grabbing the same URL.

As the logs show, one thread set the link to true, yet other threads still access it, even a long time afterwards.

All threads are separate. Even if there is a race, the database sets the link to true the moment the first thread processes it.

  • Are you re-using the same database connection? Commented Sep 11, 2013 at 14:53
  • I use new BasicDao().method(). I have also tried making those methods static, putting their calls in a synchronized block, and making the boolean they return volatile, but to no avail. Commented Sep 11, 2013 at 14:54

2 Answers

The question is not concise. There is a lot of code in there and you do not know what is going on. You need to break it down until you understand where it goes wrong, then show us that part.

Some potential sources of conflict:

  • You are opening a database connection for almost every operation. The normal flow of an application is to open a few connections, do some processing, then close them.
  • Are you handling database commits? I don't remember what the default setting is for a PostgreSQL database; you'll have to look into it.
  • A single URL can be in one of three states: unprocessed, being processed, and processed. I don't think you are handling the 'being processed' state at all. Because processing takes time and may fail, you have to account for those situations.
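To make the three states concrete, here is a minimal in-memory sketch. It is not the code from the question: the class LinkStates, its Status enum, and its method names are hypothetical, and a ConcurrentHashMap stands in for the links table. The point is only that a claim moves a link from unprocessed to being processed in one atomic step, and that a failed crawl can hand the link back.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the links table (assumption, not the
// asker's schema): each href has one of three states instead of a boolean.
class LinkStates {
    enum Status { UNPROCESSED, IN_PROGRESS, PROCESSED }

    private final Map<String, Status> links = new ConcurrentHashMap<>();

    // Registers a new link as unprocessed; an existing entry is left untouched.
    void add(String href) {
        links.putIfAbsent(href, Status.UNPROCESSED);
    }

    // Claims one unprocessed link. replace(key, expected, updated) is atomic,
    // so only one caller can move a given link to IN_PROGRESS.
    Optional<String> claimNext() {
        for (String href : links.keySet()) {
            if (links.replace(href, Status.UNPROCESSED, Status.IN_PROGRESS)) {
                return Optional.of(href);
            }
        }
        return Optional.empty();
    }

    // Called after a successful crawl.
    void markProcessed(String href) {
        links.replace(href, Status.IN_PROGRESS, Status.PROCESSED);
    }

    // Called when a crawl fails, so another thread can retry the link later.
    void release(String href) {
        links.replace(href, Status.IN_PROGRESS, Status.UNPROCESSED);
    }

    Status statusOf(String href) {
        return links.get(href);
    }
}
```

A worker would call claimNext(), crawl the link, and then call markProcessed(href) on success or release(href) on failure. In a real database the same idea would need a status column and an atomic update, which this sketch does not show.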

I did not read the logs because they are useless to me.

-edit for comment- Databases generally have transactions. Modifications you make in one transaction are not seen by other transactions until they are committed. Transactions can also be rolled back. You'll need to fetch the row you just updated and check whether the value has really changed. Do this in another transaction or on another connection.

The gap of 20 seconds looks like it applies only when a thread is started. Imagine a situation where Thread1 processes URL1 and Thread2 processes URL2. They both finish at about the same time. They both look for the next unprocessed URL (say URL3). They would both start processing this URL because neither knows that the other thread has started on it. You need a single place that hands out the URLs; a queue may be what you want to look at.
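As a rough illustration of the queue idea, the sketch below has all worker threads take URLs from one shared BlockingQueue. The class QueueCrawlDemo and the method crawlAll are made-up names, the URLs are placeholders, and the actual crawling is left out. It assumes the list of URLs is known up front, which may not match how the crawler in the question discovers links.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one shared queue hands each URL to exactly one worker.
class QueueCrawlDemo {

    // Starts workerCount threads over the given URLs and returns every URL
    // that was handed out more than once (empty if the URLs are distinct).
    static Set<String> crawlAll(List<String> urls, int workerCount) {
        BlockingQueue<String> pending = new LinkedBlockingQueue<>(urls);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        Set<String> duplicates = ConcurrentHashMap.newKeySet();

        Runnable worker = () -> {
            String url;
            // poll() removes the head atomically and returns null when the
            // queue is empty, so two threads never receive the same element.
            while ((url = pending.poll()) != null) {
                if (!seen.add(url)) {
                    duplicates.add(url);
                }
                // A real crawler would fetch and parse the page here.
            }
        };

        Thread[] threads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            threads[i] = new Thread(worker, "crawler-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/a", "http://example.com/b", "http://example.com/c");
        System.out.println("URLs handed out twice: " + crawlAll(urls, 10));
    }
}
```

With this layout the threads never ask the database for the next link themselves, so the question of which thread owns a URL is settled by the queue rather than by the processed flag.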

Logging might be improved if you knew which threads were working on which URLs. You also need a smaller sample size so that you can get your head around what is going on.


8 Comments

Thanks for responding! I have mentioned the exact line numbers that relate to the issue. The full code is there for anyone who wants to analyse it. 1. Multiple connections won't hurt; that's what databases are for. 2. What do you mean by DB commits? Do you mean Postgres uses a cache? 3. I understood you, but each thread has a gap of 20 seconds at start, and it takes about 1 to 5 seconds to crawl a page (process it).
@MasoodAhmad Try reading about transactions in databases. That is what commit and rollback are used for. Without knowing how transactions work you won't be able to build multithreaded DB applications (because transactions are the way databases handle concurrency).
Thanks for the edit. 1. I have done what you suggested for logging in the paste; it shows the URLs. 2. I got what you described earlier. I even set the link to processed = true at the start of crawl(), just to be sure it is marked as processed before the processing begins (that is, the thread owns the link and other threads should consider it processed). Still the same. It is a mystery.
@IgorRomanchenko I hope there is no row lock. Here are the queries. To mark: update links set processed = ? where href = ? and to get: SELECT href from links where processed = false limit 1
@MasoodAhmad Postgres generally does not use locks (but can if you want it to). The problem with your update is that its changes are not visible to other crawlers until the transaction is committed.

The comments and the answer from the helpers in this post were also correct. However, putting this at the start of the crawl() method body:

    // Fetch and mark under one lock so no other thread can pick the same link in between.
    synchronized (Crawler.class) {
        url = getNonProcessedLinkFromDB();
        new BasicDAO().markLinkAsProcesed(url);
    }

and this at the bottom of the crawl() method body (once processing is done):

    crawl(nonProcessedLinkFromDB);

actually solved the issue.

The problem was the gap between fetching a link and marking it as processed, which let other threads get the same link while the current thread was still working on it.

The synchronized block helped further.

Thanks to the helper "Fuber" on the IRC channels #java (QuakeNet) and ##javaee (Freenode),

and to ALL who supported me!
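For anyone who wants to see the effect in isolation, here is a small self-contained sketch. It is not my crawler code: SynchronizedClaimDemo, claimNextLink, and runCrawlers are made-up names, and a plain map stands in for the links table. The select and the mark happen inside one synchronized block, so no thread can fetch a link between another thread's fetch and its update.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the links table: href -> processed flag.
class SynchronizedClaimDemo {
    private final Map<String, Boolean> processed = new LinkedHashMap<>();
    private final Object lock = new Object();

    SynchronizedClaimDemo(List<String> hrefs) {
        for (String href : hrefs) {
            processed.put(href, false);
        }
    }

    // Plays the role of: SELECT href FROM links WHERE processed = false LIMIT 1
    private String getNonProcessedLink() {
        for (Map.Entry<String, Boolean> entry : processed.entrySet()) {
            if (!entry.getValue()) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Plays the role of: UPDATE links SET processed = true WHERE href = ?
    private void markLinkAsProcessed(String href) {
        processed.put(href, true);
    }

    // Select and mark form one critical section, so a link is claimed once.
    String claimNextLink() {
        synchronized (lock) {
            String href = getNonProcessedLink();
            if (href != null) {
                markLinkAsProcessed(href);
            }
            return href;
        }
    }

    // Starts threadCount crawler threads and returns every link they claimed.
    static List<String> runCrawlers(List<String> hrefs, int threadCount) {
        SynchronizedClaimDemo store = new SynchronizedClaimDemo(hrefs);
        List<String> claims = Collections.synchronizedList(new ArrayList<>());
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                String href;
                while ((href = store.claimNextLink()) != null) {
                    claims.add(href); // a real crawler would fetch the page here
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for crawlers", e);
            }
        }
        return claims;
    }
}
```

Note that a synchronized block only coordinates threads inside one JVM. If several crawler processes share the same database, the claim would have to be made atomic in the database itself.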
