Multiple threads using the same value fetched by a database DAO class method

Question

Status: solved

I had to make a pastebin as I had to point out line numbers.

Note: I am not using ExecutorService or thread pools. I just want to understand what is wrong with starting and using threads this way. If I use 1 thread, the app works perfectly!

related links:

http://www.postgresql.org/docs/9.1/static/transaction-iso.html
http://www.postgresql.org/docs/current/static/explicit-locking.html

main app: http://pastebin.com/i9rVyari
logs: http://pastebin.com/2c4pU1K8 , http://pastebin.com/2S3301gD

I am starting many threads (10) in a for loop, each by instantiating a runnable class, but it seems every thread gets the same result from the DB. I fetch a string from the DB and then change it, yet each thread gets the same string, even though another thread has already changed it. I am using JDBC with PostgreSQL. What might the usual issues be?

See line 252 and line 223.

The link is marked as processed (true) in the DB. The other threads of the crawler class do the same. So when line 252 gets a link, it should be one with processed = false, but I see all threads take the same link.

When one of the threads has crawled the link, it sets processed = true. The others should then not crawl (get) it if it is marked processed = true.

getNonProcessedLinkFromDB() returns a non processed link

public String getNonProcessedLink(){        line 645
public boolean markLinkAsProcesed(String link){   line 705

getNonProcessedLinkFromDB looks for links with processed = false and returns one of them (limit 1). Each thread starts 20 seconds after the previous one.
Within one thread, crawling a page takes an estimated 1 or 2 seconds.

Line 98 keeps threads from grabbing the same URL.

If you look at the result, one thread set it to true, yet the others still access it, and a long time afterwards.

All threads are separate. Even if one races ahead, the DB marks the link as true at the moment the first thread processes it.

I use new BasicDao().method(). I have also tried making those methods static, putting their calls in a synchronized block, and making the boolean they return volatile, but to no avail.
– Masood Ahmad, Commented Sep 11, 2013 at 14:54

pimaster · Accepted Answer · 2013-09-11 15:18:59Z

2

This is a case of a question that is not concise. There is a lot of code in there and you have no idea what is going on. You need to break it down until you understand where it is going wrong, then show us that bit.

Some things of potential conflict.

1. You are opening a database connection for almost every operation. The normal flow of an application is to open a few connections, do some processing, then close them.
2. Are you handling database commits? I don't remember what the default setting is for a Postgres database; you'll have to look into it.
3. A single URL can be in one of 3 states: unprocessed, being processed, processed. I don't think you are handling the 'being processed' state at all. Because processing takes time and may fail, you have to account for those situations.
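The three states in the last point can be sketched in memory as follows. This is only an illustration under assumptions: `LinkStates`, `claimNext`, `markProcessed` and `release` are hypothetical names, the map is a stand-in for the links table, and none of it comes from the asker's pastebin.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the links table; not the asker's DAO. */
final class LinkStates {
    enum State { UNPROCESSED, IN_PROGRESS, PROCESSED }

    private final Map<String, State> links = new ConcurrentHashMap<>();

    void add(String url) {
        links.putIfAbsent(url, State.UNPROCESSED);
    }

    /** Atomically claims one unprocessed URL; at most one caller wins each URL. */
    Optional<String> claimNext() {
        for (String url : links.keySet()) {
            // replace(key, expected, next) succeeds for only one thread per URL
            if (links.replace(url, State.UNPROCESSED, State.IN_PROGRESS)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    /** Called when the crawl of a claimed URL finished successfully. */
    void markProcessed(String url) {
        links.replace(url, State.IN_PROGRESS, State.PROCESSED);
    }

    /** Called when the crawl failed, so that another worker can retry the URL. */
    void release(String url) {
        links.replace(url, State.IN_PROGRESS, State.UNPROCESSED);
    }

    State stateOf(String url) {
        return links.get(url);
    }

    public static void main(String[] args) {
        LinkStates states = new LinkStates();
        states.add("http://example.com/a"); // made-up URL
        String url = states.claimNext().orElseThrow(IllegalStateException::new);
        System.out.println(url + " is " + states.stateOf(url));
        states.markProcessed(url);
        System.out.println(url + " is " + states.stateOf(url));
    }
}
```

The point of the middle state is that a claimed URL is invisible to other workers while it is being crawled, and a failed crawl can hand it back with `release` instead of leaving it marked as done.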

I did not read the logs because they are useless to me.

Edit in response to a comment: Databases generally have transactions. Modifications you make in one transaction are not seen by other transactions until they are committed. Transactions can also be rolled back. You'll need to fetch the row you just updated and check whether the value has really changed. Do this in another transaction or on another connection.

The gap of 20 seconds looks like it only applies when the threads are started. Imagine a situation where Thread1 processes URL1 and Thread2 processes URL2. They both finish at about the same time. They both look for the next unprocessed URL (say URL3). They would both start processing this URL because neither knows that the other thread has started on it. You need a single place that hands out the URLs; a queue is possibly what you'd want to look at.
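A minimal sketch of the queue idea, using plain threads as in the question. `QueueCrawlDemo` and `crawlAll` are hypothetical names, the URLs are made up, and the `seen.add(url)` line is a placeholder for the real crawl; it assumes all URLs are known up front, which the asker's crawler may not satisfy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueCrawlDemo {

    /** Runs workerCount threads over the URLs and returns how many URLs were handled in total. */
    static int crawlAll(List<String> urls, int workerCount, Set<String> seen) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
        AtomicInteger handled = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(() -> {
                String url;
                // poll() removes the head atomically, so no two threads receive the same URL
                while ((url = queue.poll()) != null) {
                    seen.add(url); // placeholder for the real crawl
                    handled.incrementAndGet();
                }
            }, "crawler-" + i);
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for crawlers", e);
            }
        }
        return handled.get();
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("http://example.com/page" + i); // made-up URLs
        }
        Set<String> seen = ConcurrentHashMap.newKeySet();
        int handled = crawlAll(urls, 10, seen);
        System.out.println(handled + " URLs handled, " + seen.size() + " distinct");
    }
}
```

If the handled count equals the number of distinct URLs, no URL was given to two threads. Naming the threads ("crawler-0" and so on) also makes it easier to log which thread worked on which URL.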

Logging might be improved if you knew which threads were working on which URLs. You also need a smaller sample size so that you can get your head around what is going on.

edited Sep 11, 2013 at 15:18

answered Sep 11, 2013 at 14:59

pimaster

1,967 reputation, 10 silver badges, 14 bronze badges


8 Comments

Masood Ahmad Over a year ago

Thanks for responding!!! I have mentioned the exact line numbers that relate to the issue. The full code is there for anyone who wants to analyse it. 1. Multiple connections won't hurt; that's what DBs are for. 2. What do you mean by DB commits? Do you mean Postgres uses a cache? 3. I understood you, but each thread has a gap of 20 seconds at the start, and it takes about 1 to 5 seconds to crawl (process) a page.

Ihor Romanchenko Over a year ago

@MasoodAhmad Try reading about transactions in databases. That is what commit and rollback are used for. Without knowing how transactions work you won't be able to build multithreaded DB applications (because transactions are the way databases handle concurrency).

Masood Ahmad Over a year ago

Thanks for the edit. 1. I have done what you suggested in the logging paste; it shows the URLs. 2. I got what you described earlier. I even set the link to processed = true at the start of crawl(), just to be sure it is marked as processed before the processing begins (that is, the thread owns the link and other threads should consider it processed). Still the same. It is a mystery.

Masood Ahmad Over a year ago

@IgorRomanchenko I hope there is no row lock. Here are the queries.

To mark: update links set processed = ? where href = ?
To get: SELECT href from links where processed = false limit 1

Ihor Romanchenko Over a year ago

@MasoodAhmad Postgres generally does not use locks (but can if you want it to). The problem with your update is that its changes are not visible to other crawlers until the transaction is committed.
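One way to make the visibility point concrete in JDBC is to claim a link with a single conditional UPDATE and commit it straight away. The sketch below is an assumption-laden illustration, not the asker's code: `LinkClaimer` and `tryClaim` are hypothetical names, and only the table and column names (links, href, processed) are taken from the queries quoted above. JDBC connections start in auto-commit mode, so the explicit commit only matters if auto-commit has been switched off somewhere.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class LinkClaimer {
    // Table and column names come from the queries quoted in the comments.
    static final String CLAIM_SQL =
            "UPDATE links SET processed = true WHERE href = ? AND processed = false";

    /** Returns true only for the caller whose UPDATE actually flipped the row. */
    static boolean tryClaim(Connection conn, String href) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(CLAIM_SQL)) {
            ps.setString(1, href);
            // 1 row updated: this caller claimed the link; 0 rows: someone else already did
            boolean claimed = ps.executeUpdate() == 1;
            conn.commit(); // make the change visible to other connections
            return claimed;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

A crawler thread would call `tryClaim` right after selecting a candidate link and skip the link when it returns false. Because the check for processed = false and the write happen in one statement, two threads cannot both succeed on the same row.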


Masood Ahmad · Accepted Answer · 2013-09-11 21:59:29Z

0

The comments and responses from the helpers in this post were also correct, but putting this

at the start of the crawl() method body:

    synchronized(Crawler.class){
        url = getNonProcessedLinkFromDB();
        new BasicDAO().markLinkAsProcesed(url);
    }

and at the bottom of crawl() method body (when it has done processing):

    crawl(nonProcessedLinkFromDB);

actually solved the issue.

The cause was the gap between fetching a link and marking it processed = true, which let other threads get the same link while the current thread was still working on it.

The synchronized block helped further.
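Why fetching and marking under one lock closes that gap can be shown with a small in-memory sketch. `SynchronizedClaim` and `claimNext` are hypothetical names, and the map stands in for the links table; it is not the actual crawler code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SynchronizedClaim {
    // Hypothetical in-memory stand-in for the links table: href -> processed flag.
    private final Map<String, Boolean> processed = new LinkedHashMap<>();
    private final Object lock = new Object();

    void add(String href) {
        synchronized (lock) {
            processed.putIfAbsent(href, false);
        }
    }

    /** Fetches and marks under one lock, so no other thread can slip in between the two steps. */
    String claimNext() {
        synchronized (lock) {
            for (Map.Entry<String, Boolean> entry : processed.entrySet()) {
                if (!entry.getValue()) {
                    entry.setValue(true); // mark before releasing the lock
                    return entry.getKey();
                }
            }
            return null; // nothing left to crawl
        }
    }

    public static void main(String[] args) {
        SynchronizedClaim store = new SynchronizedClaim();
        store.add("http://example.com/a"); // made-up URLs
        store.add("http://example.com/b");
        System.out.println(store.claimNext());
        System.out.println(store.claimNext());
        System.out.println(store.claimNext());
    }
}
```

Note that a synchronized block only coordinates threads inside one JVM. If crawlers ever run in several processes, the claim has to be made atomic in the database itself.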

Thanks to the helper "Fuber" on the IRC channels #java (Quakenet) and ##javaee (Freenode),

and ALL who supported me!

edited Sep 11, 2013 at 21:59

answered Sep 11, 2013 at 21:54

Masood Ahmad

741 reputation, 4 gold badges, 15 silver badges, 38 bronze badges
