I have 3 tables:
- Pi - images
- Pidl - image download log => references Pi
- Pirl - image resize log => references Pidl
Basically, an image is downloaded and a log record is created in Pidl. After that, the image is resized and a record is created in Pirl. That Pirl record is linked to the Pidl record of the download it was resized from.
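To make the relationships concrete, here is a minimal sketch that walks from an image to its download log rows and on to their resize log rows. It only uses the column names from the schema further down; the LEFT JOINs are my choice so that images without any log rows still show up.

```sql
-- One row per (image, download log, resize log) combination.
-- Images with no downloads, or downloads with no resizes, appear with NULLs.
SELECT
    pi.pi_id,
    pidl.pidl_id,
    pidl.pidl_status,
    pirl.pirl_id,
    pirl.pirl_status
FROM Pi AS pi
LEFT JOIN Pidl AS pidl ON pidl.pidl_pi_id = pi.pi_id
LEFT JOIN Pirl AS pirl ON pirl.pirl_pidl_id = pidl.pidl_id
ORDER BY pi.pi_id, pidl.pidl_id, pirl.pirl_id;
```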
I am writing a query to find which images need to be resized, and it mostly queries Pidl. The algorithm I've devised is simple:
for each Image in Pi {
    pidlA = newest_pidl(Image);
    if (pidlA.status == success) {
        pirlA = newest_pirl(Image);
        // go if the image was never resized, or the newest resize came from a different download
        if (pirlA == null || pirlA.pidl.hash != pidlA.hash) {
            go;
        }
        else if (pirlA.status != success) {
            failed_attempts = failed_pirl_count(pirlA, newest_successful_pirl(Image));
            decide based on pirlA.time and failed_attempts whether to go or not;
        }
        else {
            dont go;
        }
    }
    else {
        dont go;
    }
}
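For a single image, the newest_pidl(Image) step can be sketched in SQL as below. This is only an illustration, assuming "newest" means the highest pidl_time with ties broken by the highest pidl_id (the same ordering my full query uses); the literal 3 is just one pi_id from the sample data.

```sql
-- Newest download log row for one image (pi_id = 3 is an example value).
SELECT pidl_id, pidl_status, pidl_hash
FROM Pidl
WHERE pidl_pi_id = 3
ORDER BY pidl_time DESC, pidl_id DESC
LIMIT 1;
```

The difficulty is doing this for every image at once, which is why the full query uses self-joins instead of ORDER BY ... LIMIT 1.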
And here is my query. It is not finished yet (the failed-attempts part is missing), but it's already too slow, so I'd like to fix that first.
SELECT
    pidl1A.pidl_id
FROM Pidl AS pidl1A
LEFT JOIN Pidl AS pidl2A
    ON (
        pidl1A.pidl_pi_id = pidl2A.pidl_pi_id AND
        pidl2A.pidl_status = 1 AND
        (pidl2A.pidl_time > pidl1A.pidl_time OR
            (pidl2A.pidl_id > pidl1A.pidl_id AND pidl1A.pidl_time = pidl2A.pidl_time)
        )
    )
LEFT JOIN (
    # newest pirl subquery
    SELECT
        pidl1B.pidl_pi_id AS sub_pi_id,
        pidl1B.pidl_hash AS sub_pidl_hash,
        pirl1B.pirl_id AS sub_pirl_id,
        pirl1B.pirl_status AS sub_pirl_status
    FROM Pirl AS pirl1B
    INNER JOIN Pidl AS pidl1B ON (pirl1B.pirl_pidl_id = pidl1B.pidl_id)
    LEFT JOIN (
        SELECT
            pidl2B.pidl_pi_id AS sub_pi_id,
            pirl2B.pirl_id AS sub_pirl_id,
            pirl2B.pirl_time AS sub_pirl_time
        FROM Pirl AS pirl2B
        INNER JOIN Pidl AS pidl2B ON (pirl2B.pirl_pidl_id = pidl2B.pidl_id)
        WHERE 1
    ) AS pirl3B
        ON (
            pirl3B.sub_pi_id = pidl1B.pidl_pi_id AND
            (pirl3B.sub_pirl_time > pirl1B.pirl_time OR
                (pirl3B.sub_pirl_time = pirl1B.pirl_time AND
                 pirl3B.sub_pirl_id > pirl1B.pirl_id)
            )
        )
    WHERE
        pirl3B.sub_pirl_id IS NULL
) AS pirl1A
    ON (pirl1A.sub_pi_id = pidl1A.pidl_pi_id)
WHERE
    pidl1A.pidl_status = 1 AND pidl2A.pidl_id IS NULL
    AND (
        pirl1A.sub_pirl_id IS NULL
        OR (
            pidl1A.pidl_hash != pirl1A.sub_pidl_hash
        )
        OR (
            pirl1A.sub_pirl_status != 1
        )
    )
And this is my db schema:
CREATE TABLE Pi (
`pi_id` int,
PRIMARY KEY (`pi_id`)
)
;
CREATE TABLE Pidl
(
`pidl_id` int,
`pidl_pi_id` int,
`pidl_status` int,
`pidl_time` int,
`pidl_hash` varchar(16),
PRIMARY KEY (`pidl_id`)
)
;
alter table Pidl
add constraint fk1_branchNo foreign key (pidl_pi_id) references Pi (pi_id);
CREATE TABLE Pirl
(
`pirl_id` int,
`pirl_pidl_id` int,
`pirl_status` int,
`pirl_time` int,
PRIMARY KEY (`pirl_id`)
)
;
alter table Pirl
add constraint fk2_branchNo foreign key (pirl_pidl_id) references Pidl (pidl_id);
INSERT INTO Pi
(`pi_id`)
VALUES
(3),
(4),
(5);
INSERT INTO Pidl
(`pidl_id`, `pidl_pi_id`,`pidl_status`,`pidl_time`, `pidl_hash`)
VALUES
(1, 3, 1,100, 'hashA'),
(2, 3, 1,150,'hashB'),
(3, 4, 2, 200,'hashC'),
(4, 3, 1, 200,'hashA')
;
INSERT INTO Pirl
(`pirl_id`, `pirl_pidl_id`,`pirl_status`,`pirl_time`)
VALUES
(1, 2, 0,100),
(2, 3, 1,150),
(3, 1, 2, 200)
;
Of course, with 3 records it's fast, but with around 10-30k records it takes more than 5 seconds. What I've found is that the part that makes it slow is the last part of the WHERE clause:
    AND (
        pirl1A.sub_pirl_id IS NULL
        OR (
            pidl1A.pidl_hash != pirl1A.sub_pidl_hash
        )
        OR (
            pirl1A.sub_pirl_status != 1
        )
    )
Another strange thing I've found is that adding DISTINCT made the query a bit faster, but still not fast enough.
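For anyone who wants to reproduce this, the sketch below shows composite indexes that might be relevant to the join and tie-break conditions, plus an EXPLAIN of just the Pidl anti-join. It assumes MySQL; the index names are hypothetical, and I have not verified that the optimizer actually uses these indexes or that they change the timing.

```sql
-- Hypothetical index names; column order follows the join and tie-break conditions.
CREATE INDEX idx_pidl_pi_status_time
    ON Pidl (pidl_pi_id, pidl_status, pidl_time, pidl_id);

CREATE INDEX idx_pirl_pidl_time
    ON Pirl (pirl_pidl_id, pirl_time, pirl_id);

-- EXPLAIN shows how the anti-join on Pidl alone would be executed.
EXPLAIN
SELECT pidl1A.pidl_id
FROM Pidl AS pidl1A
LEFT JOIN Pidl AS pidl2A
    ON pidl1A.pidl_pi_id = pidl2A.pidl_pi_id
   AND pidl2A.pidl_status = 1
   AND (pidl2A.pidl_time > pidl1A.pidl_time
        OR (pidl2A.pidl_id > pidl1A.pidl_id AND pidl1A.pidl_time = pidl2A.pidl_time))
WHERE pidl1A.pidl_status = 1
  AND pidl2A.pidl_id IS NULL;
```

Running EXPLAIN on the full query in the same way should show whether the derived table pirl1A is joined without an index, which would be consistent with the slowdown appearing only once its columns are used in the WHERE clause.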