I have a question about optimizing an index in Postgres. I didn't find much help online, and I have struggled to find the answer myself by testing.
I have this table:
CREATE TABLE "public"."crawls" (
"id" uuid NOT NULL DEFAULT uuid_generate_v4(),
"parent_id" uuid,
"group_id" timestamp,
"url" varchar(2083) NOT NULL,
"done" boolean,
PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "parentid_groupid_url" ON "public"."crawls" USING BTREE ("parent_id","group_id","url");
It's a URL store used to compute a comprehensive list of URLs that are UNIQUE per parent and per group. I only need exact matches on this index. This means the same parent_id can have the same URL multiple times, as long as the group_id is different.
The table contains hundreds of millions of URLs and is mainly used for writes; the UNIQUE index is there for deduplication.
UPDATE crawls
SET
done = TRUE
WHERE
url = $1 AND
parent_id = $2 AND
group_id = $3;
INSERT
INTO crawls (
url,
parent_id,
group_id
) VALUES
($1, $2, $3)
ON CONFLICT ("parent_id","group_id","url") DO NOTHING;
Currently the performance is okay but could be better, and the index is larger than the table itself because of the url column.
I was wondering how I could reduce the size and/or improve the performance (both if possible)?
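For reference, the table-versus-index size comparison can be reproduced with Postgres's built-in size functions. This is a minimal sketch, assuming the index lives in the public schema alongside the table:

```sql
-- pg_table_size covers the heap plus TOAST; pg_relation_size on the index
-- name reports the on-disk size of that single index.
SELECT pg_size_pretty(pg_table_size('public.crawls'))                  AS table_size,
       pg_size_pretty(pg_relation_size('public.parentid_groupid_url')) AS index_size;
```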
I thought about adding a new column holding a hash (md5, sha1) of the URL and using it in the index instead of the URL, so that the indexed value has a consistent, smaller length and may be faster for Postgres, but I didn't find any help on that. I'm not sure it's efficient because of the "randomness" of a hash, and I have a hard time testing this hypothesis because of the table size and the time it takes to build the index on my production database.
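To make the hashing idea concrete, here is a rough sketch that has not been benchmarked on this data. It assumes an expression index instead of a new column, so the table itself does not need to be rewritten, and the index name parentid_groupid_urlhash is hypothetical. md5() returns 32 hex characters, which cast to the 16-byte uuid type, so each index entry stores a fixed 16 bytes in place of the full URL:

```sql
-- Hypothetical index name. CONCURRENTLY builds the index without blocking writes.
CREATE UNIQUE INDEX CONCURRENTLY parentid_groupid_urlhash
    ON public.crawls (parent_id, group_id, (md5(url)::uuid));

-- The conflict target repeats the indexed expression so Postgres can infer the index.
INSERT INTO public.crawls (url, parent_id, group_id)
VALUES ($1, $2, $3)
ON CONFLICT (parent_id, group_id, (md5(url)::uuid)) DO NOTHING;

-- The lookup filters on the same expression so the index is usable.
-- The extra url = $1 check guards against two different URLs sharing an MD5.
UPDATE public.crawls
SET done = TRUE
WHERE parent_id = $2
  AND group_id = $3
  AND md5(url)::uuid = md5($1::text)::uuid
  AND url = $1;
```

With this layout, uniqueness is enforced on the hash rather than on the URL itself, so an MD5 collision within the same parent_id and group_id would be treated as a duplicate. The existing parentid_groupid_url index would only be dropped once the new one is in place.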
Thanks,
REFERENCING (public.crawls.id)? (a self-reference)