
I have a question about optimizing an index in Postgres. I didn't find much help online, and I have struggled to find the answer myself through testing.

I have this table:

CREATE TABLE "public"."crawls" (
    "id" uuid NOT NULL DEFAULT uuid_generate_v4(),
    "parent_id" uuid,
    "group_id" timestamp,
    "url" varchar(2083) NOT NULL,
    "done" boolean,
    PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "parentid_groupid_url" ON "public"."crawls" USING BTREE ("parent_id","group_id","url");

It's a URL store, used to compute a comprehensive list of URLs that are UNIQUE per parent and per group. I only need exact matches on this index. This means a parent_id can contain the same URL multiple times as long as the group_id is different.

The table contains hundreds of millions of URLs and is mainly used for writes; the UNIQUE index is for deduplication. These two queries are the main workload:

UPDATE crawls
SET
  done = TRUE
WHERE
  url = $1 AND
  parent_id = $2 AND
  group_id = $3;

INSERT INTO crawls (
  url,
  parent_id,
  group_id
) VALUES
  ('long url', uuid, date)
ON CONFLICT (parent_id, group_id, url) DO NOTHING;

Currently performance is okay but could be better, and the index is larger than the table itself because of the url column.

I was wondering how I could improve the size and/or the performance (both, if possible)?

I thought about adding a new column with a hash (md5, sha1) of the URL and using it in the index instead of the URL itself, so that the indexed length is consistent, smaller, and maybe faster for Postgres, but I didn't find any help on that. I'm not sure it's efficient because of the "randomness" of a hash, and I have a hard time testing this hypothesis due to the size of the table and the time it takes to build the index on my prod.


Thanks,

  • What is parent_id? Maybe REFERENCES public.crawls.id (a self-reference)? Commented Feb 8, 2020 at 15:47
  • It's an outside ref; it's not named like that in the real db, but I wanted to simplify the schema for the question (and I failed, ahah). Commented Feb 8, 2020 at 20:48

1 Answer


I thought about using a new column to hash (md5, sha1) the URL and use it in the index instead of the URL, so that the length is consistent, smaller and may be faster for Postgres

create unique index on crawls (parent_id, group_id, md5(url));

This will automatically enforce uniqueness (it will also reject md5 collisions between URLs that are actually distinct, but in the absence of malice the chances of that happening are tiny). However, it will not automatically be used for fast look-ups; you have to adapt your queries so the planner can use it:

WHERE
  md5(url) = md5($1) AND
  parent_id = $2 AND
  group_id = $3
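
The same goes for the deduplicating INSERT: ON CONFLICT can only infer an expression index when the conflict target spells out the indexed expressions. A sketch against the index above (column names taken from the question):

```sql
-- The conflict target must repeat the indexed expressions, including
-- md5(url), for Postgres to match them to the expression index.
INSERT INTO crawls (url, parent_id, group_id)
VALUES ($1, $2, $3)
ON CONFLICT (parent_id, group_id, md5(url)) DO NOTHING;
```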

You could save more space by using a representation shorter than hex:

create unique index on crawls (parent_id, group_id, decode(md5(url), 'hex'));

But that will make it even more cumbersome to use.
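
One way around the cumbersome expressions, assuming you are on Postgres 12 or later, is a stored generated column holding the 16-byte digest. This is a sketch (the url_md5 column name is made up for illustration):

```sql
-- Hypothetical url_md5 column: computed once on write, 16 bytes per row.
ALTER TABLE crawls
    ADD COLUMN url_md5 bytea
    GENERATED ALWAYS AS (decode(md5(url), 'hex')) STORED;

CREATE UNIQUE INDEX ON crawls (parent_id, group_id, url_md5);

-- Queries can then reference the column by name:
--   WHERE parent_id = $2 AND group_id = $3
--     AND url_md5 = decode(md5($1), 'hex')
```

The trade-off is 16 extra bytes per heap row, in exchange for queries and conflict targets that don't have to repeat the hashing expression.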

I'm not sure it's efficient because of the "randomness" of a hash

It depends entirely on your usage pattern and your data distribution. If you commonly access a series of records with the same parent_id and group_id and adjacent urls, and the number of records with same parent_id and group_id is large, then hashing the urls could decrease the effectiveness of caching.

I have hard time testing this hypothesis due to the size and the time to build the index on my prod.

Not having a test environment is working with both hands tied behind your back.
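
Short of a full test environment, one cheap way to check the hypothesis is to index a random sample of the production table and compare sizes; a sketch using TABLESAMPLE (the 1% figure and table names are arbitrary):

```sql
-- Copy roughly 1% of the table's pages into a scratch table.
CREATE TABLE crawls_sample AS
    SELECT * FROM crawls TABLESAMPLE SYSTEM (1);

-- Build both index variants; a subset of unique rows is still unique.
CREATE UNIQUE INDEX sample_plain ON crawls_sample (parent_id, group_id, url);
CREATE UNIQUE INDEX sample_md5   ON crawls_sample (parent_id, group_id, md5(url));

-- Compare on-disk sizes.
SELECT pg_size_pretty(pg_relation_size('sample_plain')) AS plain,
       pg_size_pretty(pg_relation_size('sample_md5'))   AS hashed;
```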


4 Comments

Would you advise that pattern in a real prod use case? I'm surprised Postgres doesn't have a "better" optimization for exact matches. The minimal change required seems largely acceptable compared to the potential gain. > "and the number of records with same parent_id and group_id is large" It varies from a hundred to 500k. I'm not sure how a btree index works with a text column (and thus with md5): will it try to split the tree on the long text, or will it end up scanning the whole subtree after parent->group?
From a stability/bug perspective I'd have no problem using it in production. But I'd want to see evidence it actually improved performance; otherwise, what's the point? I don't understand what you are asking about "split the tree with long text".
Why not use a hash index on the URL column?
@iRestMyCaseYourHonor Because he wanted a multi-column index.
