
I am looking for an architectural solution to the following problem:

General description of problem

I have many distinct data entities (approximately 15 million). Every entity is associated with a set of keywords (or tags), ranging from a few to hundreds per entity in the worst case.

Given N different keywords, my task is to retrieve results in the following order:

  • all entities associated with all N given keywords;
  • entities that contain any combination of N-1 of the given keywords;
  • entities that contain any combination of N-2 of the given keywords;
  • and so on (perhaps only down to a certain N-K limit, but in the general case, down to a 1-keyword match).

Naive approach

The naive solution I came up with is to build a simple inverted index for the keywords using a MySQL/PostgreSQL RDBMS. It would contain two tables:

Table Keywords              Table Entities
---------------------       ---------------------
   id      keyword            id     keyword_id
---------------------       ---------------------
    1         tag1             1              1
    2         tag2             1              2             
    3         tag3             2              3 
  • Table Keywords stores the keywords;
  • Table Entities stores the relations between entity ids and keyword ids.

For each keyword1 & keyword2 & ... & keywordN query I would retrieve the full entity-id set for every query keyword and then perform the N-keyword, N-1-keyword, etc. matching manually at the application level.
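The application-level matching step described above can be sketched in Python. This is a minimal illustration, not the poster's actual code; it assumes each keyword's entity-id set has already been fetched from the database:

```python
from collections import Counter, defaultdict

def rank_entities(id_sets, min_matches=1):
    """Given one set of entity ids per query keyword, group entities
    by how many of the N keywords they matched, best tier first."""
    counts = Counter()
    for ids in id_sets:          # one id set per query keyword
        counts.update(ids)
    tiers = defaultdict(list)
    for entity_id, n in counts.items():
        if n >= min_matches:     # N-K cut-off from the problem statement
            tiers[n].append(entity_id)
    # N-keyword matches first, then N-1, N-2, ...
    return [(n, sorted(ids)) for n, ids in sorted(tiers.items(), reverse=True)]

# keyword1 -> {1, 2}, keyword2 -> {1, 3}, keyword3 -> {1, 2}
print(rank_entities([{1, 2}, {1, 3}, {1, 2}]))
# -> [(3, [1]), (2, [2]), (1, [3])]
```

Note that this is exactly the step that becomes expensive when a single tag maps to millions of ids, which is what the "Problems" section below is about.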

Problems

Obviously this approach runs into two problems:

  1. Fetching the id sets from a billion-row Entities table takes a long time (even with an index);
  2. The application-level search for N-keyword matches is also slow.

For both problems, consider that in the general case a single tag can be associated with several million entries.

How to deal efficiently with these problems?

2 Comments
  • Repost of dba.stackexchange.com/q/53877/7788 . Please do not copy and paste questions between sites. It wastes everybody's time. Commented Nov 25, 2013 at 5:38
  • @CraigRinger I understood, the cross-post is deleted now. Commented Nov 25, 2013 at 10:03

3 Answers


I'd use the intarray extension and a GiST index for this.

Store your entities with tag arrays, e.g.:

CREATE EXTENSION intarray;

CREATE TABLE entity(
    entity_id BIGSERIAL PRIMARY KEY,
    tags integer[] not null
);

INSERT INTO entity(tags) values (ARRAY[1,2,3]);
INSERT INTO entity(tags) values (ARRAY[1,3,5]);
INSERT INTO entity(tags) values (ARRAY[1]);
INSERT INTO entity(tags) values (ARRAY[]::integer[]);

CREATE INDEX entity_tags_idx ON entity USING GIST(tags);

and query with something vaguely like:

SELECT
    *,
    ARRAY[1,3] & tags AS matched_tags 
FROM entity 
WHERE ARRAY[1,3] && tags 
ORDER BY array_length(ARRAY[1,3] & tags,1) DESC;

The index will be used to exclude rows that don't have any matching tags. The result set will then get sorted on the number of matching tags in descending order. No order is imposed within groups that have the same number of matching tags but you could add a second sort key for that.

This should work well so long as each entity doesn't have a really huge list of tags. Don't calculate the "matched_tags" if you don't need it. If you do need it, consider wrapping its calculation up into a subquery then using the calculated value in the ORDER BY instead of re-calculating it there.

You'll probably want a machine with enough RAM to fit the GiST index in it. If the UPDATE / INSERT rate is quite low you could use a GIN index instead; performance for GIN is better for data that changes very little, and very bad for data that changes a lot.


2 Comments

Thank you for the brilliant advice and inspiration for further work! In my application's case it seems better to use a GIN index, because I am going to operate on static datasets, updating the whole dataset at relatively long intervals.
What do you think about storing all the data in one big table? Is it better to implement some kind of sharding across several tables, or does it not matter (or, conversely, would it hurt performance)? I am really worried about the query latency for this kind of query against a 15-million-entry table with lots of high-coverage tags... (P.S.: I actually have 49 GB of RAM on my box.)

You could merge it all into one table, if I'm understanding your schema correctly. I apologize in advance for the lengthy schema creation, but I wanted to prove to myself that it would, in fact, use the indexes. This example uses Postgres, and you can create a GiST or GIN index on relation if you install the intarray extension. I tested on Postgres 9.3.

create table keyword (id serial primary key, tag varchar, relation integer[]);

insert into keyword(id, tag,relation) values(1,'tag1',array[1]);
insert into keyword(id, tag,relation) values(2,'tag2',array[1,2]);
insert into keyword(id, tag,relation) values(3,'tag3',array[1,2,3]);
insert into keyword(id, tag,relation) values(4,'tag4',array[1,2,3,4]);
insert into keyword(id, tag,relation) values(5,'tag5',array[1,2,3,4,5]);
insert into keyword(id, tag,relation) values(6,'tag6',array[1,2,3,4,5,6]);
insert into keyword(id, tag,relation) values(7,'tag7',array[1,2,3,4,5,6,7]);
insert into keyword(id, tag,relation) values(8,'tag8',array[1,2,3,4,5,6,7,8]);
insert into keyword(id, tag,relation) values(9,'tag9',array[1,2,3,4,5,6,7,8,9]);
insert into keyword(id, tag,relation) values(10,'tag10',array[1,2,3,4,5,6,7,8,9,10]);
insert into keyword(id, tag,relation) values(11,'tag11',array[11]);
insert into keyword(id, tag,relation) values(12,'tag12',array[12]);
insert into keyword(id, tag,relation) values(13,'tag13',array[13]);
insert into keyword(id, tag,relation) values(14,'tag14',array[14]);
insert into keyword(id, tag,relation) values(15,'tag15',array[15]);
insert into keyword(id, tag,relation) values(16,'tag16',array[16,13,12]);
insert into keyword(id, tag,relation) values(17,'tag17',array[17,10,9,5,2,1]);
insert into keyword(id, tag,relation) values(18,'tag18',array[18,1,2,3]);
insert into keyword(id, tag,relation) values(19,'tag19',array[19,1]);
insert into keyword(id, tag,relation) values(20,'tag20',array[20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]);
insert into keyword(id, tag,relation) values(21,'tag21',array[21]);
insert into keyword(id, tag,relation) values(22,'tag22',array[22]);
insert into keyword(id, tag,relation) values(23,'tag23',array[23]);
insert into keyword(id, tag,relation) values(24,'tag24',array[24]);
insert into keyword(id, tag,relation) values(25,'tag25',array[25]);
insert into keyword(id, tag,relation) values(26,'tag26',array[26]);
insert into keyword(id, tag,relation) values(27,'tag27',array[27]);
insert into keyword(id, tag,relation) values(28,'tag28',array[28]);
insert into keyword(id, tag,relation) values(29,'tag29',array[29]);
insert into keyword(id, tag,relation) values(30,'tag30',array[30]);
insert into keyword (id, tag, relation) (select generate_series, 'tag'||generate_series, array[30] from generate_series(31,50));
insert into keyword (id, tag) (select generate_series, 'tag'||generate_series from generate_series(51,500));

create index on keyword(array_length(relation,1));
/*Uncomment the line below if you have intarray installed */
/*create index on keyword using gist(relation);*/
analyze  keyword;

So, to find all elements that have 5 relationships with other tags, simply run the following:

select * from keyword where array_length(relation,1)=5;

To find all elements that are related with tag 17, run the following:

select * from keyword where relation @> array[17];

The relation array column could potentially hold duplicate values, which would screw things up, so you could write a function and a check constraint to prevent this, or handle it in the app; a check constraint could add substantially to the cost of an insert.
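If you handle the duplicates in the application instead of with a check constraint, normalizing the array before insert is a one-liner. A minimal sketch (the function name is mine, not part of the answer's schema):

```python
def normalize_relation(tags):
    """Deduplicate and sort a list of tag ids before it is stored in the
    integer[] column, so no check constraint has to fire in the database."""
    return sorted(set(tags))

print(normalize_relation([17, 10, 9, 5, 2, 1, 17]))
# -> [1, 2, 5, 9, 10, 17]
```

Sorting is optional for correctness but keeps the stored arrays in a canonical form, which makes manual inspection and equality comparisons easier.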

Feel free to play around with this schema on SQLFiddle; I've created it here: SqlFiddle

2 Comments

Thank you so much for your involvement and especially for the SQLFiddle sandbox for experiments! Although the queries offered by Craig Ringer in his answer satisfy the exact requirements of the problem more fully, you suggested the same GiST+intarray approach, which seems to be the efficient solution!
@zavg Thanks. Didn't want to submit a duplicate approach, but when I started working on the answer, nobody had answered yet. I guess there is an advantage to brevity. Don't forget about the index on array_length(relation,1), as I believe it's an important part of the solution if I'm understanding the problem correctly. Good Luck

The schema you propose doesn't make a lot of sense. You've got an N:M relationship between things you call entities (rather confusing, since that term is commonly used for any data structure represented by a single table in a relational database). I assume that something got lost in the re-telling and you actually have three tables:

keywords {id, keyword}
entities {id, ....}
entity_words {keyword_id, entity_id}

The only way to significantly improve on this schema is to denormalize the match count into the "entities" record:

UPDATE entities e
SET matches = (SELECT COUNT(DISTINCT ew.keyword_id)
   FROM entity_words ew
   WHERE ew.entity_id = e.id);

While you could also add a trigger on the keywords table to update the relevant entities record when the keyword data changes, this seems like overkill when you must already have a mechanism for creating the mapping in the first place.

1 Comment

1) I agree that the actual data schema of my domain is M:N, but for the purposes of my task it is not necessary to maintain the table with entity data. So in my description the Entities table corresponds to your entity_words, and the entities themselves are not actually needed since I can work purely at the id level. 2) Unfortunately your denormalization query does not cover my case at all, since it just stores the number of tags linked to a certain entity and does not answer the question I formulated in my task...
