I'm trying to compute the similarity between rows from two different tables. Here is the DDL:
CREATE TABLE a
(
    id    int,
    fname text,
    lname text,
    email text,
    phone text
);

INSERT INTO a
VALUES (1, 'john', 'doe', '[email protected]', null),
       (2, 'peter', 'green', '[email protected]', null);

CREATE TABLE b
(
    id    int,
    fname text,
    lname text,
    email text,
    phone text
);

INSERT INTO b
VALUES (null, 'peter', 'glover', '[email protected]', '777'),
       (null, null, 'green', '[email protected]', '666');
Let's say we have the following similarity configuration, where each column contributes its weight when the values in both rows are equal:
fname = 0.1
lname = 0.3
email = 0.5
phone = 0.5
So the similarity between
(2, 'peter', 'green', '[email protected]', null) and
(null, null, 'green', '[email protected]', '666') is 0.8 (lname + email),
while the similarity between
(2, 'peter', 'green', '[email protected]', null) and
(null, 'peter', 'glover', '[email protected]', '777') is 0.1 (fname only).
As a result, I expect to get the rows from table b whose similarity to a row in table a exceeds some threshold (let's say 0.7). So, for the example above, I need to get something like this:
id, fname, lname, email, phone, similarity
2, null, 'green', '[email protected]', '666', 0.8
where id is the id of the similar row in table a.
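For reference, the scoring rule and threshold above can be sketched as the naive query below. It assumes PostgreSQL, hard-codes the weights from the configuration, and treats a NULL on either side as a non-match (in SQL, a comparison with NULL is never true). The alias scored is just a placeholder name. It compares every row of a with every row of b, so it only illustrates the logic and would not scale to a billion rows.

```sql
-- Naive sketch: score every (a, b) pair, then keep pairs above the threshold.
SELECT *
FROM (
    SELECT a.id,
           b.fname, b.lname, b.email, b.phone,
           -- Each matching column adds its weight; NULLs never match.
           CASE WHEN a.fname = b.fname THEN 0.1 ELSE 0 END
         + CASE WHEN a.lname = b.lname THEN 0.3 ELSE 0 END
         + CASE WHEN a.email = b.email THEN 0.5 ELSE 0 END
         + CASE WHEN a.phone = b.phone THEN 0.5 ELSE 0 END AS similarity
    FROM a
    CROSS JOIN b
) AS scored
WHERE similarity > 0.7;
```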
I have already tried NATURAL FULL OUTER JOIN and EXCEPT, but neither works for my purpose, or I just did something wrong.
Also, what kind of index would suit such a query? Table a could have a billion rows.
Update
The goal is to match rows. So would it be better to store all the data in one table and use a window function? The logic would be the same, relying on the similarity configuration:
id | fname | lname  | email             | phone
---+-------+--------+-------------------+------
 1 | john  | doe    | [email protected] |
 2 | peter | green  | [email protected] |
   | peter | glover | [email protected] | 777
   |       | green  | [email protected] | 666
After some operation, each row whose id is null should be filled with the id of the row that has the highest similarity to it, provided that similarity is more than 0.7; otherwise a new id should be generated.
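A rough sketch of that operation, assuming PostgreSQL and a hypothetical single table named people with the same columns. Instead of a window function it uses a LATERAL subquery with ORDER BY ... LIMIT 1 to pick the best candidate for each row without an id; the aliases n, k, scored and best are placeholders. A NULL matched_id means no candidate passed the threshold, so a new id would have to be generated for that row (how is left open here).

```sql
-- Sketch: for each row without an id, find the best-scoring row that has one.
SELECT n.fname, n.lname, n.email, n.phone,
       best.id AS matched_id,   -- NULL: nothing above the threshold
       best.similarity
FROM people AS n
LEFT JOIN LATERAL (
    SELECT scored.id, scored.similarity
    FROM (
        SELECT k.id,
               -- Same weights as the similarity configuration; NULLs never match.
               CASE WHEN k.fname = n.fname THEN 0.1 ELSE 0 END
             + CASE WHEN k.lname = n.lname THEN 0.3 ELSE 0 END
             + CASE WHEN k.email = n.email THEN 0.5 ELSE 0 END
             + CASE WHEN k.phone = n.phone THEN 0.5 ELSE 0 END AS similarity
        FROM people AS k
        WHERE k.id IS NOT NULL
    ) AS scored
    WHERE scored.similarity > 0.7
    ORDER BY scored.similarity DESC
    LIMIT 1
) AS best ON true
WHERE n.id IS NULL;
```

Like the pairwise comparison itself, this scans every identified row for each unidentified one, so it only shows the intended logic and says nothing about indexing.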