I'm trying to find the similarity between rows from different tables. Here is the DDL:

CREATE TABLE a 
(
    id int, 
    fname text, 
    lname text, 
    email text, 
    phone text
);

INSERT INTO a 
VALUES (1, 'john', 'doe', '[email protected]', null), 
       (2, 'peter', 'green', '[email protected]', null);

CREATE TABLE b 
(
    id int, 
    fname text, 
    lname text, 
    email text, 
    phone text
);

INSERT INTO b 
VALUES (null, 'peter', 'glover', '[email protected]', '777'),
       (null, null, 'green', '[email protected]', '666');

Let's say we have the following similarity configuration:

fname = 0.1
lname = 0.3
email = 0.5
phone = 0.5

With these weights, the similarity between

(2, 'peter', 'green', '[email protected]', null) and
(null, null, 'green', '[email protected]', '666') is 0.8 (lname + email)

(2, 'peter', 'green', '[email protected]', null) and
(null, 'peter', 'glover', '[email protected]', '777') is 0.1 (fname)

As a result, I expect to get the rows from table b whose similarity to a row in table a is more than some threshold (let's say 0.7). So for this example, I need to get something like this:

id, fname, lname, email, phone, similarity
2,  null,'green', '[email protected]', '666', 0.8

where id is the id of the similar row from table a.

I have already tried NATURAL FULL OUTER JOIN and EXCEPT, but they do not work for my purpose, or I just did something wrong.

Also, what kind of index would suit this query? Table a could have a billion rows.

Update

The goal is to match rows. So would it be better to store all the data in one table and use a window function? The logic would be the same, relying on the similarity configuration.

id | fname | lname  |      email      | phone 
---+-------+--------+-----------------+-------
 1 | john  | doe    | [email protected]  | 
 2 | peter | green  | [email protected] |
   | peter | glover | [email protected]   | 777
   |       | green  | [email protected] | 666 

After some operation, each row whose id is null should be filled with the id of the row that has the highest similarity to it, provided that similarity is more than 0.7; otherwise a new id should be generated.

1 Answer

-- compute the similarity between every row of b and every row of a
with b_keyed as (
    -- b.id is null in the sample data, so number the rows to get a key to partition by
    select row_number() over () as tmp_id, fname, lname, email, phone
    from b
),
with_similarity as (
    select
        a.id, b.tmp_id, b.fname, b.lname, b.email, b.phone,
        ( coalesce((a.fname = b.fname)::int, 0) * 0.1 +
          coalesce((a.lname = b.lname)::int, 0) * 0.3 +
          coalesce((a.email = b.email)::int, 0) * 0.5 +
          coalesce((a.phone = b.phone)::int, 0) * 0.5
        ) as similarity
    from b_keyed b
    cross join a
),
-- now that every pair has a weight, rank the candidates for each row of b
matched as (
    select *,
           row_number() over (partition by tmp_id order by similarity desc) as rk
    from with_similarity
)
-- pick the best match for each row of b; rows below the threshold keep a null id
select id, fname, lname, email, phone from matched where rk = 1 and similarity >= 0.7
union all
select null::int, fname, lname, email, phone from matched where rk = 1 and similarity < 0.7;
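
On the index question: the cross join compares every row of b with every row of a, which is unlikely to scale to a billion rows. With the weights from the question, fname and lname together contribute at most 0.4, so any pair that reaches 0.7 has to match on email or phone. Below is a minimal sketch that restricts the candidates to those pairs, assuming plain B-tree indexes are acceptable. The index names a_email_idx and a_phone_idx and the b_keyed CTE are made up for illustration.

```sql
-- Hypothetical index names; plain B-tree indexes on the two high-weight columns.
create index if not exists a_email_idx on a (email);
create index if not exists a_phone_idx on a (phone);

with b_keyed as (
    -- b.id is null, so number the rows to tell them apart
    select row_number() over () as tmp_id, fname, lname, email, phone
    from b
),
scored as (
    select b.tmp_id, a.id,
           ( coalesce((a.fname = b.fname)::int, 0) * 0.1 +
             coalesce((a.lname = b.lname)::int, 0) * 0.3 +
             coalesce((a.email = b.email)::int, 0) * 0.5 +
             coalesce((a.phone = b.phone)::int, 0) * 0.5
           ) as similarity
    from b_keyed b
    -- fname + lname alone give at most 0.4, so pairs excluded here can never reach 0.7
    join a on a.email = b.email or a.phone = b.phone
)
-- best candidate per row of b, kept only if it reaches the threshold
select distinct on (tmp_id) tmp_id, id, similarity
from scored
where similarity >= 0.7
order by tmp_id, similarity desc;
```

Whether the planner actually uses the indexes for the OR condition should be confirmed with EXPLAIN on real data. If it does not, splitting the OR into two joins combined with UNION is a common workaround. Note that the pruning argument depends on the weights: if the configuration changes so that fname and lname alone can reach the threshold, this filter would drop valid matches.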