Postgres find similar rows between two tables

Question

I'm trying to find similar rows across two different tables. Here is the DDL:

CREATE TABLE a 
(
    id int, 
    fname text, 
    lname text, 
    email text, 
    phone text
);

INSERT INTO a 
VALUES (1, 'john', 'doe', '[email protected]', null), 
       (2, 'peter', 'green', '[email protected]', null);

CREATE TABLE b 
(
    id int, 
    fname text, 
    lname text, 
    email text, 
    phone text
);

INSERT INTO b 
VALUES (null, 'peter', 'glover', '[email protected]', '777'),
       (null, null, 'green', '[email protected]', '666');

Let's say we have the following similarity configuration:

fname = 0.1
lname = 0.3
email = 0.5
phone = 0.5

so we can say that similarity between

(2, 'peter', 'green', '[email protected]', null) and
(null, null, 'green', '[email protected]', '666') is 0.8 (lname + email)

(2, 'peter', 'green', '[email protected]', null) and
(null, 'peter', 'glover', '[email protected]', '777') is 0.1 (fname)
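
The first comparison above can be checked with a small standalone query. This is only a sketch: the e-mail values are hypothetical placeholders (only whether they are equal matters), and a NULL on either side is treated as "no match" via coalesce.

```sql
-- Weighted similarity for one pair of rows.
-- The e-mail values are hypothetical; only their equality matters.
SELECT
    coalesce((a.fname = b.fname)::int, 0) * 0.1   -- NULL vs 'peter' -> 0
  + coalesce((a.lname = b.lname)::int, 0) * 0.3   -- 'green' = 'green' -> 0.3
  + coalesce((a.email = b.email)::int, 0) * 0.5   -- same e-mail -> 0.5
  + coalesce((a.phone = b.phone)::int, 0) * 0.5   -- NULL vs '666' -> 0
    AS similarity
FROM (VALUES (2, 'peter', 'green', 'pg@example.com', NULL::text))
         AS a (id, fname, lname, email, phone)
CROSS JOIN
     (VALUES (NULL::int, NULL::text, 'green', 'pg@example.com', '666'))
         AS b (id, fname, lname, email, phone);
```

With these inputs the query returns 0.8, i.e. lname (0.3) plus email (0.5).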

As a result, I expect to get the rows from table b whose similarity to a row in table a exceeds some threshold (let's say 0.7). So, according to the example, I need to get something like this:

id, fname, lname, email, phone, similarity
2,  null,'green', '[email protected]', '666', 0.8

where id is the id of the similar row from table a.

I have already tried NATURAL FULL OUTER JOIN and EXCEPT, but they don't work for my purpose, or I just did something wrong.

Also, what kind of index would suit this query? Table a could have a billion rows.

Update

The goal is to match rows. So would it be better to store all the info in one table and use a window function? The logic would be the same, relying on the similarity configuration:

id | fname | lname  |      email      | phone 
---+-------+--------+-----------------+-------
 1 | john  | doe    | [email protected]  | 
 2 | peter | green  | [email protected] |
   | peter | glover | [email protected]   | 777
   |       | green  | [email protected] | 666

After some operation, rows where id is null should be filled with the id of the row that has the highest similarity, provided it is more than 0.7; otherwise a new id should be generated.

kingy_pingvy · Accepted Answer · 2018-04-20 00:22:47Z


-- get the similarity between every row of b and every row of a
with with_similarity as (
    select
        a.id,
        b.id as tmp_id,
        -- b.id is null for every row, so it cannot tell b rows apart;
        -- ctid (the row's physical location) is used as the row key instead
        b.ctid as b_row,
        b.fname, b.lname, b.email, b.phone,
        ( coalesce((a.fname = b.fname)::int, 0) * 0.1 +
          coalesce((a.lname = b.lname)::int, 0) * 0.3 +
          coalesce((a.email = b.email)::int, 0) * 0.5 +
          coalesce((a.phone = b.phone)::int, 0) * 0.5
        ) as similarity
    from b
    cross join a
),
-- now that every pair has a weight, rank the candidates for each row of b
matched as (
    select *,
           row_number() over (partition by b_row order by similarity desc) as rk
    from with_similarity
)
-- pick the best match per b row: matched rows take a.id,
-- unmatched rows keep their own (null) id
select id, fname, lname, email, phone from matched where rk = 1 and similarity >= 0.7
union all
select tmp_id, fname, lname, email, phone from matched where rk = 1 and similarity < 0.7;
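
On the index part of the question: the cross join above compares every row of b with every row of a, which is unlikely to be workable if a holds a billion rows. Here is one possible sketch, assuming the weights and the 0.7 threshold from the question. Since fname and lname together contribute at most 0.4, any pair that reaches the threshold must agree on email or phone, so plain B-tree indexes on those two columns are enough to find candidates. The index names and the aliases c and m are hypothetical, and this has not been measured on real data; checking the plan with EXPLAIN would show whether the indexes are actually used.

```sql
-- Hypothetical index names; plain B-tree indexes on the two heavy-weight columns.
CREATE INDEX a_email_idx ON a (email);
CREATE INDEX a_phone_idx ON a (phone);

-- One output row per row of b: the best-matching a.id, or NULL if nothing reaches 0.7.
SELECT DISTINCT ON (b.ctid)
       m.id, b.fname, b.lname, b.email, b.phone, m.similarity
FROM b
LEFT JOIN LATERAL (
    SELECT c.id,
           coalesce((c.fname = b.fname)::int, 0) * 0.1
         + coalesce((c.lname = b.lname)::int, 0) * 0.3
         + coalesce((c.email = b.email)::int, 0) * 0.5
         + coalesce((c.phone = b.phone)::int, 0) * 0.5 AS similarity
    FROM (
        -- Candidate rows of a: only those sharing email or phone with this b row.
        -- A row matching both appears twice, which is harmless because
        -- only the single best candidate is kept below.
        SELECT * FROM a WHERE a.email = b.email
        UNION ALL
        SELECT * FROM a WHERE a.phone = b.phone
    ) AS c
) AS m ON m.similarity >= 0.7
ORDER BY b.ctid, m.similarity DESC;
```

If the weights or the threshold change so that fname and lname alone can reach the threshold, this shortcut no longer holds and the candidate search would need to cover those columns too.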
