Similarity check with a Postgres function

Question

I have a table with a sentence ID and a bag of words (tokenizedsentence::varchar[]):

sID | TokenizedSentence 
1   | {0, 0, 0, 0, 1, 1, 0, 0, 1, 0}
2   | {1, 1, 0, 0, 1, 1, 1, 1, 1, 1}
3   | {0, 1, 1, 0, 1, 0, 0, 0, 1, 1}
4   | {1, 1, 0, 1, 1, 0, 1, 0, 1, 1}
5   | {1, 0, 0, 0, 1, 1, 0, 0, 1, 0}

I want to compare sentences for similarity using the bag-of-words representation. I wrote a function, but I am missing something. The idea is to compare each array value with the corresponding value in the other array, only if the value is 1 (i.e. the word is present in the sentence), and to increase a counter on a match. After going through all the values, divide the counter by the length of the array. The function I wrote:

CREATE OR REPLACE FUNCTION bow() RETURNS float AS $$
DECLARE
    length int:= array_length(nlpdata.tokenizedsentence, 1) 
    counter int;
    result float;
BEGIN
    FROM nlpdata a, nlpdata b;
    FOR i IN 0..length LOOP
        IF tokenizedSentence[i] = 1 THEN
            IF a.tokenizedSentence[i] = b.tokenizedSentence[i] THEN
                counter := counter + 1;
            END IF;
        END IF;
    END LOOP;   
    result = counter / length
    RETURN;
END;
$$ LANGUAGE plpgsql;

Also, I have no idea how to declare "FROM nlpdata a, nlpdata b". Any ideas?

Abelisto · Accepted Answer · 2016-02-11 17:43:31Z

How to do it without a function (if I understand the task correctly):

with t(t_id, t_serie) as ( -- Test data
  values 
    (1, array[0, 0, 0, 0, 1, 1, 0, 0, 1, 0]),
    (2, array[1, 1, 0, 0, 1, 1, 1, 1, 1, 1]),
    (3, array[0, 1, 1, 0, 1, 0, 0, 0, 1, 1]),
    (4, array[1, 1, 0, 1, 1, 0, 1, 0, 1, 1]),
    (5, array[1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
)

select 
  *, -- Data columns
  -- Positions of 1s in the arrays
  array_positions(t1.t_serie, 1) as pos_1, array_positions(t2.t_serie, 1) as pos_2,
  -- Intersection of the positions
  array(select unnest(array_positions(t1.t_serie, 1)) intersect select unnest(array_positions(t2.t_serie, 1))) as common_pos,
  -- Count of common positions / length of the array
  cardinality(array(select unnest(array_positions(t1.t_serie, 1)) intersect select unnest(array_positions(t2.t_serie, 1))))::float / cardinality(t1.t_serie)::float as similarity
from t as t1 cross join t as t2
where
  t1.t_id <> t2.t_id
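
For readers who still want a function, as the question asked, the following is a minimal sketch and not part of the original answer. The name bow_similarity and its two-argument signature are hypothetical. It assumes both arrays are int[] of equal length and PostgreSQL 9.4 or later (for cardinality). It also addresses the likely problems in the question's function: the counter is never initialised, the loop starts at 0 although PostgreSQL arrays are 1-based by default, the division is integer division, and RETURN carries no value.

```sql
-- Hypothetical helper: name and signature are assumptions for illustration.
-- Returns (number of positions where both arrays hold 1) / (length of a).
CREATE OR REPLACE FUNCTION bow_similarity(a int[], b int[]) RETURNS float AS $$
DECLARE
    len     int := cardinality(a);
    counter int := 0;  -- must be initialised: NULL + 1 stays NULL
BEGIN
    IF len = 0 THEN
        RETURN NULL;   -- empty array: similarity is undefined
    END IF;
    -- PostgreSQL arrays are 1-based by default, so loop 1..len, not 0..len
    FOR i IN 1..len LOOP
        IF a[i] = 1 AND b[i] = 1 THEN
            counter := counter + 1;
        END IF;
    END LOOP;
    RETURN counter::float / len;  -- cast avoids integer division
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;  -- STRICT: NULL input gives NULL

-- Usage on sample rows 1 and 5: common 1s at positions 5, 6 and 9, so 3 / 10
SELECT bow_similarity(
    ARRAY[0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    ARRAY[1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
);  -- returns 0.3
```

Against the question's table, the "FROM nlpdata a, nlpdata b" part belongs in the calling query rather than inside the function, for example: SELECT a.sid, b.sid, bow_similarity(a.tokenizedsentence::int[], b.tokenizedsentence::int[]) FROM nlpdata a JOIN nlpdata b ON a.sid < b.sid. The column name sid and the cast from varchar[] to int[] are assumptions based on the question's description. The same type issue applies to the query above: array_positions(t_serie, 1) expects an integer array, so a varchar[] column would need the cast or a search for '1' instead.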


5 Comments

Masyaf Over a year ago

Thanks, the query does what I want, now I will check if it works with my data. There are 4623 rows, so it will take some time.

Abelisto Over a year ago

@Masyaf BTW, the query can be made a bit faster by dropping mirrored pairs such as row_a-row_b and row_b-row_a. Just change t1.t_id <> t2.t_id to t1.t_id < t2.t_id.

Masyaf Over a year ago

@Abelisto Thanks, I will try it.

Abelisto Over a year ago

@Masyaf Or probably even better: change from t as t1 cross join t as t2 to from t as t1 join t as t2 on (t1.t_id < t2.t_id). Good luck.

Masyaf Over a year ago

Actually, I used from table_t t1, table_t t2 instead of from t as t1 cross join t as t2, but thanks.
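
Putting the suggestions from the comments together, the de-duplicated query might look like the sketch below. It reuses the answer's test data and column names. The output aliases id_a, id_b and similarity are added here for readability and are not from the thread.

```sql
with t(t_id, t_serie) as ( -- same test data as in the answer
  values
    (1, array[0, 0, 0, 0, 1, 1, 0, 0, 1, 0]),
    (2, array[1, 1, 0, 0, 1, 1, 1, 1, 1, 1]),
    (3, array[0, 1, 1, 0, 1, 0, 0, 0, 1, 1]),
    (4, array[1, 1, 0, 1, 1, 0, 1, 0, 1, 1]),
    (5, array[1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
)
select
  t1.t_id as id_a,
  t2.t_id as id_b,
  -- Count of positions where both arrays hold 1 / length of the array
  cardinality(array(
    select unnest(array_positions(t1.t_serie, 1))
    intersect
    select unnest(array_positions(t2.t_serie, 1))
  ))::float / cardinality(t1.t_serie) as similarity
from t as t1
join t as t2 on t1.t_id < t2.t_id  -- each unordered pair appears once
order by id_a, id_b;
```

Because the measure is symmetric when all arrays have the same length, keeping only pairs with t1.t_id < t2.t_id loses no information. For the five sample rows this yields 10 result rows instead of 20.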
