I have a table with a sentence ID and a bag of words (tokenizedsentence::varchar[]):

sID | TokenizedSentence 
1   | {0, 0, 0, 0, 1, 1, 0, 0, 1, 0}
2   | {1, 1, 0, 0, 1, 1, 1, 1, 1, 1}
3   | {0, 1, 1, 0, 1, 0, 0, 0, 1, 1}
4   | {1, 1, 0, 1, 1, 0, 1, 0, 1, 1}
5   | {1, 0, 0, 0, 1, 1, 0, 0, 1, 0}

I want to compare sentences for similarity using their bag-of-words representation. I wrote a function, but I am missing something. The idea is to compare each array value with the corresponding value in the other array, only if the value is 1 (i.e. the word is present in the sentence), and to increase a counter on each match. After going through all the values, divide the counter by the length of the array. The function I wrote:

CREATE OR REPLACE FUNCTION bow() RETURNS float AS $$
DECLARE
    length int:= array_length(nlpdata.tokenizedsentence, 1) 
    counter int;
    result float;
BEGIN
    FROM nlpdata a, nlpdata b;
    FOR i IN 0..length LOOP
        IF tokenizedSentence[i] = 1 THEN
            IF a.tokenizedSentence[i] = b.tokenizedSentence[i] THEN
                counter := counter + 1;
            END IF;
        END IF;
    END LOOP;   
    result = counter / length
    RETURN;
END;
$$ LANGUAGE plpgsql;

Also, I have no idea how to declare "FROM nlpdata a, nlpdata b". Any ideas?

1 Answer

How to do it without a function (if I understand the task correctly):

with t(t_id, t_serie) as ( -- Test data
  values 
    (1, array[0, 0, 0, 0, 1, 1, 0, 0, 1, 0]),
    (2, array[1, 1, 0, 0, 1, 1, 1, 1, 1, 1]),
    (3, array[0, 1, 1, 0, 1, 0, 0, 0, 1, 1]),
    (4, array[1, 1, 0, 1, 1, 0, 1, 0, 1, 1]),
    (5, array[1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
)

select 
  *, -- Data columns
  -- Positions of 1s in the arrays
  array_positions(t1.t_serie, 1), array_positions(t2.t_serie, 1),
  -- Intersection of the positions
  array(select unnest(array_positions(t1.t_serie, 1)) intersect select unnest(array_positions(t2.t_serie, 1))),
  -- Count of intersections / length of arrays
  cardinality(array(select unnest(array_positions(t1.t_serie, 1)) intersect select unnest(array_positions(t2.t_serie, 1))))::float / cardinality(t1.t_serie)::float
from t as t1 cross join t as t2
where
  t1.t_id <> t2.t_id
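
If a function is still preferred, here is a minimal PL/pgSQL sketch of the loop described in the question. It makes several assumptions: the two arrays are passed in as parameters (the names a and b are made up for illustration), both arrays have the same length, the values are stored as the strings '0' and '1' because the column is varchar[], and the ID column is called sid (guessed from the table header). The self-join takes the place of "FROM nlpdata a, nlpdata b" and moves out of the function into the calling query.

```sql
-- Sketch only: assumes both arrays have the same length and hold '0'/'1' strings.
CREATE OR REPLACE FUNCTION bow(a varchar[], b varchar[]) RETURNS float AS $$
DECLARE
    len     int := array_length(a, 1);
    counter int := 0;  -- must be initialized, otherwise counter + 1 stays NULL
BEGIN
    IF len IS NULL THEN
        RETURN NULL;   -- NULL or empty array: nothing to compare
    END IF;

    -- PostgreSQL arrays are 1-based by default, so loop over 1..len
    FOR i IN 1..len LOOP
        IF a[i] = '1' AND b[i] = '1' THEN
            counter := counter + 1;
        END IF;
    END LOOP;

    -- Cast before dividing to avoid integer division
    RETURN counter::float / len;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Hypothetical usage: the self-join lives in the query, not in the function.
-- The column name sid is an assumption.
SELECT a.sid, b.sid, bow(a.tokenizedsentence, b.tokenizedsentence) AS similarity
FROM nlpdata a
JOIN nlpdata b ON a.sid < b.sid;
```

With the sample data, rows 1 and 5 share 1s at positions 5, 6 and 9, so this should give 3 / 10 = 0.3 for that pair, the same as the set-based query above.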

5 Comments

Thanks, the query does what I want, now I will check if it works with my data. There are 4623 rows, so it will take some time.
@Masyaf BTW, it is possible to make the query a bit faster by removing duplicate pairs such as row_a-row_b and row_b-row_a. Just try changing t1.t_id <> t2.t_id to t1.t_id < t2.t_id
@Abelisto Thanks, I will try it.
@Masyaf Or probably even better: change from t as t1 cross join t as t2 to from t as t1 join t as t2 on (t1.t_id < t2.t_id). Good luck.
Actually I used from table_t t1, table_t t2 instead of from t as t1 cross join t as t2, but thanks.