
I have a dataset with many text comments from a social media site. I want to find all instances where at least two country names are featured in the text. What I have right now looks like:

SELECT * FROM comments WHERE body ~* '(Canada|United States|Mexico)'

This lets me find any mention of these three countries. But what if I want to find instances where at least two of these names are present?

2 Answers

1

You could check each condition independently, convert the boolean results to integers, and ensure that the sum of matches is at least 2:

 where ( 
       (body ilike '%Canada%')::int 
     + (body ilike '%United States%')::int
     + (body ilike '%Mexico%')::int
 ) >= 2

Of course, this also works with regexes, although it might be less efficient than like:

 WHERE ( 
       (body ~* 'Canada')::int 
     + (body ~* 'United States')::int
     + (body ~* 'Mexico')::int
 ) >= 2
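
As a quick check of the cast-and-sum idea, here is a minimal sketch that runs without a real table. The rows in the CTE are made-up sample data standing in for comments, and the id column is an assumption added only to make the result easy to read:

```sql
-- Hypothetical sample rows standing in for the real comments table.
WITH comments(id, body) AS (
    VALUES
        (1, 'I flew from Canada to Mexico last year.'),
        (2, 'Mexico has great food.'),
        (3, 'The united states borders canada and MEXICO.'),
        (4, 'No countries mentioned here.')
)
SELECT id, body
FROM comments
WHERE (
      (body ILIKE '%Canada%')::int          -- true -> 1, false -> 0
    + (body ILIKE '%United States%')::int
    + (body ILIKE '%Mexico%')::int
) >= 2
ORDER BY id;
-- Rows 1 and 3 qualify; rows 2 and 4 mention fewer than two names.
```

Note that both ILIKE '%...%' and ~* match substrings, so a name embedded in a longer word would also count.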

1

One method is a separate comparison for each and add up the matches:

WHERE ( (body ~* 'Canada')::int + (body ~* 'United States')::int + (body ~* 'Mexico')::int) >= 2

However, it might be better to split the text and use array functions:

WHERE string_to_array(body, ' ') @> array['Canada', 'Mexico', 'United States']

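Of course, the exact splitting logic depends on what body looks like. As written, @> requires all of the listed names to be present rather than any two, the comparison is case-sensitive, and splitting on spaces can never produce a multi-word element such as 'United States'.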
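
One way to turn the array idea into an "at least two" check is to count the distinct tokens that appear in the name list. This is only a sketch under some assumptions: the sample rows and the id column are made up, the text is lowercased and split on anything that is not a letter, and only single-word names are handled:

```sql
-- Hypothetical sample rows standing in for the real comments table.
WITH comments(id, body) AS (
    VALUES
        (1, 'I flew from Canada to Mexico last year.'),
        (2, 'Mexico has great food.'),
        (3, 'canada, mexico, and more')
)
SELECT c.id
FROM comments c
WHERE (
    -- Split the lowercased body into words and count how many
    -- distinct words are in the list of names.
    SELECT count(DISTINCT word)
    FROM regexp_split_to_table(lower(c.body), '[^a-z]+') AS word
    WHERE word = ANY (ARRAY['canada', 'mexico'])
) >= 2
ORDER BY c.id;
-- Rows 1 and 3 qualify; row 2 mentions only one name.
```

Because it compares whole tokens, this variant does not match a name embedded in a longer word, but multi-word names still need one of the pattern-based methods.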

Another fun method is a lateral join:

SELECT c.* 
FROM comments c CROSS JOIN LATERAL
     (SELECT COUNT(*) as num_matches
      FROM (VALUES ('Canada'), ('Mexico'), ('United States')) v(str)
      WHERE c.body ~* v.str  -- or: c.body ILIKE '%' || v.str || '%'
     ) x
WHERE num_matches >= 2;
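
To see what the lateral subquery computes, the following sketch drops the final filter and returns the match count for every row. The CTE rows and the id column are made-up sample data, not part of the original table:

```sql
-- Hypothetical sample rows standing in for the real comments table.
WITH comments(id, body) AS (
    VALUES
        (1, 'I flew from Canada to Mexico last year.'),
        (2, 'Mexico has great food.'),
        (3, 'The united states borders canada and MEXICO.'),
        (4, 'No countries mentioned here.')
)
SELECT c.id, x.num_matches
FROM comments c
CROSS JOIN LATERAL (
    -- An aggregate without GROUP BY always returns exactly one row,
    -- so every comment is kept, even with zero matches.
    SELECT count(*) AS num_matches
    FROM (VALUES ('Canada'), ('Mexico'), ('United States')) v(str)
    WHERE c.body ~* v.str
) x
ORDER BY c.id;
-- num_matches is 2, 1, 3, 0 for ids 1 through 4.
```

A practical advantage of this form is that the VALUES list could be replaced by a table of names, so the list can grow without editing the condition.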

1 Comment

Fantastic idea! Thanks.
