I have event data that looks like this:

 id | instance_id | value
 1  | 1           | a
 2  | 1           | ap
 3  | 1           | app
 4  | 1           | appl
 5  | 2           | b
 6  | 2           | bo
 7  | 1           | apple
 8  | 2           | boa
 9  | 2           | boat
10  | 2           | boa
11  | 1           | appl
12  | 1           | apply

Basically, each row is a user typing a new letter. They can also delete letters.

I'd like to produce a result that looks like this from the table above (let's call the table data):

 id | instance_id | value
 7  | 1           | apple
 9  | 2           | boat
12  | 1           | apply

My goal is to extract all the complete words in each instance, accounting for deletion as well - so it's not sufficient to just get the longest word or the most recently typed.

To do so, I was planning to do a regex operation like so:

select * from data
where not exists (select * from data d2 where d2.value ~ (d.value || '.'))

Effectively, I'm trying to build a dynamic regex that matches one more character than is present and is specific to the row it's matching against.

The code above doesn't seem to work. In Python, I can "compile" a regex pattern before I use it. What is the equivalent in PostgreSQL to dynamically build a pattern?

  • How do you know what a word is? I mean, why aren't "boa", "a" and "app" in your list? Commented Jul 12, 2018 at 19:52
  • Sounds more like you want to compare all values against a dictionary. I cannot imagine how regular expressions should help here (at least to do the job completely on their own). Commented Jul 12, 2018 at 20:08
  • Hey guys - actually not interested in dictionary words at all (our actual data isn't dictionary words anyway). We see two triangular patterns, and I'm interested in the "peaks", i.e. a > aa > aaa > aaaa > aaa > aa > aab has two peaks: aaaa and aab. Commented Jul 12, 2018 at 21:30
  • dbfiddle.uk/… Commented Jul 12, 2018 at 22:06
  • Abelisto, can you share that as an answer? Commented Jul 12, 2018 at 22:10

2 Answers

Try the simple LIKE operator instead of regex patterns:

SELECT * FROM data d1
WHERE NOT EXISTS (
  SELECT * FROM data d2
  WHERE d2.value LIKE d1.value ||'_%'
)

Demo: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cd064c92565639576ff456dbe0cd5f39

Creating an index on the value column should speed up the query a bit.
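
On the follow-up about dynamic patterns: as far as I know, PostgreSQL has no separate "compile" step for this. The right-hand side of ~ (or LIKE) can be any text expression, so concatenating the outer row's value already gives a per-row pattern. The query in the question most likely fails because it references an alias d that is never declared. A minimal sketch follows, under a few assumptions that are mine rather than the asker's: the table is named data as above, values contain no regex metacharacters, and rows should only be compared within the same instance_id.

```sql
-- Sketch: the regex is an ordinary text expression built per outer row.
-- Assumes values contain no regex metacharacters (., *, +, and so on).
SELECT d.*
FROM data d                                   -- declare the alias the subquery refers to
WHERE NOT EXISTS (
  SELECT 1
  FROM data d2
  WHERE d2.instance_id = d.instance_id        -- assumption: compare within one instance only
    AND d2.value ~ ('^' || d.value || '.')    -- d.value followed by at least one more character
)
ORDER BY d.id;
```

Against the sample rows this should return ids 7, 9 and 12. The ^ anchor matters because ~ matches anywhere in the string by default. If values can contain characters such as . or *, they would need escaping first, or a pattern-free comparison such as left(d2.value, length(d.value)) = d.value AND length(d2.value) > length(d.value) could be used instead.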

1 Comment

This worked for my base case, thank you! Is there a way to build a dynamic pattern though?

To find peaks in sequential data, window functions are a good choice. You just need to compare the length of each value with the previous and next ones using the lag() and lead() functions:

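with cte as (
  select
    *,
    -- A peak is strictly longer than both neighbours within its instance.
    -- No next row: compare against 0, so the last row can be a peak.
    length(value) > coalesce(length(lead(value) over w), 0) and
    -- No previous row: compare against itself, so the first row is never a peak.
    length(value) > coalesce(length(lag(value) over w), length(value)) as is_peak
  from data
  window w as (partition by instance_id order by id))
select * from cte where is_peak order by id;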
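
To see why the comparison picks out the peaks, it can help to look at the neighbouring lengths directly. The sketch below is illustrative only: it loads the question's sample rows into a temporary table (the column types are an assumption) and prints each value's length next to the previous and next lengths in the same instance.

```sql
-- Hypothetical setup: a temporary copy of the question's sample rows.
CREATE TEMP TABLE data (id int PRIMARY KEY, instance_id int, value text);

INSERT INTO data (id, instance_id, value) VALUES
  (1, 1, 'a'),    (2, 1, 'ap'),    (3, 1, 'app'),  (4, 1, 'appl'),
  (5, 2, 'b'),    (6, 2, 'bo'),    (7, 1, 'apple'), (8, 2, 'boa'),
  (9, 2, 'boat'), (10, 2, 'boa'),  (11, 1, 'appl'), (12, 1, 'apply');

-- Show each row's length beside its neighbours' lengths.
-- prev_len / next_len are NULL at the start / end of an instance.
SELECT id,
       instance_id,
       value,
       length(lag(value)  OVER w) AS prev_len,
       length(value)              AS len,
       length(lead(value) OVER w) AS next_len
FROM data
WINDOW w AS (PARTITION BY instance_id ORDER BY id)
ORDER BY instance_id, id;
```

For id 7 (apple) this gives prev_len 4, len 5 and next_len 4, which is why it counts as a peak. For id 10 (boa) prev_len is 4 and next_len is NULL, so it does not.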

