I am trying to figure out how to compare two columns in Oracle SQL, each holding a string of space-separated phrases that may appear in a different order. If both columns contain the same phrases, the row counts as a duplicate even when the order of the phrases differs. For example, given the columns below (Table1.column1 and Table1.column2), I want to produce the Duplicate? column.

Table1.column1                 Table1.column2             Duplicate?
=====================================================================
ABC DEF                        DEF ABC                    Y
ABC DEF                        GHI ABC                    N
ABCD EFGH IJKL MNOP            IJKL MNOP ABCD EFGH        Y
ABCD EFGH IJKL MNOP            IJKL QRST EFGH ABCD        N
ABC ABC DEF                    DEF ABC DEF                N 

I did a bit of research and I think I have to use either the LIKE operator or REGEXP_LIKE, but I cannot come up with a concrete approach.

Additional info:

  • Only spaces are used to separate the tokens.
  • The strings do not contain null.
  • Different counts of the same tokens are not duplicates.

Any help would be appreciated!!

  • Even regex might not completely cover this. Would it be possible for you to change your table design and include some normalization? Commented Mar 8, 2017 at 1:01
  • Haha, the purpose of coding this is to organize and delete redundant stuff in the database... Wonder if it is even possible.. Commented Mar 8, 2017 at 1:07
  • Does this have to be done in the database? Commented Mar 8, 2017 at 1:10
  • It does have to be in the database... Commented Mar 8, 2017 at 1:16
  • This can be done in the database, and even using just SQL (no need for procedures or Java etc.) With that said, a few questions: (1) are all the values sufficiently short (less than 4000 bytes) or are they CLOBs? (2) what are the "tokens" - the substrings of any characters other than space, with space as separator? Can there be other "white space" in the strings, like tabs or chr(10) for newline, etc.? (3) Can there be duplicates, and if so do they need to match in number? That is: are 'ABC ABC DEF' and 'DEF ABC DEF' equal or different? (They have the same "words" but different counts) Commented Mar 8, 2017 at 1:37

2 Answers


You can use LIKE and REGEXP_SUBSTR in the following way to get the required output:

select dummy_table.column1, dummy_table.column2,
       -- 'Y' when every token of column2 occurs in column1 and both strings have the same length
       (case when dummy_table.column1 like ('%' || dummy_table.a || '%')
              and dummy_table.column1 like ('%' || dummy_table.b || '%')
              and dummy_table.column1 like ('%' || dummy_table.c || '%')
              and dummy_table.column1 like ('%' || dummy_table.d || '%')
              and length(dummy_table.column1) = length(dummy_table.column2)
             then 'Y' else 'N' end) as Duplicate
from
  -- split column2 into up to four space-separated tokens a..d
  (select column1, column2,
          regexp_substr(column2, '[^ ]+', 1, 1) as a,
          regexp_substr(column2, '[^ ]+', 1, 2) as b,
          regexp_substr(column2, '[^ ]+', 1, 3) as c,
          regexp_substr(column2, '[^ ]+', 1, 4) as d
   from Table1) dummy_table;

Required Result:

Table1.column1                 Table1.column2             Duplicate?
=====================================================================
ABC DEF                        DEF ABC                    Y
ABC DEF                        GHI ABC                    N
ABCD EFGH IJKL MNOP            IJKL MNOP ABCD EFGH        Y
ABCD EFGH IJKL MNOP            IJKL QRST EFGH ABCD        N
ABC ABC DEF                    DEF ABC DEF                N 

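Note: The number of regexp_substr calls and LIKE conditions in the CASE expression depends on the maximum number of phrases present in any value in the table. Also be aware that LIKE matches substrings and does not count repeated tokens, so a pair such as 'ABC ABC DEF' and 'DEF ABC DEF' (same tokens, same length, different counts) passes every condition and comes out as 'Y'.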


2 Comments

In other words, if the column has up to 10 tokens we would need 10 regexp_substr and 10 like clauses.
@APC: Yes. Because each token of a phrase is compared with its own LIKE condition, the number of conditions grows with the maximum number of phrases. If the number of phrases in a row is too large, some sort of looping might do the trick.

Regex splitting works neatly on a single string. The snag is that the usual approach spawns a cartesian product when it is applied to multiple rows, i.e. when used on a table. My query nicks a clever solution from Alex Nuijten.

To break it down: the first two sub-queries tokenize the cols, the third sub-query re-aggregates them in alphabetical order, and the main query evaluates them for duplication:

with col1 as (
     select id, col1, regexp_substr(col1,'[^ ]+', 1, rn) as tkn
     from t42
      cross join (select rownum rn
            from (select max ( regexp_count(col1,' ')+1) + 1 mx from t42)
         connect by level <= mx
         )
    where regexp_substr(col1,'[^ ]+', 1, rn) is not null
   order by id
     )
    , col2 as (
     select id, col2, regexp_substr(col2,'[^ ]+', 1, rn) as tkn
     from t42
      cross join (select rownum rn
            from (select max ( regexp_count(col2,' ')+1) + 1 mx from t42)
         connect by level <= mx
         )
    where regexp_substr(col2,'[^ ]+', 1, rn) is not null
   order by id
     )
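   , ccat as (
        -- aggregate each column's tokens separately: joining col1 to col2 on id
        -- before aggregating would pair every col1 token with every col2 token,
        -- so e.g. 'A A' and 'A' would both aggregate to 'A A'
        select t42.id
               , t42.col1
               , (select listagg(c1.tkn, ' ') within group (order by c1.tkn)
                  from col1 c1 where c1.id = t42.id) as catcol1
               , t42.col2
               , (select listagg(c2.tkn, ' ') within group (order by c2.tkn)
                  from col2 c2 where c2.id = t42.id) as catcol2
        from t42 )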
select ccat.id
       , ccat.col1
       , ccat.col2
       , case when ccat.catcol1=ccat.catcol2 then 'Y' else 'N' end as duplicate
from ccat
order by ccat.id
/

I assume you have a key column (ID in my code).
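
If the table has no key column, one possible workaround is to let ROWID identify the rows. What follows is only a sketch, not something run against the asker's data. It assumes the table and column names from the question (Table1, column1, column2) and an ordinary heap table; the names src, nums, agg, rid, cat1 and cat2 are made up for illustration. It also relies on LISTAGG ignoring NULL values, which lets one row generator serve both columns:

```sql
with src as (
    -- ROWID stands in for the missing key column (assumption: ordinary heap table)
    select rowid as rid, column1, column2
    from Table1
)
, nums as (
    -- generate 1..N, where N is an upper bound on the token count of either column
    select level as rn
    from (select max(greatest(regexp_count(column1, ' '),
                              regexp_count(column2, ' '))) + 1 as mx
          from src)
    connect by level <= mx
)
, agg as (
    -- rebuild each column with its tokens in alphabetical order;
    -- regexp_substr returns NULL past the last token and LISTAGG skips NULLs
    select s.rid
         , listagg(regexp_substr(s.column1, '[^ ]+', 1, n.rn), ' ')
             within group (order by regexp_substr(s.column1, '[^ ]+', 1, n.rn)) as cat1
         , listagg(regexp_substr(s.column2, '[^ ]+', 1, n.rn), ' ')
             within group (order by regexp_substr(s.column2, '[^ ]+', 1, n.rn)) as cat2
    from src s
         cross join nums n
    group by s.rid
)
select s.column1
     , s.column2
     , case when a.cat1 = a.cat2 then 'Y' else 'N' end as duplicate
from src s
     join agg a on a.rid = s.rid
/
```

Because each column is aggregated on its own, repeated tokens keep their counts, so 'ABC ABC DEF' and 'DEF ABC DEF' should sort to different strings.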

Although this solution is more verbose than the one proposed by @shaileshyadav, it does have the advantage of scaling to any number of tokens. Given this test data ...

SQL> select * from t42
  2  /


        ID COL1                    COL2
---------- ----------------------- -----------------------
         1 ABC DEF                 DEF ABC
         2 ABC DEF                 GHI ABC
         3 ABCD EFGH IJKL MNOP     IJKL MNOP ABCD EFGH
         4 ABCD EFGH IJKL MNOP     IJKL QRST EFGH ABCD
         5 ABC ABC DEF             DEF ABC DEF
         6 AAA BBB CCC DDD EEE     AAA BBB CCC DDD
         7 AAA BBB CCC DDD EEE     AAA BBB CCC DDD EEE
         8 XXX YYYY ZZZ AAA BBB    AAA BBB XXX ZZZ YYYY
         9 A B C D E F G H I J K L L K J I H G F E D C B A
        10 AA BB CC DD EE          AA BB CC DD FF

10 rows selected.


SQL> 

... the query output is :

        ID COL1                    COL2                    D
---------- ----------------------- ----------------------- -
         1 ABC DEF                 DEF ABC                 Y
         2 ABC DEF                 GHI ABC                 N
         3 ABCD EFGH IJKL MNOP     IJKL MNOP ABCD EFGH     Y
         4 ABCD EFGH IJKL MNOP     IJKL QRST EFGH ABCD     N
         5 ABC ABC DEF             DEF ABC DEF             N
         6 AAA BBB CCC DDD EEE     AAA BBB CCC DDD         N
         7 AAA BBB CCC DDD EEE     AAA BBB CCC DDD EEE     Y
         8 XXX YYYY ZZZ AAA BBB    AAA BBB XXX ZZZ YYYY    Y
         9 A B C D E F G H I J K L L K J I H G F E D C B A Y
        10 AA BB CC DD EE          AA BB CC DD FF          N

10 rows selected.

SQL> 
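
To try the query yourself, the test table shown above can be recreated with a script along these lines. The rows are copied from the listing; the column types and sizes are my assumption, since the original DDL is not shown:

```sql
-- assumed DDL: only the column names and the data come from the listing above
create table t42 (
    id   number primary key,
    col1 varchar2(100),
    col2 varchar2(100)
);

insert into t42 values (1,  'ABC DEF',                 'DEF ABC');
insert into t42 values (2,  'ABC DEF',                 'GHI ABC');
insert into t42 values (3,  'ABCD EFGH IJKL MNOP',     'IJKL MNOP ABCD EFGH');
insert into t42 values (4,  'ABCD EFGH IJKL MNOP',     'IJKL QRST EFGH ABCD');
insert into t42 values (5,  'ABC ABC DEF',             'DEF ABC DEF');
insert into t42 values (6,  'AAA BBB CCC DDD EEE',     'AAA BBB CCC DDD');
insert into t42 values (7,  'AAA BBB CCC DDD EEE',     'AAA BBB CCC DDD EEE');
insert into t42 values (8,  'XXX YYYY ZZZ AAA BBB',    'AAA BBB XXX ZZZ YYYY');
insert into t42 values (9,  'A B C D E F G H I J K L', 'L K J I H G F E D C B A');
insert into t42 values (10, 'AA BB CC DD EE',          'AA BB CC DD FF');

commit;
```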
