How do I extract data between two strings based on a pattern in Oracle SQL

Question

I want to extract data from a column of type CLOB in Oracle SQL based on a specific pattern. I tried several approaches with regular expressions, but nothing has worked so far.

Below is an example of what the data looks like, along with the expected output. Sample Data:

I need to extract the part of the CLOB column that starts at the word LIST and ends one word before the . (dot). PS: the CLOB can contain CR LF (carriage return / line feed) within the pattern. Expected Output:

What you want to extract doesn't conform with the value of col_b. If you want to extract before the dot, then [Location=BLAH] and [SOC] should also be in the result set.
– Barbaros Özhan, Commented Feb 14, 2021 at 15:16
The input strings include newline characters (as you said yourself). Those newlines disappeared in your output. Why is that? Is that part of the requirement? You didn't mention it; if it is part of the requirement, you should say so.
– user5683823, Commented Feb 14, 2021 at 16:31
Also, please post the sample data as plain text (not image) - we can't use what you posted for testing our solutions.
– user5683823, Commented Feb 14, 2021 at 16:36

user5683823 · Accepted Answer · 2021-02-14 20:39:07Z

Here is how I would do this. Note a couple of things:

The output preserves newlines that existed in the input. You didn't say anything about removing them; however, your output doesn't show them. In any case - they can be removed, if needed, but that is an unrelated process.
You say "word" but obviously you are using that in a sense different from the common usage in regular expressions. In regexp, "word characters" are only letters, digits and underscore; yet your "words" include brackets, equal sign, and who knows what else. I interpreted the term "word" to mean any sequence of consecutive non-whitespace characters.
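
If the newlines do have to be removed, one possible way is to wrap the extracted value in nested REPLACE calls. This is only a sketch under that assumption (the question never states the requirement), and the string literal below is an illustrative stand-in for the extracted text, not data from the question:

```sql
-- Sketch: strip carriage returns (CHR(13)) and line feeds (CHR(10)).
-- REPLACE with two arguments deletes every occurrence of the search string.
-- The literal is a hypothetical stand-in for an extracted value.
select replace(
         replace('LIST:[ABC][DEF][GHI]' || chr(13) || chr(10) || '[LMNO][PQRST]',
                 chr(13)),
         chr(10)) as flattened
from   dual;
```

Here the two pieces are joined with nothing between them; if a space is wanted where the line break was, pass ' ' as the third argument of the inner or outer REPLACE instead.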

Here is how we can recreate your data. When you ask a question here, this is how you should provide sample data - not as an image that we can't copy and paste in an SQL editor.

CREATE TABLE sample_data( col_a varchar2(20), col_b CLOB );

INSERT INTO sample_data VALUES
('12345', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[Location=BLAH].[City=BLAH]'));

INSERT INTO sample_data VALUES
('12346', to_clob(
'Created:2/28/2019
Updated:1/19/2021
LIST:[ABC][DEF][GHI]
[LMNO][PQRST]
[SOC].[RAW]'));

commit;

Then here is the query and the output. Note that, depending on your interface (in my case: SQL Developer, which uses a SQL*Plus-like interface), you may need to change some settings so that the output is not truncated. In particular, in SQL*Plus, CLOB columns are truncated to 80 characters by default; I had to

set long 100

So - query and output:

select col_a, col_b,
       regexp_substr(col_b, '(\s|^)(LIST:[^.]*?)\s+\S+\.', 1, 1, null, 2)
         as result
from   sample_data
;

COL_A COL_B                          RESULT                        
----- ------------------------------ ------------------------------
12345 Created:2/28/2019              LIST:[ABC][DEF][GHI]          
      Updated:1/19/2021              [LMNO][PQRST]                 
      LIST:[ABC][DEF][GHI]                                         
      [LMNO][PQRST]                                                
      [Location=BLAH].[City=BLAH]                                  

12346 Created:2/28/2019              LIST:[ABC][DEF][GHI]          
      Updated:1/19/2021              [LMNO][PQRST]                 
      LIST:[ABC][DEF][GHI]                                         
      [LMNO][PQRST]                                                
      [SOC].[RAW]

The regular expression matches a single whitespace character or the beginning of the string ((\s|^)), then the characters LIST: followed by as few consecutive, non-period characters (this will match spaces and newline characters, in particular) as needed to allow a match - which continues with one or more whitespace characters, followed by a single word (string of 1 or more non-whitespace characters) and a literal period (\.).

The expression we must return is enclosed in parentheses, so that we can return it from regexp_substr. Such an expression is called a "capture group". The regexp includes another capture group, (\s|^), out of necessity (alternation), so the capture group we must return is the second in the regexp. This is what the last argument to regexp_substr does: it instructs the function to return the second capture group.
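
To see the effect of that last argument in isolation, here is a small sketch that applies the same pattern to a short string literal (made up for illustration, not taken from the question) and asks for each capture group in turn:

```sql
-- Same pattern as above, applied to a hypothetical one-line literal.
-- The sixth argument selects which capture group regexp_substr returns.
select regexp_substr('x LIST:[A] [B] [C].[D]',
                     '(\s|^)(LIST:[^.]*?)\s+\S+\.', 1, 1, null, 1) as group_1,
       regexp_substr('x LIST:[A] [B] [C].[D]',
                     '(\s|^)(LIST:[^.]*?)\s+\S+\.', 1, 1, null, 2) as group_2
from   dual;
```

group_1 should come back as the single space that precedes LIST:, and group_2 as LIST:[A] [B], that is, everything from LIST: up to (but not including) the whitespace before the last word ahead of the period.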

Note a peculiar thing about the period (related to the much more general concept of escaping within bracket expressions): the period must be escaped to represent a literal period, rather than "any character", at the end of the regular expression; however, within the (negated) bracket expression [^.]*?, the period - representing a literal period, not "any character" - is not escaped. Oracle follows the ERE (extended regular expressions) dialect of the POSIX standard, and that standard says the backslash loses its special meaning within bracket expressions, so escape sequences are not recognized there. This is different from other regular expression dialects, and it confuses a lot of users.
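
A quick way to check this behavior is to run both spellings against a string that contains a backslash. This is a sketch with a made-up literal; the column aliases are illustrative names only:

```sql
-- Inside a bracket expression the backslash is an ordinary character, so
-- [^.]  excludes only the period, while
-- [^\.] excludes both the backslash and the period.
select regexp_substr('a\b.c', '[^.]+')  as period_only,
       regexp_substr('a\b.c', '[^\.]+') as backslash_and_period
from   dual;
```

The first column should contain a\b (the match stops at the period), while the second should contain just a, because the match already stops at the backslash. With data that has no backslashes the two patterns behave the same, which is why the extra backslash often goes unnoticed.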

Barbaros Özhan · Accepted Answer · 2021-02-14 15:45:16Z

1

One option would be to use nested REPLACE() calls to remove line feeds (CHR(10)) and carriage returns (CHR(13)), then apply REGEXP_REPLACE() to the result in order to extract the substring after LIST: up to the dot, such as

SELECT col_a,
       'LIST:'||REGEXP_REPLACE(REPLACE(REPLACE(col_b,CHR(10)),CHR(13)),'(.*LIST:)(\S+)(\..*)','\2') AS result
  FROM t;

col_a    result
------   -------
12345    LIST:[ABC][DEF][GHI][LMNO][PQRST][Location=BLAH]
12346    LIST:[ABC][DEF][GHI][LMNO][PQRST][SOC]

edited Feb 14, 2021 at 15:45

answered Feb 14, 2021 at 15:37


Steve Lovell · Accepted Answer · 2021-02-14 15:07:16Z

0

There may be more efficient ways to do this, but the following seems to work:

First I replace line feeds with spaces (and drop carriage returns) using TRANSLATE, then use a regex to find everything from LIST: up to and including the first period. Then I remove the final "word" using SUBSTR and INSTR. I've used a subquery to avoid repeating the first steps.

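SELECT
  SubQuery.COL_A,
  -- Drop the final "word" and the dot: keep everything before the last space
  SUBSTR(SubQuery.WithWordAndDot, 1, INSTR(SubQuery.WithWordAndDot,' ',-1)-1) AS Result
FROM
(
SELECT
  COL_A,
  -- TRANSLATE maps CHR(10) to a space and deletes CHR(13).
  -- The period is left unescaped inside the bracket expression: [^\.] would
  -- also exclude the backslash character.
  REGEXP_SUBSTR(TRANSLATE(COL_B, CHR(10)||CHR(13), ' '),'LIST:[^.]+\.') as WithWordAndDot
FROM MyTable
 ) SubQuery
 ;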
