Oracle - need to extract text between given strings

Question

For example, I need to extract everything between "Begin begin" and "End end". I tried it this way:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
  from phrases
       ;

Result: Hello, World!

However, it fails if my text contains newline characters. Any tips on how to fix this so that text containing newlines can also be extracted?

[edit] How it fails:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, 
  World!End end It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
  from phrases
       ;

Result:

stackoverflow is awesome. Begin beginHello, World!End end It has everything!

Should be:

Hello,
World!

[edit]

Another issue. Consider this sample:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTESTESTES' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

Result:

Hello,
World!End end It has everything!

So it matches the last occurrence of the end string, which is not what I want. The substring should be extracted up to the first occurrence of my end label, so the result should be:

Hello,
World!

Everything after the first occurrence of the end label should be ignored. Any ideas?

Interesting problem. I can't work out the solution but I'm watching for who does. :)
– mmmmmpie, Commented Feb 23, 2015 at 13:44
While Stephan and David Faber have a nice solution, it pays to see how others have addressed variations of the new line character in general as it pertains to regular expressions in Oracle. I find it informative to see how @APC did this here: stackoverflow.com/questions/16407135/…
– Patrick Bacon, Commented Feb 23, 2015 at 15:18

David Faber · Accepted Answer · 2015-02-24 12:28:51Z

6

I'm not that familiar with the POSIX [[:print:]] character class, but I got your query working using the wildcard character (.). You need to specify the n match parameter in REGEXP_REPLACE() so that . can match the newline character:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

I used the \1 backreference as I didn't see the need to capture the other groups from the regular expression. It might also be a good idea to use the * quantifier (instead of +) in case there is nothing preceding or following the delimiters. If you want to capture all of the groups then you can use the following:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '(.+Begin begin)(.+)(End end.+)', '\2', 1, 1, 'n')
  FROM phrases;
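
The remark about the * quantifier can be illustrated with a phrase that begins and ends directly with the delimiters. This is only a sketch; the sample phrase and the column aliases with_plus and with_star are made up for illustration. With .+ the pattern needs at least one character on each side, so nothing matches and REGEXP_REPLACE hands back the input unchanged. With .* the delimiters may sit at the very start and end of the string.

```sql
WITH phrases AS (
  -- Nothing precedes "Begin begin" and nothing follows "End end"
  SELECT 'Begin beginHello,
 World!End end' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n') AS with_plus  -- no match: input returned as-is
     , REGEXP_REPLACE(phrase, '.*Begin begin(.+)End end.*', '\1', 1, 1, 'n') AS with_star  -- match: only the captured text remains
  FROM phrases;
```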

UPDATE - FYI, I tested with [[:print:]] and it doesn't work. This is not surprising, since [[:print:]] is supposed to match printable characters only. It doesn't match anything with an ASCII value below 32 (32 being the space character), which rules out the newline character (ASCII 10). You need to use the . wildcard instead.
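
A quick way to check this behaviour on your own database is to test a lone newline, CHR(10), against each pattern. The query below is a minimal sketch; the column aliases are arbitrary names chosen for the example.

```sql
SELECT CASE WHEN REGEXP_LIKE(CHR(10), '[[:print:]]') THEN 'match' ELSE 'no match' END AS print_class    -- newline is not printable
     , CASE WHEN REGEXP_LIKE(CHR(10), '.')           THEN 'match' ELSE 'no match' END AS dot_default    -- . skips newline by default
     , CASE WHEN REGEXP_LIKE(CHR(10), '.', 'n')      THEN 'match' ELSE 'no match' END AS dot_with_n     -- 'n' lets . match newline
  FROM dual;
```

Only the last column should report a match, which is why the n match parameter is needed together with the . wildcard.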

UPDATE #2 - per update to question - I don't think a regex will work the way you want it to. Adding the lazy quantifier to (.+) has no effect and Oracle regular expressions don't have lookahead. There are a couple of things you might do, one is to use INSTR() and SUBSTR():

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT SUBSTR(phrase, str_start, str_end - str_start) FROM (
    SELECT INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
         , INSTR(phrase, 'End end') AS str_end, phrase
      FROM phrases
);
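
A slightly more defensive variant of the same idea, offered as a sketch rather than a definitive solution: INSTR() accepts a third argument giving the position where the search starts, so the end marker can be searched for only after the begin marker. The sample phrase below is altered (an extra "End end" is placed before "Begin begin") purely to show the difference; the alias extracted is arbitrary.

```sql
WITH phrases AS (
  -- Hypothetical sample: an "End end" also appears BEFORE the begin marker
  SELECT 'End end first. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT SUBSTR(phrase, str_start, str_end - str_start) AS extracted
  FROM (
    SELECT phrase
         , INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
           -- Start looking for the end marker right after the begin marker
         , INSTR(phrase, 'End end',
                 INSTR(phrase, 'Begin begin') + LENGTH('Begin begin')) AS str_end
      FROM phrases
  );
```

If the end marker is not found after the begin marker, str_end is 0, the computed length is negative, and SUBSTR() returns NULL. The case where the begin marker itself is missing is not handled here.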

Another is to combine INSTR() and SUBSTR() with a regular expression:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(SUBSTR(phrase, 1, INSTR(phrase, 'End end') + LENGTH('End end')), '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;
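
For completeness, a regex-only variant may also be worth trying. It is a sketch under two assumptions: that your Oracle version's REGEXP_SUBSTR() accepts a sixth argument selecting a subexpression (Oracle 11g or later, as far as I know), and that the lazy quantifier *? is honoured when no greedy quantifier precedes it in the pattern. Unlike the patterns above, this one has no leading .+ and does not need to consume the text around the delimiters, because REGEXP_SUBSTR() returns only the matched part.

```sql
WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT REGEXP_SUBSTR(phrase
                   , 'Begin begin(.*?)End end'  -- lazy: stop at the first End end
                   , 1     -- start at the first character
                   , 1     -- first occurrence of the pattern
                   , 'n'   -- let . match newlines
                   , 1     -- return capture group 1 only
                    ) AS extracted
  FROM phrases;
```

If the assumptions hold, this should return the text between the begin marker and the first end marker, newline included. Test it against your own data before relying on it.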

edited Feb 24, 2015 at 12:28

answered Feb 23, 2015 at 13:59

David Faber


3 Comments

Srini V Over a year ago

'n' is the key as a matter of fact to ignore new lines

user1209216 Over a year ago

I actually don't want to ignore new lines. I need to extract untouched text between known strings.

user1209216 Over a year ago

OK, I just noticed your solution works as I need; the new line is preserved in the resulting string.

Stephan · Accepted Answer · 2015-02-23 14:15:58Z

2

Try this regex:

([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)

Sample usage:

SELECT regexp_replace(
         phrase ,
         '([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)',
         '\2',
         1,  -- Start at the beginning of the phrase
         0,  -- Replace ALL occurrences
         'n' -- Let the dot metacharacter match the newline character
)
FROM
  (SELECT 'stackoverflow is awesome. Begin beginHello, '
    || chr(10)
    || ' World!End end It has everything!' AS phrase
  FROM DUAL
  )

The dot metacharacter (.) matches any character in the database character set and, when the n switch is present, also the newline character. So when regexp_replace is called, match_parameter must contain the n switch for the dot to match newlines.

edited Feb 23, 2015 at 14:15

answered Feb 23, 2015 at 13:41

Stephan


1 Comment

user1209216 Over a year ago

No change. Still entire string is returned.

user491135 · Accepted Answer · 2015-02-23 14:14:33Z

0

In order to get your second option to work you need to add [[:space:][:print:]]* as follows:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, 
  World!End end It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+[[:space:][:print:]]*)(End end[[:print:]]+)', '\2')
  from phrases
       ;

It will still break if the text contains more newlines, though. For instance, it won't work for:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, 
  World!End end 
  It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+[[:space:][:print:]]*)(End end[[:print:]]+)', '\2')
  from phrases
       ;

Then you need to add [[:space:][:print:]]* to the last group as well:

with phrases as (
  select 'stackoverflow is awesome. Begin beginHello, 
  World!End end 
  It has everything!' as phrase
    from dual
         )
select regexp_replace(phrase
     , '([[:print:]]+Begin begin)([[:print:]]+[[:space:][:print:]]*)(End end[[:print:]]+[[:space:][:print:]]*)', '\2')
  from phrases
       ;

The problem with regex is that you have to anticipate the variations and create a rule that matches all of them. If something falls outside that scope, you'll have to revisit the regex and add the new exception.

You can find extra info here.

edited Feb 23, 2015 at 14:14

answered Feb 23, 2015 at 14:09

user491135

2 Comments

user1209216 Over a year ago

I have large text and it can contain many new lines. I need to extract substring from it. So is there any universal solution to do this?

user491135 Over a year ago

Normally, adding [[:space:][:print:]]* at each place where a new line can occur helps. What that expression says is "you might find a space or new line, and letters, here"; that's what the * is for.

Dan · Accepted Answer · 2016-09-19 16:14:16Z

 Description.........: This is a function similar to the one that was available from PRIME Computers
                       back in the late 80s/90s.  This function will parse out a segment of a string
                       based on a supplied delimiter.  The delimiter can be any single character.
Usage:
     Field(i_string           => 'This.is.a.cool.function'
          ,i_delimiter        => '.'
          ,i_occurance        => 2
          ,i_return_instances => 2)

     Return value = is.a

FUNCTION field(i_string           VARCHAR2
              ,i_delimiter        VARCHAR2
              ,i_occurance        NUMBER DEFAULT 1
              ,i_return_instances NUMBER DEFAULT 1) RETURN VARCHAR2 IS
  --
  v_delimiter      VARCHAR2(1);
  n_end_pos        NUMBER;
  n_start_pos      NUMBER := 1;
  n_delimiter_pos  NUMBER;
  n_seek_pos       NUMBER := 1;
  n_tbl_index      PLS_INTEGER := 0;
  n_return_counter NUMBER := 0;
  v_return_string  VARCHAR2(32767);
  TYPE tbl_type IS TABLE OF VARCHAR2(4000) INDEX BY PLS_INTEGER;
  tbl tbl_type;
  e_no_delimiters EXCEPTION;
  v_string VARCHAR2(32767) := i_string || i_delimiter;
BEGIN
  BEGIN
    LOOP
      ----------------------------------------
      -- Search for the delimiter in the
      -- string
      ----------------------------------------
      n_delimiter_pos := instr(v_string, i_delimiter, n_seek_pos);
      --
      IF n_delimiter_pos = length(v_string) AND n_tbl_index = 0 THEN
        ------------------------------------------
        -- The delimiter you are looking for is
        -- not in this string.
        ------------------------------------------
        RAISE e_no_delimiters;
      END IF;
      --
      EXIT WHEN n_delimiter_pos = 0;
      n_start_pos := n_seek_pos;
      n_end_pos   := n_delimiter_pos - n_seek_pos;
      n_seek_pos  := n_delimiter_pos + 1;
      --
      n_tbl_index := n_tbl_index + 1;
      -----------------------------------------------
      -- Store the segments of the string in a tbl
      -----------------------------------------------
      tbl(n_tbl_index) := substr(i_string, n_start_pos, n_end_pos);
    END LOOP;
    ----------------------------------------------
    -- Prepare the results for return voyage
    ----------------------------------------------
    v_delimiter := NULL;
    FOR a IN tbl.first .. tbl.last LOOP
      IF a >= i_occurance AND n_return_counter < i_return_instances THEN
        v_return_string  := v_return_string || v_delimiter || tbl(a);
        v_delimiter      := i_delimiter;
        n_return_counter := n_return_counter + 1;
      END IF;
    END LOOP;
    --
  EXCEPTION
    WHEN e_no_delimiters THEN
      v_return_string := i_string;
  END;
  RETURN TRIM(v_return_string);
END;
