2

I have a comments column and the comments added to release are stored as rich text in comments column. Now i'm trying to process this data and get an human readable output. I'm providing 2 sample comment data i have in my oracle SQL db which i'm trying to process.

Example 1 :

    <html>
<body>
<div align="left"><font face="Arial Unicode MS"><span style="font-size:8pt">Display the frulog on the count values</span></font></div>
</body>
</html>

Example 2: <not implemented in this release>

i used the below query to process the html characters

Select (REGEXP_REPLACE(comments),'<.+?>') from test_table;

Note: consider values provided in Example 1 and Example 2 are passed in as column comments in the above SQL command.

the query result for Example 1 was Display the frulog on the count values which is what i'm expecting. result for Example 2 was ''. Value in Example 2 was not an html tag but it still replaced the tags. How can i make the replace statement smart.

Feel free to drop your suggestions .

2
  • Maybe this regex? It can be used as REGEX_REPLACE(comments, regex, '', 1, 0, 'in') Commented Jun 13, 2018 at 7:59
  • Seems to be a frequent question without a clear answer: stackoverflow.com/questions/240546/… Commented Jun 13, 2018 at 9:43

1 Answer 1

3

SQL Fiddle

Oracle 11g R2 Schema Setup:

CREATE TABLE comments ( value ) AS
SELECT '<html>
<body>
<div align="left">
<font face="Arial Unicode MS">
<span style="font-size:8pt">
  Display the frulog on the count values
</span>
</font>
</div>
</body>
</html>' FROM DUAL UNION ALL
SELECT '<not implemented in this release>' FROM DUAL UNION ALL
SELECT '<test a="1"
  b=''2''
  c = 3
  d
  e = ">" >test</test>' FROM DUAL;

Query 1:

SELECT value,
       REGEXP_REPLACE(
         value,
         '\s*</?\w+((\s+\w+(\s*=\s*(".*?"|''.*?''|[^''">\s]+))?)+\s*|\s*)/?>\s*',
         NULL,
         1,
         0,
         'im'
       ) AS replaced
FROM   comments

Results:

|                                    VALUE |                               REPLACED |
|------------------------------------------|----------------------------------------|
| <html>                                   | Display the frulog on the count values |
| <body>                                   |                                        |
| <div align="left">                       |                                        |
| <font face="Arial Unicode MS">           |                                        |
| <span style="font-size:8pt">             |                                        |
| Display the frulog on the count values   |                                        |
| </span>                                  |                                        |
| </font>                                  |                                        |
| </div>                                   |                                        |
| </body>                                  |                                        |
| </html>                                  |                                        |
|------------------------------------------|----------------------------------------|
| <not implemented in this release>        |                                 (null) |
|------------------------------------------|----------------------------------------|
| <test a="1"                              |                                   test | 
|   b='2'                                  |                                        |
|   c = 3                                  |                                        |
|   d                                      |                                        |
|   e = ">" >test</test>                   |                                        |

Note: <not implemented in this release> is a valid HTML custom element with tag name not and attributes implemented, in, this and release.

If you only want to replace specific HTML elements then list them at the start of the regular expression:

\s*</?(a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|command|content|data|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|element|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|head|header|hgroup|hr|html|i|iframe|image|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|multicol|nav|nextid|nobr|noembed|noframes|noscript|object|ol|optgroup|option|output|p|param|picture|plaintext|pre|progress|q|rp|rt|rtc|ruby|s|samp|script|section|select|shadow|slot|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|template|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)((\s+\w+(\s*=\s*(".*?"|''.*?''|[^''">\s]+))?)+\s*|\s*)/?>\s*
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.