I have a MySQL column that holds long HTML documents, and I need to extract the text between the HTML tags.

Say,

<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>

I need something like this.

"My First Heading My first paragraph"

I am currently doing this in Java on an exported CSV file, using a function like the following:

// Replace every tag with a space, then collapse runs of whitespace.
public String getStringFromHtml(String html) {
    String noHtml = html.replaceAll("<[^>]*>", " ");
    return noHtml.trim().replaceAll("\\s+", " ");
}

But let's assume that I am only using MySQL Workbench (no server-side scripts) for some data analysis.

I was wondering whether MySQL offers any way to strip the HTML tags and extract the words in between. I searched Stack Overflow and Google without luck; the only advice I found is to do it in PHP, Java, or stored procedures.

Is there really no way to extract the text from HTML using plain SQL?

  • Have you checked the functions, to see if any of them would work for you? dev.mysql.com/doc/refman/5.5/en/string-functions.html Commented Oct 27, 2014 at 15:04
  • Darius, I had a chance to look at these functions. They mostly replace fixed strings; the challenge here is replacing a pattern (in my case, HTML tags) dynamically with a regex. I am not sure SQL allows replacing such patterns. Commented Oct 28, 2014 at 3:04

1 Answer

You can use the ExtractValue() function to give an XPath expression that will pick out the part you need:

mysql> SELECT html FROM mytable;
+----------------------------------------------------------------------------------------------+
| html                                                                                         |
+----------------------------------------------------------------------------------------------+
| <!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html> |
+----------------------------------------------------------------------------------------------+

mysql> SELECT ExtractValue(html, '//html/body/p[1]') AS value FROM mytable;
+---------------------+
| value               |
+---------------------+
| My first paragraph. |
+---------------------+

2 Comments

Thanks, it works, but only when we supply static, known XPaths. In my case, each row will have unpredictable XPaths. The final objective is to count the number of words in the HTML field for each row, excluding the HTML tags. Following this approach, I would need loops or recursion, which means I either need stored procedures or have to do it programmatically?
I suggest you learn more about XPath, because it's pretty powerful. I'm not sure if MySQL's function implements XPath fully, but it might do what you want. See docs.oracle.com/javase/tutorial/jaxp/xslt/xpath.html
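
For the word-count goal mentioned in the comments, here is a minimal Java sketch that reuses the regex approach from the question. The class name `HtmlWordCount` and the methods `stripTags` and `countWords` are hypothetical names chosen for illustration, and the tag pattern is a rough heuristic rather than a real HTML parser (it does not decode entities such as `&amp;` and can be confused by a `>` inside an attribute value).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: strips tags with a regex
// and counts the whitespace-separated words that remain.
final class HtmlWordCount {
    // Anything from '<' up to the next '>' is treated as a tag (heuristic).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replace each tag with a space, collapse whitespace runs, trim the ends.
    static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    // Count words in the tag-free text; an empty result counts as zero words.
    static int countWords(String html) {
        String text = stripTags(html);
        return text.isEmpty() ? 0 : text.split(" ").length;
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><body><h1>My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(stripTags(html));  // My First Heading My first paragraph.
        System.out.println(countWords(html)); // 6
    }
}
```

Tags are replaced with a space rather than removed outright so that adjacent elements such as `</h1><p>` do not fuse their text into a single word.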
