Extract HTML text using MySQL

Question

I have a field in MySQL that holds a long HTML document, and I need to extract the words between the HTML tags.

For example, given:

<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>

I need something like this:

"My First Heading My first paragraph"

I am currently doing this in Java on an exported CSV file, using a function like the following:

public String getStringFromHtml(String html) {
    // Replace every tag with a space so adjacent words do not run together.
    String noHtml = html.replaceAll("<[^>]*>", " ");
    // Trim the ends and collapse runs of whitespace into single spaces.
    return noHtml.trim().replaceAll("\\s+", " ");
}

But let's assume that I am using only MySQL Workbench (no server-side scripts) for some data analysis.

I was wondering whether MySQL offers any way to strip the HTML tags and extract the words in between. I searched Stack Overflow and Google without luck; the only advice I found was to do it in PHP, Java, or stored procedures.

Is there really no way to extract HTML text using SQL alone?

Have you checked the string functions to see if any of them would work for you? dev.mysql.com/doc/refman/5.5/en/string-functions.html
– Darius X., Commented Oct 27, 2014 at 15:04
Darius, I had a chance to look at these functions. The challenge here is replacing a pattern of strings (in my case, HTML tags) dynamically using a regex, and I am not sure SQL allows replacing such patterns.
– Logan, Commented Oct 28, 2014 at 3:04

Bill Karwin · Accepted Answer · 2014-10-28 03:06:48Z

You can pass an XPath expression to the ExtractValue() function to pick out the part you need:

mysql> SELECT html FROM mytable;
+----------------------------------------------------------------------------------------------+
| html                                                                                         |
+----------------------------------------------------------------------------------------------+
| <!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html> |
+----------------------------------------------------------------------------------------------+

mysql> SELECT ExtractValue(html, '//html/body/p[1]') AS value FROM mytable;
+---------------------+
| value               |
+---------------------+
| My first paragraph. |
+---------------------+
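
When the tag names are known but more than one is wanted, MySQL's XPath support accepts the | union operator, so a single call can match several locators. The following is a hedged sketch that reuses the mytable and html names from above; the MySQL manual describes multiple matches as being returned as one space-delimited string.

```sql
-- Match every <h1> and every <p> anywhere in the document.
-- Table and column names (mytable, html) are taken from the example above.
SELECT ExtractValue(html, '//h1|//p') AS value FROM mytable;
```

This still requires listing the tag names in advance, so it does not help when the markup is unpredictable.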


2 Comments

Logan Over a year ago

Thanks, it works, but we can only extract text when we supply static, known XPaths. In my case, each row will have unpredictable XPaths. The final objective is to count the number of words in the HTML field for each row, excluding HTML tags. Following this approach, I would need loops or recursion, which means I need either stored procedures or a programmatic solution?

Bill Karwin Over a year ago

I suggest you learn more about XPath, because it's pretty powerful. I'm not sure if MySQL's function implements XPath fully, but it might do what you want. See docs.oracle.com/javase/tutorial/jaxp/xslt/xpath.html
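
For the word-count goal described in the comment above, one possible route that avoids XPath entirely is to mirror the Java regex from the question in SQL. This is only a sketch under stated assumptions: it needs MySQL 8.0 or later, where REGEXP_REPLACE() exists (the 5.x versions current when this thread was written do not have it), and it reuses the mytable and html names from the answer. The aliases plain_text and word_count are made up for illustration.

```sql
-- Assumes MySQL 8.0+ (REGEXP_REPLACE is not available in 5.x).
-- mytable/html come from the answer; plain_text and word_count are
-- illustrative aliases.
SELECT
  t.plain_text,
  CASE
    WHEN t.plain_text = '' THEN 0
    -- Words are separated by single spaces, so words = spaces + 1.
    ELSE CHAR_LENGTH(t.plain_text)
         - CHAR_LENGTH(REPLACE(t.plain_text, ' ', '')) + 1
  END AS word_count
FROM (
  SELECT
    -- Inner call: replace each tag with a space.
    -- Outer call: collapse runs of whitespace into one space, then trim.
    TRIM(
      REGEXP_REPLACE(
        REGEXP_REPLACE(html, '<[^>]*>', ' '),
        '[[:space:]]+', ' '
      )
    ) AS plain_text
  FROM mytable
) AS t;
```

Like the Java version, this is pattern matching rather than real HTML parsing: entities such as &nbsp; are not decoded, and the contents of script or style elements are counted as text.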
