Splitting an HTML text into characters and HTML tags (with PHP & MySQL)

Question

I want to store an HTML text in a database, split into individual characters. Since the text is long and the process runs frequently, performance is particularly important. I therefore need an efficient way to do this in PHP without the overhead of building multiple arrays.

Of course, this is meant for simple text with a few HTML markup tags and no nested nodes. The same could apply to BBCode or something similar. I just want to be able to skip certain tags during the split.

Example:

$html='This <i>is</i> a <strong>test</strong>!';

This string should be stored in the MySQL database as:

id   character  html_tag
1    T
2    h
3    i
4    s
5    (space)
6    i          italic
7    s          italic
8    (space)
9    a
10   (space)
11   t          strong
12   e          strong
13   s          strong
14   t          strong
15   !

How can I capture the individual characters without the HTML tags themselves being split into characters?

How do you plan on storing <strong>nested <em>tags</em> </strong>? But if performance & correctness is an issue, go with xmlreader.
– Wrikken, Commented Dec 14, 2012 at 20:14
@Wrikken 1. If a practical solution is found, it can be extended to child nodes too. 2. I am talking about simple HTML text (consider even BBCode); otherwise, it would be impossible to do this with nested DIVs.
– Googlebot, Commented Dec 14, 2012 at 20:46
What you are doing is parsing HTML. You need to use an HTML parser to do this. See htmlparsing.com/php.html for examples and pointers to libraries. See also stackoverflow.com/questions/292926/…
– Andy Lester, Commented Dec 14, 2012 at 20:56
@AndyLester No, it is not parsing HTML. As I edited the question, it can apply to cases other than HTML. It is just a process for skipping some tags during the split.
– Googlebot, Commented Dec 14, 2012 at 20:59
@AndyLester Regarding the tags: yes, it is connected with MySQL, as this process could even be done with MySQL functions, since the target is to store the result in the database. It is just easier to do in PHP, but that is not a requirement.
– Googlebot, Commented Dec 14, 2012 at 21:02

maciej-ka · Accepted Answer · 2012-12-14 21:34:40Z

Parse the HTML with XMLReader, which is fast.

This code also works with nested tags: the $tags variable is a stack of the currently open tags. Here I always echo the most deeply nested tag, the last one on the stack.

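$html = 'This <i>is</i> a <strong>test</strong>!';

$reader = new XMLReader();
// wrap the fragment in a root element so it is well-formed XML
$reader->XML('<root>' . $html . '</root>');
// skip root node
$reader->read();
// stack of open tags; '' means the text is outside any tag
$tags = array('');
while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            $tags[] = $reader->name;
            break;
        case XMLReader::END_ELEMENT:
            array_pop($tags);
            break;
        default:
            // strlen() counts bytes, so this assumes single-byte characters
            $length = strlen($reader->value);
            for ($i = 0; $i < $length; $i++) {
                // your insert sql here
                echo "<br/>'" . $reader->value[$i] . "' " . end($tags);
            }
    }
}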
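
If the text may contain multi-byte UTF-8 characters or self-closing tags such as <br/>, a variant along the following lines may be safer. This is only a sketch under stated assumptions: the function name splitHtmlToRows is hypothetical, the input is assumed to be a well-formed XML fragment, and the rows are collected in an array rather than echoed so they can be inserted later.

```php
<?php
// Hypothetical helper (not part of the original answer): returns a list of
// [character, innermost tag] pairs for a well-formed HTML/XML fragment.
function splitHtmlToRows(string $html): array
{
    $reader = new XMLReader();
    // Wrap the fragment in a synthetic root so it is one well-formed document.
    $reader->XML('<root>' . $html . '</root>');
    $reader->read(); // position on <root> so it never enters the tag stack

    $tags = ['']; // '' means "text outside any tag"
    $rows = [];
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // Self-closing elements produce no END_ELEMENT event,
                // so pushing them would corrupt the stack.
                if (!$reader->isEmptyElement) {
                    $tags[] = $reader->name;
                }
                break;
            case XMLReader::END_ELEMENT:
                array_pop($tags);
                break;
            case XMLReader::TEXT:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                // The /u modifier splits on UTF-8 code points, not bytes.
                $chars = preg_split('//u', $reader->value, -1, PREG_SPLIT_NO_EMPTY);
                foreach ($chars as $char) {
                    $rows[] = [$char, end($tags)];
                }
                break;
        }
    }
    $reader->close();
    return $rows;
}

$rows = splitHtmlToRows('This <i>is</i> a <strong>test</strong>!');
```

Each entry of $rows holds one character and the name of its innermost enclosing tag (an empty string when there is none), in document order. Note that the tag names are the literal element names (i, strong), not labels such as "italic".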

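Also, because speed is crucial, consider buffering the inserts into a string and running them as a single batch. Note that CHARACTER is a reserved word in MySQL, so the column name has to be quoted with backticks:

INSERT INTO tname (`character`, html_tag) VALUES ('T', ''), ('h', '');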
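
One way to build such a batch without concatenating values into the SQL string is to generate placeholders and bind the values. The sketch below is an illustration, not the answer author's code: buildBatchInsert is a hypothetical helper, the table and column names are taken from the example above, and a PDO connection is assumed to exist elsewhere.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT statement and the flat
// list of parameters to bind to it. $rows is a list of [character, tag] pairs.
function buildBatchInsert(array $rows, string $table = 'tname'): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('No rows to insert.');
    }
    // One "(?, ?)" group per row; values are bound, never concatenated.
    $groups = implode(', ', array_fill(0, count($rows), '(?, ?)'));
    // CHARACTER is a reserved word in MySQL, hence the backticks.
    $sql = "INSERT INTO `$table` (`character`, `html_tag`) VALUES $groups";

    $params = [];
    foreach ($rows as [$char, $tag]) {
        $params[] = $char;
        $params[] = $tag;
    }
    return [$sql, $params];
}

[$sql, $params] = buildBatchInsert([['T', ''], ['h', ''], ['i', 'i']]);
// With an open PDO connection in $pdo (assumed), the batch would run as:
// $pdo->prepare($sql)->execute($params);
```

For long texts it is probably worth splitting the rows with array_chunk() and sending a few hundred rows per statement, so that a single query stays below the server's packet size limit.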

edited Dec 14, 2012 at 21:34

answered Dec 14, 2012 at 20:05
