
I'm trying to parse an HTML snippet, using the PHP DOM functions. I have stripped out everything apart from paragraph, span and line break tags, and now I want to retrieve all the text, along with its accompanying styles.

So, I'd like to get each piece of text, one by one, and for each one I can then go back up the tree to get the values of particular attributes (I'm only interested in some specific ones, like color etc.).

How can I do this? Or am I thinking about it the wrong way?

  • The code could be anything (well, within reason). It's coming from TinyMCE, and then I'm stripping out everything apart from spans and paragraphs. Commented Jan 24, 2011 at 13:04
  • Please show the PHP DOM code you are using on the input coming from TinyMCE. Commented Jan 24, 2011 at 13:06
  • Currently I'm not doing anything - haven't got that far yet! I'm trying to work out where to start! Commented Jan 24, 2011 at 14:08

2 Answers

Suppose you have a DOMDocument here:

$doc = new DOMDocument();
$doc->loadHTMLFile('http://stackoverflow.com/');

You can find all text nodes with a simple XPath query:

$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//text()');

The query returns a DOMNodeList, so you can iterate over all the text nodes with foreach:

foreach ($textNodes as $textNode) {
    echo $textNode->data . "\n";
}

From each text node, you can walk up the DOM tree with ->parentNode and read the attributes of its ancestor elements.
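As a rough sketch of that walk up the tree, the following collects the style attribute of every ancestor element for each text node. The sample markup and the ancestorStyles() helper are made up for illustration and are not part of the DOM extension; a real TinyMCE snippet may carry styling in other attributes as well.

```php
<?php
// Hypothetical helper: gather the style attribute of each ancestor element
// of a node, nearest ancestor first.
function ancestorStyles(DOMNode $node): array
{
    $styles = [];
    // The loop stops at the DOMDocument, which is not a DOMElement.
    for ($p = $node->parentNode; $p instanceof DOMElement; $p = $p->parentNode) {
        if ($p->hasAttribute('style')) {
            $styles[] = $p->nodeName . ': ' . $p->getAttribute('style');
        }
    }
    return $styles;
}

// Assumed input, similar in shape to the stripped-down snippet in the question.
$html = '<p style="color: red">Hello <span style="font-weight: bold">bold</span> world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$result = [];
foreach ($xpath->query('//text()') as $textNode) {
    $result[] = [
        'text'   => $textNode->data,
        'styles' => ancestorStyles($textNode),
    ];
}

print_r($result);
```

Each entry pairs a piece of text with the styles that apply to it from the innermost element outwards, so a specific property such as color could then be picked out of those strings.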

Hope that this can give you a good start.


For those who are more comfortable with CSS3 selectors, and are willing to include a single extra PHP file in their project, I would suggest PHP Simple HTML DOM Parser. The solution would look something like the following:

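// Assumes simple_html_dom.php from PHP Simple HTML DOM Parser is on the include path.
require 'simple_html_dom.php';

$html = file_get_html('http://www.example.com/');

// Select every paragraph and span element.
$ret = $html->find('p, span');
$store = array();

foreach ($ret as $element) {
    // Attributes are read as properties, e.g. ->color and ->style.
    $store[] = array($element->tag => array('text' => $element->innertext,
                                            'color' => $element->color,
                                            'style' => $element->style));
}
print_r($store);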

3 Comments

Suggested third-party alternatives to SimpleHtmlDom that actually use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
SimpleHtmlDom uses string parsing? That is something I did not know.
Have a look at its source ;)
