I have the following data:

<description>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div class="MsoNormal"&gt;&lt;i&gt;&lt;span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;"&gt;By Marina Correa&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;div class="MsoNormal"&gt;&lt;i&gt;&lt;span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;"&gt;Photography: Courtesy the architect&lt;/span&gt;&lt;span style="font-family: Georgia, serif; font-size: 9pt;"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/i&gt;&lt;/div&gt;&lt;div class="MsoNormal"&gt;&lt;br&gt;&lt;/div&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img alt="Prost Beer House in Bengaluru, India,by AH design." 
border="0" src="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" title=""&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: right;"&gt;&lt;span style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;"&gt;.&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="MsoNormal"&gt;&lt;br&gt;&lt;/div&gt;&lt;div class="MsoNormal"&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: Georgia, &amp;#39;Times New Roman&amp;#39;, serif;"&gt;Evolving from carnage of shipwrecked metal, the interiors of Prost Beer House in Bengaluru, India, make it an attention-grabbing drinking hole…&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;a href="http://inditerrain.indiaartndesign.com/2013/11/beerhouse-rock.html#more"&gt;Read more »&lt;/a&gt;&lt;img src="http://feeds.feedburner.com/~r/IndiaArtNDesign/~4/jGC75D3KB0o" height="1" width="1"/&gt;</description>

However, instead of "<" I have "&lt;" and instead of ">" I have "&gt;".

I need a regular expression to find the data that is not inside the HTML tags, i.e. the actual text and not the tag names, class names, etc.

For parsing HTML with literal "<" and ">" I found this: (?<=^|>)[^><]+?(?=<|$)

However, I don't know how to adapt it to what I need. Help is much appreciated.

4 Answers


That looks like an HTML fragment inside an XML document, more specifically inside the description of an RSS feed. If that is the case, you should parse the RSS using DOM, which decodes the entities along the way:

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);

Iterate over the items:

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {

The title of an item is just a text value, so it can be used directly:

  echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";

The description in your example contains the HTML fragment in a text node with escaped entities; I have seen other examples that use a CDATA section instead. It doesn't really matter for the outer XML document. It is text, and if you read it as text, the entities are transformed back into their respective characters.

  $description = $xpath->evaluate('string(description)', $rssItem);

So now $description contains < and > again. It can be loaded into a DOM with loadHTML() or simply cleaned up with strip_tags().

  echo 'Description: ', strip_tags($description), "\n\n";

A full example (RSS adapted from Wikipedia):

$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel> 
 <item>
  <title>Example entry</title>
  <description>Here is some &lt;b&gt;text&lt;/b&gt; containing an interesting &lt;i&gt;description&lt;/i&gt; with &lt;span class="important"&gt;html&lt;/span&gt;.</description>
 </item>
</channel>
</rss>
RSS;

$dom = new DOMDocument();
$dom->loadXML($rss);
$xpath = new DOMXPath($dom);

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
  echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
  $description = $xpath->evaluate('string(description)', $rssItem);
  echo 'Description: ', strip_tags($description), "\n\n";
}

Output:

Title: Example entry
Description: Here is some text containing an interesting description with html.
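
If you need the individual text pieces rather than one flattened string, the loadHTML() route mentioned above could look roughly like the sketch below. It assumes the decoded fragment is already in $description; the wrapping <div> and the names $html and $texts are only illustrative choices, not something the feed or the DOM extension requires.

```php
<?php
// Decoded fragment, as string(description) would return it for the sample feed.
$description = 'Here is some <b>text</b> containing an interesting '
    . '<i>description</i> with <span class="important">html</span>.';

// Real feeds often contain markup libxml complains about; collect the
// warnings internally instead of printing them.
libxml_use_internal_errors(true);

$html = new DOMDocument();
// Wrap the fragment in a <div> (an assumption of this sketch) so there is a
// known container to query. Non-ASCII content would additionally need an
// encoding hint, which is left out here.
$html->loadHTML('<div>' . $description . '</div>');
libxml_clear_errors();

$htmlXpath = new DOMXPath($html);

// Collect every text node below the wrapper, in document order.
$texts = [];
foreach ($htmlXpath->evaluate('//div//text()') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

print_r($texts);
```

Joining the pieces with implode('', $texts) gives the same string as the strip_tags() call in the full example.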

1 Comment

+1 This is the only way for profit... Two years since this post, and still so many questions with both xml and regex tags.

For decoding you can use htmlspecialchars_decode().

For more detail, please check http://php.net/manual/en/function.htmlspecialchars-decode.php
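
A minimal sketch of that approach, assuming the escaped fragment is already available as a string (the $escaped sample is made up for illustration). htmlspecialchars_decode() only turns the entities back into markup, so strip_tags() is still needed afterwards to drop the tags:

```php
<?php
// Hypothetical sample input with escaped angle brackets.
$escaped = 'Here is some &lt;b&gt;text&lt;/b&gt; with &lt;i&gt;markup&lt;/i&gt;.';

// Step 1: turn &lt; and &gt; back into < and >.
$decoded = htmlspecialchars_decode($escaped);

// Step 2: remove the now-real tags, keeping only the text.
$text = strip_tags($decoded);

echo $text, "\n"; // prints: Here is some text with markup.
```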


To quickly obtain the raw text (without tags), you can do this replacement:

$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);
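
Reading the pattern piece by piece: ~ is the delimiter, &lt; matches the escaped opening bracket literally, .*? lazily matches as little as possible, &gt; matches the escaped closing bracket, and the s modifier lets the dot match newlines too. The sketch below runs it on a made-up sample string; the whitespace cleanup in the second step is an addition of this sketch, not part of the replacement above.

```php
<?php
// Hypothetical sample input that still contains the escaped entities.
$source = 'Here is some &lt;b&gt;text&lt;/b&gt; with &lt;i&gt;markup&lt;/i&gt;.';

// Replace every escaped tag (&lt; ... &gt;) with a single space.
$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);

// Optional cleanup: collapse the runs of spaces left behind by the replacement.
$clean = trim(preg_replace('~\s+~', ' ', $result));

echo $clean, "\n"; // prints: Here is some text with markup .

// If the string has already been decoded, it contains real < and > and the
// pattern matches nothing, so the input comes back unchanged.
$alreadyDecoded = 'Here is some <b>text</b>.';
$unchanged = preg_replace('~&lt;.*?&gt;~s', ' ', $alreadyDecoded);
```

Because each tag becomes a space, a stray space can end up before punctuation, as in the output above. Replacing with an empty string instead avoids that, at the cost of gluing together words that were separated only by a tag.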

1 Comment

$source and $result are exactly the same after trying this. Can you please explain the regex?

This gives you all the texts you're seeking as an array:

preg_match_all("/(?<=&gt;)(?!&lt;).*?(?=&lt;)/", $source, $result);

See a live demo of this regex working with your sample input.
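
For reference, here is a small sketch of how that call behaves on a made-up input. The matches end up in $result[0]. The lookbehind requires a preceding &gt; and the lookahead a following &lt;, so text before the first tag or after the last tag is not captured, and without the s modifier the dot does not cross line breaks.

```php
<?php
// Hypothetical sample input with escaped tags.
$source = 'Intro &lt;b&gt;bold&lt;/b&gt; middle &lt;i&gt;italic&lt;/i&gt; outro';

// (?<=&gt;)  position right after an escaped ">"
// (?!&lt;)   not immediately followed by another escaped tag
// .*?       as few characters as possible ...
// (?=&lt;)   ... up to the next escaped "<"
preg_match_all('/(?<=&gt;)(?!&lt;).*?(?=&lt;)/', $source, $result);

// $result[0] holds the full matches; "Intro " and " outro" are missing
// because they are not enclosed between two escaped tags.
print_r($result[0]);
```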
