PHP regex to find text outside HTML tags when the tags are escaped as &lt; and &gt;

Question

I have the following data

<description><div dir="ltr" style="text-align: left;" trbidi="on"><div class="MsoNormal"><i><span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;">By Marina Correa</span></i></div><div class="MsoNormal"><i><span style="font-family: Georgia, Times New Roman, serif; font-size: xx-small;">Photography: Courtesy the architect</span><span style="font-family: Georgia, serif; font-size: 9pt;"><o:p></o:p></span></i></div><div class="MsoNormal"><br></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img alt="Prost Beer House in Bengaluru, India,by AH design." border="0" src="http://3.bp.blogspot.com/-D1JRy4epwOM/UooCcR-U7lI/AAAAAAAALyM/tDr2ezxnb-I/s1600/Prost_Beer_+House_AH_Design_Indiaartndesign.jpg" title=""></a></td></tr><tr><td class="tr-caption" style="text-align: right;"><span style="font-family: Arial, Helvetica, sans-serif; font-size: xx-small;">.</span></td></tr></tbody></table><div class="MsoNormal"><br></div><div class="MsoNormal"></div><div style="text-align: justify;"><span style="font-family: Georgia, &#39;Times New Roman&#39;, serif;">Evolving from carnage of shipwrecked metal, the interiors of Prost Beer House in Bengaluru, India, make it an attention-grabbing drinking hole…</span></div></div><a href="http://inditerrain.indiaartndesign.com/2013/11/beerhouse-rock.html#more">Read more »</a><img src="http://feeds.feedburner.com/~r/IndiaArtNDesign/~4/jGC75D3KB0o" height="1" width="1"/></description>

However, instead of "<" I have "&lt;" and instead of ">" I have "&gt;".

I need a regular expression to find the data that is not inside the HTML tags, i.e. the actual text and not the tag names, class names, etc.

For parsing HTML with literal "<" and ">" I found this: (?<=^|>)[^><]+?(?=<|$)

However, I don't know how to convert it to suit what I need. Help is much appreciated.

html_entity_decode() and/or htmlspecialchars_decode(), then use a DOM parser to get your data.
– Ben Fortune, commented Nov 21, 2013 at 11:12

ThW · Accepted Answer · 2013-11-26 18:35:19Z

That looks like an HTML fragment inside XML, more specifically inside the description of an RSS feed. If that is the case, you should parse the RSS using DOM; this will decode the entities along the way:

$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);

Iterate the items:

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {

The title of an item is only a text value, so it can be used directly:

  echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";

The description in your example contains the HTML fragment in a text node with escaped entities; I have seen other examples that use a CDATA section. It doesn't really matter for the outer XML document. It is text, and if you read it as text the entities get transformed back into their respective characters.

  $description = $xpath->evaluate('string(description)', $rssItem);

So now $description contains < and > again. It can be loaded into a DOM with loadHtml() or just cleaned up with strip_tags().

  echo 'Description: ', strip_tags($description), "\n\n";

A full example (RSS adapted from Wikipedia):

$rss = <<<'RSS'
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel> 
 <item>
  <title>Example entry</title>
  <description>Here is some &lt;b&gt;text&lt;/b&gt; containing an interesting &lt;i&gt;description&lt;/i&gt; with &lt;span class="important"&gt;html&lt;/span&gt;.</description>
 </item>
</channel>
</rss>
RSS;

$dom = new DOMDocument();
$dom->loadXml($rss);
$xpath = new DOMXpath($dom);

foreach ($xpath->evaluate('/rss/channel/item') as $rssItem) {
  echo 'Title: ', $xpath->evaluate('string(title)', $rssItem), "\n";
  $description = $xpath->evaluate('string(description)', $rssItem);
  echo 'Description: ', strip_tags($description), "\n\n";
}

Output:

Title: Example entry
Description: Here is some text containing an interesting description with html.
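If you need the individual text pieces rather than one stripped string, the loadHtml() route mentioned above could look roughly like this. It is only a sketch: the sample string, the wrapping div and the $texts variable are my own assumptions for illustration, not part of your feed.

```php
<?php
// Decoded description, as string(description) would return it in the example.
$description = 'Here is some <b>text</b> containing an interesting '
    . '<i>description</i> with <span class="important">html</span>.';

// Real-world fragments (e.g. Word markup like <o:p>) can trigger parser
// warnings; collect them internally instead of printing them.
libxml_use_internal_errors(true);

$html = new DOMDocument();
// Wrap the fragment in a div (an assumption of this sketch) so it has one root.
$html->loadHTML('<div>' . $description . '</div>');
libxml_clear_errors();

$xpath = new DOMXPath($html);

// Collect every text node that is not just whitespace.
$texts = [];
foreach ($xpath->evaluate('//text()[normalize-space(.) != ""]') as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

Compared with strip_tags(), this keeps the text pieces separate, so you can tell which element each one came from by looking at $node->parentNode.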

+1 This is the only way for profit... Two years since this post, and still so many questions with both xml and regex tags.

Siraj Khan · Accepted Answer · 2013-11-21 11:14:41Z

For decoding you can use htmlspecialchars_decode().

For more detail, please check http://php.net/manual/en/function.htmlspecialchars-decode.php
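A minimal sketch of how that could be combined with strip_tags(), assuming the escaped markup is already in a string. The sample string and variable names are made up for illustration; the second decoding pass is only needed if the inner HTML carries entities of its own.

```php
<?php
// Hypothetical escaped input: markup written with &lt; / &gt;, plus an
// ampersand that was escaped twice.
$source = 'Tom &amp;amp; Jerry &lt;b&gt;rock&lt;/b&gt;';

// First pass turns &lt; and &gt; back into real tag delimiters
// and &amp;amp; into &amp;.
$markup = htmlspecialchars_decode($source);

// Remove the tags, then decode whatever entities the inner HTML still has.
$text = html_entity_decode(strip_tags($markup), ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n";
```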

Casimir et Hippolyte · Accepted Answer · 2013-11-21 11:37:24Z

To quickly obtain the raw text (without tags), you can do this replacement:

$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);

Source and result are exactly the same after trying this. Can you please explain the regex?
– user2296208
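On how the pattern reads: ~ is the delimiter, &lt; matches the literal escaped opening bracket, .*? lazily matches as little as possible up to the next literal &gt;, and the s modifier lets the dot cross line breaks. It therefore only matches if the string really contains the five-character sequences &lt; and &gt;. One possible reason for identical output (an assumption, I can't see your data) is that the string has already been decoded to real < and > by the time it reaches preg_replace(). The sample strings below are invented to show both cases:

```php
<?php
// Case 1: tags are still escaped, so the pattern matches and removes them.
$source = 'By &lt;i&gt;Marina Correa&lt;/i&gt;&lt;br&gt;Photography';
$result = preg_replace('~&lt;.*?&gt;~s', ' ', $source);

// Each removed tag leaves a space behind; collapse runs of whitespace.
$clean = trim(preg_replace('~\s+~', ' ', $result));

// Case 2: the same text already decoded. There is no "&lt;" to find,
// so preg_replace() returns the input unchanged.
$decoded = 'By <i>Marina Correa</i>';
$unchanged = preg_replace('~&lt;.*?&gt;~s', ' ', $decoded);

echo $clean, "\n";
var_dump($unchanged === $decoded);
```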

Bohemian · Accepted Answer · 2014-03-27 12:07:45Z

This gives you all the texts you're seeking as an array:

preg_match_all("/(?<=&gt;)(?!&lt;).*?(?=&lt;)/", $source, $result);

See a live demo of this regex working with your sample input.
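A small runnable sketch of the call, using an invented sample string rather than the full feed. The matches end up in $result[0]. Note two limits of this pattern as written: text before the first tag or after the last one is not matched (there is no &gt; before it or no &lt; after it), and without the s modifier the dot stops at line breaks.

```php
<?php
// Hypothetical escaped input with two pieces of text between tags.
$source = '&lt;div&gt;&lt;i&gt;By Marina Correa&lt;/i&gt;&lt;/div&gt;'
    . '&lt;p&gt;Read more&lt;/p&gt;';

// (?<=&gt;)  start right after an escaped ">"
// (?!&lt;)   but not where another tag starts immediately
// .*?       take as little as possible ...
// (?=&lt;)   ... up to the next escaped "<"
preg_match_all("/(?<=&gt;)(?!&lt;).*?(?=&lt;)/", $source, $result);

// $result[0] holds the full matches.
print_r($result[0]);
```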

edited Mar 27, 2014 at 12:07

answered Nov 22, 2013 at 4:44

