I am parsing an RSS feed into JSON using PHP, with the code below.

My JSON output contains the description data from each item element, but the title and link data are not being extracted.

  • The problem is either an incorrect CDATA section somewhere, or my code is not parsing it correctly.

The XML is here:

$blog_url = 'http://www.blogdogarotinho.com/rssfeedgenerator.ashx';

$rawFeed = file_get_contents($blog_url);
$xml = simplexml_load_string($rawFeed, 'SimpleXMLElement', LIBXML_NOCDATA);

// step 2: extract the channel metadata
$articles = array();    

// step 3: extract the articles

foreach ($xml->channel->item as $item) {
    $article = array();

    $article['title'] = (string)trim($item->title);
    $article['link'] = $item->link;      
    $article['pubDate'] = $item->pubDate;
    $article['timestamp'] = strtotime($item->pubDate);
    $article['description'] = (string)trim($item->description);
    $article['isPermaLink'] = $item->guid['isPermaLink'];        

    $articles[$article['timestamp']] = $article;
}

echo json_encode($articles);
  • If I run your example, my output contains a bunch of <![CDATA[ tags. However, I'm not sure whether you are seeing the same thing. Do you want them removed? Or are you not seeing their content at all? Commented Jun 1, 2014 at 17:33
  • I am not getting anything for title and link. It gives me nothing. Commented Jun 1, 2014 at 17:43
  • I think this could be because of different php/libxml versions (I'm running 5.5.12 here), tried it on php 5.4.29 and 5.3.23 too but got the same result. What PHP version are you on? Commented Jun 1, 2014 at 17:54
  • On my localhost I am using 5.5.6, and the same on the server. After parsing the XML to JSON I get a blank value for both link and title, on localhost and on the server. I also tried downloading the XML to a file and parsing that, which gives the same result. One thing I tried was putting a <br> tag after <![CDATA[ in the XML for the title, and then I was able to parse it successfully, but that still does not solve the issue. Commented Jun 1, 2014 at 18:06
  • Note that trim() always returns a string, so the (string) in (string)trim($item->title) is doing nothing; if anything, you would need to cast its input, which would be trim((string)$item->title), although it will probably do that implicitly anyway. You should however cast your other values, e.g. $article['link'] = (string)$item->link; before passing them off to other functions. Commented Jun 1, 2014 at 18:37
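
The casting advice in the last comment can be sketched as follows. This is only an illustration: the one-item feed and its example.com URL are made up, and the variable names are hypothetical. It shows that SimpleXML properties are objects until they are cast, and that casting first gives json_encode() plain strings.

```php
<?php
// Hypothetical one-item feed used only for illustration (not the real feed).
$xml = simplexml_load_string(
    '<rss><channel><item><link>http://example.com/a</link>'
    . '<pubDate>Sun, 01 Jun 2014 17:33:00 GMT</pubDate></item></channel></rss>'
);
$item = $xml->channel->item[0];

// Without a cast, this is a SimpleXMLElement object, not a string.
$rawLink = $item->link;

// Cast each value before storing it, so later functions receive plain strings.
$article = array(
    'link'      => (string)$item->link,
    'pubDate'   => (string)$item->pubDate,
    'timestamp' => strtotime((string)$item->pubDate),
);

var_dump($rawLink instanceof SimpleXMLElement); // bool(true)
var_dump(is_string($article['link']));          // bool(true)
echo json_encode($article), "\n";
```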

1 Answer

I think you are just a victim of the browser hiding the tags. Let me explain: your input feed doesn't really have <![CDATA[ ]]> sections in it; the < and > characters are actually entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
  <channel>
    <description>Blog do Garotinho</description>
    <item>
      <description>&lt;![CDATA[&lt;br&gt;
          Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
      </description>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
    </item>

As you can see, the <title>, for example, starts with a &lt;, which turns into a < when SimpleXML returns it for your JSON data. Now, if you are looking at the printed JSON data in a browser, the browser will see the following:

"title":"<![CDATA[A bancada dos caras de pau]]>"

This will not be rendered, because the browser treats it as being inside a tag. The description seems to show up because it has a <br> tag in it at some point, which ends the first "tag", so you can see the rest of the output.

If you hit Ctrl+U you should see the output printed as expected (I used a command-line PHP script myself and did not notice this at first).
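
One way to stop the browser from swallowing the output is sketched below. It is an illustration under assumptions, not part of the original script: the sample array simply mirrors the title from the feed, and the variable names are hypothetical. Declaring a JSON content type tells the browser not to treat the response as HTML, and htmlspecialchars() is an alternative when the JSON is deliberately embedded in an HTML debug page.

```php
<?php
// Hypothetical sample value mirroring the title from the feed above.
$articles = array(
    array('title' => '<![CDATA[A bancada dos caras de pau]]>'),
);
$json = json_encode($articles);

// Option 1: serve the JSON with its own content type, so a browser
// does not interpret "<![CDATA[" as the start of an HTML tag.
if (!headers_sent()) {
    header('Content-Type: application/json; charset=utf-8');
}
echo $json, "\n";

// Option 2: when printing JSON inside an HTML page for debugging,
// escape it so the angle brackets are displayed literally.
$escaped = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');
```

With either option the title and link values become visible in the browser, including the stray CDATA markers, which confirms that the parse itself is working.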

You could get rid of these markers by stripping them after the parse with a simple preg_replace():

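// Strip <![CDATA[ at the start and ]]> at the end of a value
// (because of the m flag, this also matches at line boundaries).
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}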

This should take care of the CDATA markers if they are at the start or the end of the individual elements. You can call this inside the foreach() loop like this:

// ....
$article['title'] = clean_cdata($item->title);
// ....
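
Putting the pieces together, a self-contained sketch might look like the following. The inline feed is a cut-down stand-in for the real one (an assumption, so the example runs without network access), with the CDATA markers entity-encoded as in the raw source shown earlier; clean_cdata() is repeated from above so the script runs on its own.

```php
<?php
// Cut-down stand-in for the real feed (illustrative only); the CDATA
// markers are entity-encoded, as in the raw source shown earlier.
$rawFeed = <<<'XML'
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
      <link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
    </item>
  </channel>
</rss>
XML;

// Same helper as in the answer: strips the literal CDATA markers.
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

$xml = simplexml_load_string($rawFeed);

$articles = array();
foreach ($xml->channel->item as $item) {
    $articles[] = array(
        'title' => clean_cdata($item->title),
        'link'  => clean_cdata($item->link),
    );
}

$json = json_encode($articles);
echo $json, "\n";
```

Note that json_encode() escapes forward slashes by default, so the link appears with `\/` in the raw JSON; any JSON parser on the client side decodes it back to the plain URL.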
10 Comments

Yeah, whatever is generating that XML is definitely doing it wrong. Ampersand-encoding (&gt; etc) and CDATA are alternative escape mechanisms, but it's somehow using both at once.
Yup, I'm not sure either what the idea was behind including the <![CDATA[ tags and then entity-encoding the content with them included. The XML file itself is valid, though; it just has these pointless tags.
So how can I get clean JSON out of it? Do I need some string replacement? I am basically a Java developer and this PHP thing is driving me out of my mind. I have seen your link, but I should send clean JSON to the client, I mean without the CDATA part.
Well, I'm afraid yes, I would probably try that too. You could try to contact the source of your rss feed and ask for explanation, maybe we are missing something.
I would like to accept your answer, but I am still confused, sorry about that. It still deserves an upvote, thanks. If you could suggest something, I would appreciate it; I have spent my whole day on this CDATA issue. Thanks.