7

I'm writing a script that takes a webpage and counts how many times elements such as a Facebook Like button are used. Since this is best done with a DOM, I decided to use PHP's DOMDocument.

The one problem I have come across, though, is with elements like Facebook's Like button:

<fb:like send="true" width="450" show_faces="true"></fb:like>

Since this element technically has a namespace of "fb", DOMDocument throws a warning saying this namespace prefix is not defined. It then proceeds to strip off the prefix, so when I get to said element, its tag is no longer fb:like, but instead, like.

Is there any way to "pre-register" a namespace? Any suggestions?

6 Answers

4

You could use tidy to spruce things up before using an xml parser on it.

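// $htmlSoup is assumed to hold the raw markup as a string
$tidy = new tidy();
$config = array(
    'output-xml'   => true,
    'input-xml'    => true,
    'add-xml-decl' => true,
);
$tidy->parseString($htmlSoup, $config);
$tidy->cleanRepair();
echo $tidy; // the tidy object casts to the repaired markup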

1

Since this was never "solved" I decided to go ahead and implement syndance's solution for anyone else who doesn't like figuring out regular expressions.

// do this before you use loadHTML()    
// store any name spaced elements so we can re-add them later
$postContent = preg_replace('/<(\w+):(\w+)/', '<\1 data-namespace="\2"' , $postContent);

// once you are done using domdocument fix things up
// re-construct any name-spaced tags
$postContent = preg_replace('/<(\w+) data-namespace="(\w+)"/', '<\1:\2 ' , $postContent);

1 Comment

This is a GREAT start, but seems to make tags get cut off after a dash. For example, gcse:searchbox-resultsonly becomes just gcse:searchbox
0

I was having the same issue and I came up with following solutions/workarounds:

There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:

  • Use another parser that accepts namespaces in HTML code. Look here for a nice and detailed list of HTML parsers. This is probably the most efficient way to do it.
  • If you want to stick with DOMDocument you basically have to pre- and postprocess the code.

    • Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags and add a custom attribute to the opening tags containing the namespace.

      <fb:like send="true" width="450" show_faces="true"></fb:like>
      

      would then result in

      <fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
      
    • Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but it will keep the attributes resulting in

      <like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
      
    • Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace and replace the attribute with the actual namespace. Don't forget to also add the namespace to the closing tags!
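
The pre- and post-processing steps above could be sketched roughly as follows. This is a minimal illustration, not a complete solution: the function names markNamespaces and restoreNamespaces are hypothetical, the regexes assume simple, well-formed tags, and the restore step assumes that no un-prefixed element shares a name with a prefixed one. The attribute is matched case-insensitively on the assumption that the HTML parser may lowercase attribute names.

```php
<?php
// Step 1 (before loadHTML): copy the prefix of each namespaced
// opening tag into an xmlNamespace attribute.
function markNamespaces(string $html): string
{
    return preg_replace(
        '/<([a-z][\w-]*):([a-z][\w-]*)/i',
        '<$1:$2 xmlNamespace="$1"',
        $html
    );
}

// Step 3 (after saveHTML): turn the attribute back into a prefix on
// the opening tags, then prefix the matching closing tags as well.
function restoreNamespaces(string $html): string
{
    $prefixes = []; // local tag name => prefix
    $html = preg_replace_callback(
        '/<([a-z][\w-]*)\s+xmlNamespace="([\w-]+)"/i',
        function (array $m) use (&$prefixes): string {
            $prefixes[$m[1]] = $m[2];
            return '<' . $m[2] . ':' . $m[1];
        },
        $html
    );
    foreach ($prefixes as $name => $prefix) {
        $html = preg_replace(
            '/<\/' . preg_quote($name, '/') . '\s*>/i',
            '</' . $prefix . ':' . $name . '>',
            $html
        );
    }
    return $html;
}
```

Step 2, the DOMDocument work itself, sits between the two calls: pass the output of markNamespaces() to loadHTML(), and feed whatever saveHTML() returns to restoreNamespaces().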

I don't think OP is still looking for an answer, I'm just posting this for anybody that finds this post in their research.

1 Comment

This sounded like a very straightforward solution, so I decided to run with it. Here is the code I ended up with, for anyone who hates regex:

// store any namespaced elements so we can re-add them later
$postContent = preg_replace('/<(\w+):(\w+)/', '<\1 namespace="\2"' , $postContent);
// reconstruct any namespaced tags
$postContent = preg_replace('/<(\w+) namespace="(\w+)"/', '<\1:\2 ' , $postContent);
0

Is this what you are looking for?

You could try SimpleHTMLDOM. You can then run something like...

$html = new simple_html_dom();
$html->load_file('fileToParse.html');
$count=0;
foreach($html->find('fb:like') as $element){
    $count+=1;
}
echo $count;

That should work.

I looked a bit further and found this. I took it from the DOMDocument documentation on PHP.net.

$dom = new DOMDocument;
$dom->loadHTMLFile('fileToParse.html'); // or $dom->load('fileToParse.html') for XML
$likes = $dom->getElementsByTagName('fb:like');
$count=0;
foreach ($likes as $like) {
    $count+=1;
}

After this one I am stuck

$file=file_get_contents("other.html");
$search = '/<fb:like[^>]*>/';
$count  = preg_match_all($search , $file, $matches);
echo $count;
//Below is not needed
print_r($matches);

That, however, uses a regex and is quite slow. I tried:

$dom = new DOMDocument;
$dom->load("other.html");
$xpath = new DOMXPath($dom);
$rootNamespace = $dom->lookupNamespaceUri($dom->namespaceURI); 
$xpath->registerNamespace('fb', $rootNamespace); 
$elementList = $xpath->query('//fb:like'); 

But got the same error as you.

6 Comments

I was using this before, but I wanted to use a native solution for sake of speed. I may have to default back to this though :(
@Obto I use this on my small sites so I have no problems with speed.
I have updated this for another solution that should be quicker.
Sadly, that doesn't work. The fb namespace prefix is stripped out while parsing the HTML, so this search will find nothing; you'd have to search for "like" instead.
Thought of doing that, but the pages don't parse at all heh. DOMDocument's loadHTML apparently has a LOT of html info built in.
0

I haven't been able to find a way to do it with DOM. I'm surprised the regex is slower than DOMDocument, as that's usually not the case for me. strpos should be the fastest, though:

strpos($html, '<fb:like'); // $html is the page source as a string

This only finds the first occurrence, but you can write a simple loop or recursive function that advances the offset appropriately.
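
A minimal sketch of that idea, using a loop instead of recursion. The function name countOccurrences and the sample markup are made up for illustration. Note that a plain substring search for '<fb:like' would also match longer tag names that start the same way, such as '<fb:like-box'.

```php
<?php
// Count how often $needle appears in $html by calling strpos()
// repeatedly and moving the offset past each match.
function countOccurrences(string $html, string $needle): int
{
    if ($needle === '') {
        return 0; // avoid looping forever on an empty needle
    }
    $count  = 0;
    $offset = 0;
    while (($pos = strpos($html, $needle, $offset)) !== false) {
        $count++;
        $offset = $pos + strlen($needle);
    }
    return $count;
}

// Hypothetical sample input
$html = '<div><fb:like send="true"></fb:like><fb:like></fb:like></div>';

echo countOccurrences($html, '<fb:like'), "\n"; // prints 2
echo substr_count($html, '<fb:like'), "\n";     // built-in equivalent, prints 2
```

For plain counting, the built-in substr_count() does the same job in one call; the loop is mainly useful if you also need the position of each match.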

-1

I tried the regex solution, but there's a problem with the closing tags: they cannot carry attributes!

<ns namespace="node">text</ns>

(On top of that, the regex didn't look for closing tags.) So in the end I did some UGLY stuff like

$output = preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3' , $output);

and

$output = preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3' , $output);
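
For reference, the same marker trick can be wrapped in two small functions and extended to tag names containing dashes (the case raised in a comment on another answer). This is only a sketch: the names NS_MARKER, hideNamespaces and showNamespaces are hypothetical, and it assumes the marker string never occurs in a real tag name.

```php
<?php
// Marker string taken from the answer above; it replaces the colon.
const NS_MARKER = 'thistaghasanamespace';

// Before parsing: <fb:like> becomes <fbthistaghasanamespacelike>,
// for both opening and closing tags.
function hideNamespaces(string $html): string
{
    return preg_replace(
        '/<(\/?)([a-z][\w-]*):([a-z][\w-]*)/i',
        '<${1}${2}' . NS_MARKER . '${3}',
        $html
    );
}

// After serializing: put the colon back where the marker was.
function showNamespaces(string $html): string
{
    return preg_replace(
        '/<(\/?)([a-z][\w-]*?)' . NS_MARKER . '([a-z][\w-]*)/i',
        '<${1}${2}:${3}',
        $html
    );
}
```

Because the marker becomes part of the tag name, closing tags need no special handling, which is the main advantage over the attribute-based approach.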
