PHP DOMDocument Namespaces

Question

I'm writing a script that takes a webpage and counts how many times widgets such as a Facebook Like button appear. Since this is best done with a DOM parser, I decided to use PHP's DOMDocument.

The one problem I have come across, though, is with elements like Facebook's Like button:

<fb:like send="true" width="450" show_faces="true"></fb:like>

Since this element technically has the namespace prefix "fb", DOMDocument throws a warning saying this namespace prefix is not defined. It then strips off the prefix, so when I get to the element, its tag name is no longer fb:like but just like.

Is there any way to "pre-register" a namespace? Any suggestions?

hakre · Accepted Answer · 2013-04-09 07:41:32Z

4

You could use tidy to clean up the markup before running an XML parser on it.

$tidy = new tidy();
$config = array(
    'output-xml'   => true,
    'input-xml'    => true,
    'add-xml-decl' => true,
);
// $htmlSoup holds the raw markup to clean up
$tidy->parseString($htmlSoup, $config);
$tidy->cleanRepair();
echo $tidy;

edited Apr 9, 2013 at 7:41

hakre


answered Jun 11, 2012 at 19:27

goat


Community · Accepted Answer · 2017-05-23 12:18:10Z

1

Since this was never "solved", I decided to go ahead and implement Syndace's solution for anyone else who doesn't like figuring out regular expressions.

// Do this before you call loadHTML().
// Move the part after the colon into an attribute so it survives parsing:
// <fb:like ...> becomes <fb data-namespace="like" ...>
$postContent = preg_replace('/<(\w+):(\w+)/', '<\1 data-namespace="\2"', $postContent);

// Once you are done with DOMDocument, reconstruct the namespaced tags:
// <fb data-namespace="like" ...> becomes <fb:like ...>
$postContent = preg_replace('/<(\w+) data-namespace="(\w+)"/', '<\1:\2 ', $postContent);

edited May 23, 2017 at 12:18

CommunityBot


answered Aug 12, 2015 at 19:02

lupos


1 Comment

MadtownLems Over a year ago

This is a GREAT start, but seems to make tags get cut off after a dash. For example, gcse:searchbox-resultsonly becomes just gcse:searchbox

Syndace · Accepted Answer · 2015-05-10 07:54:24Z

0

I was having the same issue and came up with the following workarounds:

There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:

Use another parser that accepts namespaces in HTML code. Look here for a nice and detailed list of HTML parsers. This is probably the most efficient way to do it.
If you want to stick with DOMDocument, you basically have to pre- and post-process the code.
- Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags, and add a custom attribute containing the namespace to each opening tag.
```
<fb:like send="true" width="450" show_faces="true"></fb:like>
```
  would then result in
```
<fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
```
- Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but it will keep the attributes resulting in
```
<like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
```
- Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace and replace the attribute with the actual namespace. Don't forget to also add the namespace to the closing tags!
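
For the original goal of counting elements, the marker-attribute idea from the steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name `countNamespacedTags` and the attribute names `data-ns-prefix` and `data-ns-local` are hypothetical, chosen only for illustration, and the regex assumes simple ASCII tag names. Because the markers are ordinary attributes, the count should not depend on whether the installed libxml keeps or strips the prefix.

```php
<?php
// Hypothetical helper: counts <prefix:local> elements in an HTML string.
function countNamespacedTags(string $html, string $prefix, string $local): int
{
    // Step 1: add marker attributes to every namespaced opening tag.
    // Closing tags start with "</" and are therefore not matched.
    $marked = preg_replace(
        '/<([A-Za-z][\w-]*):([A-Za-z][\w.-]*)/',
        '<${1}:${2} data-ns-prefix="${1}" data-ns-local="${2}"',
        $html
    );

    // Step 2: parse, silencing warnings about unknown tags.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($marked);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: count elements by their marker attributes, not by tag name.
    $count = 0;
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@data-ns-prefix]') as $element) {
        if (strcasecmp($element->getAttribute('data-ns-prefix'), $prefix) === 0
            && strcasecmp($element->getAttribute('data-ns-local'), $local) === 0) {
            $count++;
        }
    }
    return $count;
}

$html = '<p>Hello</p>'
      . '<fb:like send="true" width="450"></fb:like>'
      . '<fb:like send="false"></fb:like>'
      . '<fb:comments></fb:comments>';

echo countNamespacedTags($html, 'fb', 'like'), "\n";
```

Counting through the attributes avoids the final "restore the prefix" step entirely, which is only needed if the markup has to be written back out.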

I don't think OP is still looking for an answer, I'm just posting this for anybody that finds this post in their research.

answered May 10, 2015 at 7:54

Syndace


1 Comment

lupos Over a year ago

This sounded like a very straightforward solution, so I decided to run with it. Here is the code I ended up with, for anyone who hates regex: // store any namespaced elements so we can re-add them later $postContent = preg_replace('/<(\w+):(\w+)/', '<\1 namespace="\2"' , $postContent); // re-construct any namespaced tags $postContent = preg_replace('/<(\w+) namespace="(\w+)"/', '<\1:\2 ' , $postContent);

Jonathan · Accepted Answer · 2012-06-12 16:22:39Z

0

Is this what you are looking for?

You could try SimpleHTMLDOM. You can then run something like...

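// Requires the Simple HTML DOM library; the include path is an assumption
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load_file('fileToParse.html');
$count = 0;
foreach ($html->find('fb:like') as $element) {
    $count += 1;
}
echo $count;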

That should work.

I looked a bit further and found this. I took it from the DOMDocument documentation on PHP.net.

$dom = new DOMDocument;
// loadHTML() expects markup as a string; use loadHTMLFile() for a file path
$dom->loadHTMLFile('fileToParse.html'); // or $dom->load('fileToParse.html') for XML
$likes = $dom->getElementsByTagName('fb:like');
$count = 0;
foreach ($likes as $like) {
    $count += 1;
}

After this one I am stuck

$file=file_get_contents("other.html");
$search = '/<fb:like[^>]*>/';
$count  = preg_match_all($search , $file, $matches);
echo $count;
//Below is not needed
print_r($matches);

That, however, is regex and is quite slow. I tried:

$dom = new DOMDocument;
$dom->load("other.html");
$xpath = new DOMXPath($dom);
$rootNamespace = $dom->lookupNamespaceUri($dom->namespaceURI);
$xpath->registerNamespace('fb', $rootNamespace);
$elementList = $xpath->query('//fb:like');

But got the same error as you.

edited Jun 12, 2012 at 16:22

answered Jun 11, 2012 at 18:25

Jonathan


6 Comments

Obto Over a year ago

I was using this before, but I wanted to use a native solution for sake of speed. I may have to default back to this though :(

Jonathan Over a year ago

@Obto I use this on my small sites so I have no problems with speed.

Jonathan Over a year ago

I have updated this for another solution that should be quicker.

Obto Over a year ago

Sadly that doesn't work. the fb namespace prefix is stripped out while parsing the html. So when searching this will find nothing, you'd have to search for "like" instead.

Obto Over a year ago

Thought of doing that, but the pages don't parse at all heh. DOMDocument's loadHTML apparently has a LOT of html info built in.


Explosion Pills · Accepted Answer · 2012-06-12 16:35:32Z

0

Haven't been able to find a way to do it with DOM. I'm surprised the regex is slower than DOMDocument as that's usually not the case for me. strpos should be the fastest, though:

strpos($dom, '<fb:like');

This only finds the first occurrence, but you can write a simple recursive function that changes the offset appropriately.
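
A minimal sketch of that offset-advancing idea, written as a loop instead of recursion. The function name `countOccurrences` is hypothetical, and `stripos` is swapped in for `strpos` so that `<FB:LIKE` is counted too. Note that a plain substring search for `<fb:like` would also match longer tag names that start the same way.

```php
<?php
// Hypothetical helper: counts non-overlapping, case-insensitive
// occurrences of $needle in $haystack.
function countOccurrences(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0; // an empty needle would never advance the offset
    }

    $count = 0;
    $offset = 0;
    while (($pos = stripos($haystack, $needle, $offset)) !== false) {
        $count++;
        // Continue searching just past the match we found.
        $offset = $pos + strlen($needle);
    }
    return $count;
}

$html = '<fb:like></fb:like><p>text</p><FB:LIKE send="true"></FB:LIKE>';

echo countOccurrences($html, '<fb:like'), "\n";

// The built-in substr_count() does the same for case-sensitive matches.
echo substr_count(strtolower($html), '<fb:like'), "\n";
```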

answered Jun 12, 2012 at 16:35

Explosion Pills


BernieMaier · Accepted Answer · 2016-02-16 16:14:55Z

-1

I tried the regex solution, but there's a problem with the closing tags, as they do not accept attributes!

<ns namespace="node">text</ns>

(Above all, the regex didn't look for closing tags.) So finally I did some UGLY stuff like

$output = preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3' , $output);

and

$output = preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3' , $output);
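
The two replacements above can be wrapped into a pair of helpers, with the local-name pattern widened so that hyphenated names such as `gcse:searchbox-resultsonly` survive the round trip. This is only a sketch under stated assumptions: the function names `hideNamespaces` and `restoreNamespaces` and the constant `NS_MARKER` are hypothetical, prefixes are assumed to be plain ASCII letters and digits, and the marker string is assumed never to occur in the real markup.

```php
<?php
// Hypothetical placeholder that stands in for the colon while parsing.
const NS_MARKER = 'thistaghasanamespace';

// Call before loadHTML(): <fb:like> and </fb:like> lose their colon.
function hideNamespaces(string $html): string
{
    return preg_replace(
        '/<(\/?)([A-Za-z][A-Za-z0-9]*):([A-Za-z][\w-]*)/',
        '<${1}${2}' . NS_MARKER . '${3}',
        $html
    );
}

// Call on the saved output: put the colon back into opening and closing tags.
function restoreNamespaces(string $html): string
{
    return preg_replace(
        '/<(\/?)([A-Za-z][A-Za-z0-9]*?)' . NS_MARKER . '([A-Za-z][\w-]*)/',
        '<${1}${2}:${3}',
        $html
    );
}

$input = '<gcse:searchbox-resultsonly></gcse:searchbox-resultsonly>'
       . '<fb:like send="true"></fb:like>';

$hidden = hideNamespaces($input);
echo $hidden, "\n";
echo restoreNamespaces($hidden), "\n";
```

In practice the DOMDocument work would happen between the two calls, on `$hidden`. Since the HTML parser may lowercase tag names, the marker is kept all lowercase.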

answered Feb 16, 2016 at 16:14

BernieMaier
