1

I'm not very good with regex, so I'd appreciate it if somebody could help me with this one (it may be trivial).

[update] First, I'm not looking for the best way of manipulating XML (SimpleXMLElement, DOM, etc. are fine). I'm just looking for this regex, outside the context of XML.

I have XML like this:

<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>

I would like to extract each <node> together with all the other elements and text that follow it, up to the next <node>. The result could look like:

Array {
 [0] = "<node>21</node> som text with <entite>some</entite> other <b>nodes</b>",
 [1] = "<node>22</node> some text"
}

I don't want to use DOMElement for parsing the XML, so I'm really looking for a regex.

Thanks if you have an idea.

2
  • 2
    This would be much better if the text you wanted to parse was also xml Commented Jul 26, 2010 at 23:41
  • I know, but unfortunately the XML source is like that and I can't change it. Commented Jul 27, 2010 at 0:06

2 Answers 2

6

Please don't use regexes to parse XML. That's what XML parsers are for.

PHP has several built right in. Try the DOM or SimpleXML on for size. Given your requirement of picking up text nodes between two sibling tags, you might also consider XMLReader; it may well be easier to work with for this specific task.
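As a rough illustration of the DOM route, here is a minimal sketch that walks the children of the root element and starts a new chunk at every <node>. The function name chunkByNode is made up for this example, and it assumes the input is well-formed XML with the <node> elements as direct children of the root.

```php
<?php

// Hypothetical helper: groups each <node> with the siblings that follow it,
// up to (but not including) the next <node>.
function chunkByNode(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $chunks  = [];
    $current = null; // stays null until the first <node> is seen

    foreach ($doc->documentElement->childNodes as $child) {
        if ($child instanceof DOMElement && $child->nodeName === 'node') {
            // A new <node> closes the previous chunk and opens a fresh one.
            if ($current !== null) {
                $chunks[] = trim($current);
            }
            $current = '';
        }
        if ($current !== null) {
            // saveXML($child) serializes only that node, markup included.
            $current .= $doc->saveXML($child);
        }
    }
    if ($current !== null) {
        $chunks[] = trim($current);
    }

    return $chunks;
}

$xml = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

print_r(chunkByNode($xml));
```

For the sample document this should yield two strings, one per <node>, each including the trailing text and inline elements.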


7 Comments

Parsing HTML, XHTML, or XML with regular expressions is just asking for trouble. In fact, there's a (somewhat) famous post here on Stack Overflow on this subject, but I don't know where it is right now.
With SimpleXML you can't manipulate text blocks. The DOM is fine, and I use it for other parts of my app, but I need this regex for some specific reasons. It's a regex question, not one about XML processing and parsing :) I just want the expression.
Note that the user explicitly says that (s)he doesn't want to use the DOM, perhaps it failed, or the XML is not real XML, or (s)he wants to experiment with regexes?
@Abel, that phrase was added by the asker after I created my answer... and doesn't change the answer at all. Parsing XML with regular expressions may invoke the wrath of Zalgo himself.
I didn't mean it changes your answer, and I wasn't aware he added it later. Yes, it's evil, I know. It somehow reminds me of earlier well-known XML frameworks that were based solely on regex parsing (one such in Perl, iirc). See also his own remark under your answer: there's apparently a pressing reason to risk Zalgo's wrath :)
1

Use splitting to chunk this down:

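<?php

$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

// Split at every position followed by an opening <node ...> tag or by </myxml>.
$parts = preg_split( "~(?=<node(?:[^>]|\".*?\"|'.*?')*>|</myxml>)~", $str );

// Drop the first piece ("<myxml>") and the last piece ("</myxml>"),
// then trim the newline left at the end of each chunk.
$res = array_map( 'trim', array_slice( $parts, 1, -1 ) );
print_r( $res );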

Breakdown of the expression:

(?=           # lookahead: match the position just before
  <node       # "<node"
  (?:         # match and don't capture this group
    [^>]        # match non ">"
    |           # OR
    \".*?\"     # match '"' and anything (don't be greedy) until the next '"'
    |           # OR
    '.*?'       # match "'" and anything (don't be greedy) until the next "'"
  )*          # ... as often as you like
  >           # ">"
  |           # OR
  </myxml>    # "</myxml>"
)             # end of lookahead

You can drop the (?:[^>]|\".*?\"|'.*?')* part if you are sure that <node> never has any attributes.
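With the attribute handling removed, the split might look like the sketch below. The helper name splitNodes is invented for illustration; it assumes <node> never carries attributes and that the wrapper element is literally <myxml>.

```php
<?php

// Hypothetical helper: simplified split for attribute-free <node> tags.
function splitNodes(string $str): array
{
    // Split at every position that is followed by "<node>" or "</myxml>".
    $parts = preg_split('~(?=<node>|</myxml>)~', $str);

    // Drop the leading "<myxml>" piece and the trailing "</myxml>" piece,
    // then trim the newline left at the end of each chunk.
    return array_map('trim', array_slice($parts, 1, -1));
}

$str = <<<EOT
<myxml>
<node>21</node> som text with <entite>some</entite> other <b>nodes</b>
<node>22</node> some text
</myxml>
EOT;

print_r(splitNodes($str));
```

The lookahead matches a position rather than any characters, so nothing is consumed by the split and each chunk keeps its own <node> tag.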

Mandatory disclaimer: Please don't do this. Parsing XML with regexp is a really bad idea!

1 Comment

thanks a lot for the answer and the explanation of the regex !
