I am new to regex. I want to capture only the text portion of <firstpar>, or alternatively remove every <asmbly> element along with all of its child nodes and values. Can anyone show me how to do that? The following is a snapshot of the XML file. Thanks.

<?xml version="1.0" encoding="UTF-8"?>
<firstpar>
    <thumbcred>Sample 1 thumbcred</thumbcred>
    <asmbly>
       <caption>
           <p><work ty="drawing">Two Fabulous Animals</work>Sample 1 <e> sample 1caption </e></p>
        </caption>
        <credit>Paul Miller/AP</credit>
        <asset id="126099" hgt="450" wdth="289" tmstp="24-OCT-08"
            bintype="2" filename="images/sample126099.jpg" source="eb" bighgt="1600"
            bigwdth="1029" bigfilename="botany003.jpg"
            bigdeployfullfilename="/eb-media/99/126099-050-CAD1EF0A.jpg"
        />

        <copyright>Copyright © 1994-2013 Encyclopædia Britannica,  Inc.</copyright>
    </asmbly>

Sample firstpar text <e>Sample e</e> just some
text <sub>sample sub </sub><e>sample e text again</e> more text with sup sub e. 

    </firstpar>
  • I'm no expert on the matter, but I think you may want an XML parser, not regex. Commented Aug 12, 2013 at 19:19
  • Use an XML parsing library, NOT regex. XML is a context-free language, not a regular language. Commented Aug 12, 2013 at 19:19
  • There are many good (and free) XML parsers available. What language are you using so we can point you towards the right tool and how to use it? Commented Aug 12, 2013 at 19:22
  • I am trying to get the text portion of <firstpar> in C#. Are there any good XML parsers you can suggest? Thanks. Commented Aug 12, 2013 at 19:25
  • Why can't you use LINQ to XML? Commented Aug 12, 2013 at 19:42

1 Answer

Unfortunately, one of the known limitations of regex is that it does not handle nesting.

You can and should use whatever XML parser is available in whatever language you're using.
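
Since the asker mentions C# in the comments, here is a minimal sketch of the parser route using LINQ to XML (`System.Xml.Linq`). It assumes the XML is available as a string and uses a trimmed copy of the sample from the question; the variable names are only illustrative, and it is written with top-level statements, which need C# 9 or later.

```csharp
using System;
using System.Xml.Linq;

// Trimmed copy of the question's XML (the <asset> attributes are shortened).
const string xml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<firstpar>
    <thumbcred>Sample 1 thumbcred</thumbcred>
    <asmbly>
        <caption><p><work ty=""drawing"">Two Fabulous Animals</work>Sample 1 <e> sample 1caption </e></p></caption>
        <credit>Paul Miller/AP</credit>
        <asset id=""126099"" hgt=""450"" wdth=""289"" />
    </asmbly>
Sample firstpar text <e>Sample e</e> just some
text <sub>sample sub </sub><e>sample e text again</e> more text with sup sub e.
</firstpar>";

XDocument doc = XDocument.Parse(xml);

// Remove every <asmbly> element (and everything inside it), at any depth.
doc.Descendants("asmbly").Remove();

// XElement.Value concatenates all remaining descendant text, so the contents
// of the inline <e> and <sub> elements are kept while their tags are dropped.
string raw = doc.Root!.Value;

// Collapse the runs of whitespace left over from the original indentation.
string text = string.Join(" ",
    raw.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries));

Console.WriteLine(text);
```

Note that the text of <thumbcred> is still part of the result here. If only the paragraph text is wanted, `doc.Descendants("thumbcred").Remove();` can be added in the same way before reading `Value`.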


If you have a very specifically formed piece of XML and a very specific goal, then it is possible to use regex to perform some operations on it, but once you try to apply your regex to a non-specific piece of XML, it will be unable to handle it.
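
To illustrate that trade-off, here is a rough C# sketch of the regex route. It only holds under assumptions the question does not guarantee: <asmbly> never nests inside another <asmbly>, is never self-closing, and the string "</asmbly>" never appears inside a comment, CDATA section, or attribute value. The second half shows what happens once the nesting assumption is violated. The sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Singleline lets '.' match newlines; the lazy '.*?' stops at the FIRST </asmbly>.
const string pattern = @"<asmbly\b[^>]*>.*?</asmbly>";

const string xml = @"<firstpar>
    <thumbcred>Sample 1 thumbcred</thumbcred>
    <asmbly>
        <credit>Paul Miller/AP</credit>
    </asmbly>
Sample firstpar text <e>Sample e</e> just some text.
</firstpar>";

// Works for this input because the assumptions listed above hold.
string withoutAsmbly = Regex.Replace(xml, pattern, "", RegexOptions.Singleline);
Console.WriteLine(withoutAsmbly);

// Same pattern on input that breaks the "no nesting" assumption.
const string nested = "<asmbly><asmbly>inner</asmbly>outer</asmbly>";
string broken = Regex.Replace(nested, pattern, "", RegexOptions.Singleline);
Console.WriteLine(broken); // prints "outer</asmbly>", which is no longer well-formed
```

The first call leaves the rest of the document untouched, while the second strips only up to the inner closing tag and leaves a dangling `</asmbly>` behind, which is the kind of failure a parser avoids.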


1 Comment

That first statement is a bit of a generalisation. Both PCRE's and .NET's regex flavor can very well handle nesting (and the OP happens to be using C#), and for some simple cases quite elegantly. It's more that XML is ridiculously complex, because of attribute values, XML comments, CDATA and whatnot which makes it near impossible to write a robust regex on XML.
