XML Regex Extraction

Question

I have an XML file and I need to extract data from it. The task would be trivial if I could use XDocument, but the whole point of the exercise is to write my own parser using regex. The XML looks similar to this:

<A>
    <B>
        <C>ASD</C>
    </B>
    <B>
        <C>ZXC</C>
    </B>
</A>

I came up with the idea of splitting the input into an opening tag, a closing tag, and the content between them.

        string acquiredFile = myStringBuilder.ToString();
        string regexPattern = "(?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>)";
        Regex rx = new Regex(regexPattern, RegexOptions.Singleline);

        foreach (Match match in rx.Matches(acquiredFile))
        {
            Console.WriteLine(match.Groups["open"].Value);
            Console.WriteLine(match.Groups["content"].Value);
            Console.WriteLine(match.Groups["close"].Value);
        }

I need to wrap this in a loop. The extraction above works only when each element contains a single nested element, such as:

<A>
    <B>
        <C>ASD</C>
    </B>
</A>

Could you please help me extend this code so that it works with multiple nested elements?

Your code should work just fine with more than one nested element. (ideone.com/8iKn5i)
– l'L'l, Commented Jul 5, 2014 at 7:59
Unfortunately it does not. I get the as opening tag, as closing tag and <C>ASD</C><C>ZXC</C> as content
– user2847238, Commented Jul 5, 2014 at 8:04
Did you observe the example I linked? Your input might differ perhaps, which could throw it off.
– l'L'l, Commented Jul 5, 2014 at 8:05
The .NET framework has an extensive supply of means to deal with XML the proper way. You are not to use regular expressions on XML. There is no excuse for trying. Please use an API.
– Tomalak, Commented Jul 5, 2014 at 8:12

Community · Accepted Answer · 2017-05-23 12:11:00Z

You can deal with nested elements by recursion:

Wrap the code you use in a function that calls itself on the content of each match:

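static void Parse(string xml)
{
    // yourRegexp is a placeholder for the pattern with the open/content/close groups.
    var matches = Regex.Matches(xml, yourRegexp, RegexOptions.Singleline);
    if (matches.Count == 0)
    {
        // No element found in this fragment, so treat it as plain text content.
        Console.WriteLine("CONTENT:" + xml);
    }
    foreach (Match match in matches)
    {
        Console.WriteLine("OPEN:" + match.Groups["open"].Value);
        Parse(match.Groups["content"].Value); // recurse into the element body
        Console.WriteLine("CLOSE:" + match.Groups["close"].Value);
    }
}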

However, let me discourage you a bit first:

The above approach will not work with your regex (?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>).
The first problem, as you mentioned, is multiple consecutive <B>...</B><B>...</B> elements. Because .* is greedy, your regex will capture everything from the first <B> to the last </B> in a single match.
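
To make the greedy behaviour visible, here is a minimal sketch that runs the pattern from the question, unchanged, on two sibling B elements. The class name GreedyDemo and the helper Run are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDemo
{
    // The pattern from the question, copied as-is.
    const string Pattern =
        "(?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>)";

    // Hypothetical helper: returns every match of the pattern in the input.
    public static MatchCollection Run(string xml)
    {
        return Regex.Matches(xml, Pattern, RegexOptions.Singleline);
    }

    static void Main()
    {
        string xml = "<B><C>ASD</C></B><B><C>ZXC</C></B>";
        foreach (Match m in Run(xml))
        {
            Console.WriteLine("open:    " + m.Groups["open"].Value);
            Console.WriteLine("content: " + m.Groups["content"].Value);
            Console.WriteLine("close:   " + m.Groups["close"].Value);
        }
        // Only one match is reported. Its content is
        // <C>ASD</C></B><B><C>ZXC</C>, so the boundary between the two
        // B elements is swallowed by the greedy .* group.
    }
}
```

Recursing into that content does not help, because the fragment is no longer a well-formed piece of XML.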

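Now, a simple bugfix for this problem would be the regex <(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>, which non-greedily matches everything between the first <TAGNAME> and the next </TAGNAME2>, where TAGNAME and TAGNAME2 are the same string. (The class [A-Za-z0-9] replaces [A-z0-9], which also matches the punctuation characters that sit between Z and a in ASCII.)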
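
As a rough sketch of how the recursion and a lazy, backreferenced pattern could fit together, the following assumes the pattern <(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>, where the closing tag must repeat the captured name. The names RecursiveRegexDemo, Parse and Walk are invented for this example. Since this pattern has no close group, the sketch reports the tag name from the open group on both sides, and it silently drops text that sits next to child elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RecursiveRegexDemo
{
    // Lazy content plus a named backreference to the opening tag name.
    static readonly Regex Element = new Regex(
        @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>",
        RegexOptions.Singleline);

    // Hypothetical entry point: returns a flat list of OPEN/TEXT/CLOSE events.
    public static List<string> Parse(string xml)
    {
        var events = new List<string>();
        Walk(xml, events);
        return events;
    }

    static void Walk(string fragment, List<string> events)
    {
        MatchCollection matches = Element.Matches(fragment);
        if (matches.Count == 0)
        {
            // No element in this fragment: record non-blank text, if any.
            string text = fragment.Trim();
            if (text.Length > 0) events.Add("TEXT " + text);
            return;
        }
        foreach (Match m in matches)
        {
            events.Add("OPEN " + m.Groups["open"].Value);
            Walk(m.Groups["content"].Value, events); // recurse into the body
            events.Add("CLOSE " + m.Groups["open"].Value);
        }
    }

    static void Main()
    {
        string xml = "<A>\n  <B>\n    <C>ASD</C>\n  </B>\n  <B>\n    <C>ZXC</C>\n  </B>\n</A>";
        foreach (string e in Parse(xml)) Console.WriteLine(e);
    }
}
```

On the sample document from the question this reports both B elements separately, each with its own C child and text.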

Looks good? Well, it is not, because this regex will fail for nested elements with the same name, like <C><C></C></C>.
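
To see the same-name problem concretely, here is a small sketch that applies a lazy, backreferenced pattern (assumed here to be <(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>) to a C element nested inside another C element. SameNameDemo and FirstContent are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

static class SameNameDemo
{
    const string Pattern =
        @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>";

    // Hypothetical helper: content group of the first match in the input.
    public static string FirstContent(string xml)
    {
        return Regex.Match(xml, Pattern, RegexOptions.Singleline)
                    .Groups["content"].Value;
    }

    static void Main()
    {
        // The lazy group stops at the first </C>, which closes the INNER
        // element, so the outer element is paired with the wrong closing tag.
        Console.WriteLine(FirstContent("<C><C>inner</C></C>")); // prints "<C>inner"
    }
}
```

The captured content is an unbalanced fragment, and the final closing tag is left over unmatched.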

You will keep running into problems like these. As your regexes grow more complicated, there will always be some counterexample that breaks them.

This is because regular expressions are the wrong tool for this sort of task. You are trying to describe a Chomsky type 2 (context-free) grammar with a tool built for Chomsky type 3 (regular) grammars. (Also see this humorous take on the subject.)

In the end, writing a proper parser for XML is far from a simple task. That is why the usual recommendation is to always use one of the standard ones.
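
For comparison, outside the constraints of the exercise, a minimal sketch with the standard System.Xml.Linq API might look like this. The class XDocumentDemo and the helper ReadCValues are names chosen for this example, and the element name C comes from the sample in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class XDocumentDemo
{
    // Hypothetical helper: text of every C element, at any nesting depth.
    public static List<string> ReadCValues(string xml)
    {
        return XDocument.Parse(xml)
                        .Descendants("C")
                        .Select(c => c.Value)
                        .ToList();
    }

    static void Main()
    {
        string xml = "<A><B><C>ASD</C></B><B><C>ZXC</C></B></A>";
        Console.WriteLine(string.Join(", ", ReadCValues(xml))); // prints "ASD, ZXC"
    }
}
```

The parser takes care of nesting, repeated element names and whitespace, which are exactly the cases the regex approach struggles with.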
