2

I'm trying to format an XML document, so I pass a string into a method, such as:

"<foo><subfoo><subsubfoo>content</subsubfoo></subfoo><subfoo/></foo>"

And I'm trying to split it based on finding the tags. I want to split each element (a tag, or content) into a unique string, such as:

"<foo>", "<subfoo>", "<subsubfoo>", "content", "</subsubfoo>", "</subfoo>", "<subfoo/>", "</foo>"

And to this end I use the code:

string findTagString = "(?<=<.*?>)";
Regex findTag = new Regex(findTagString);
List<string> textList = findTag.Split(text).ToList();

The above code works fine, except that it doesn't split "content" into its own string. Instead, it produces:

"<foo>", "<subfoo>", "<subsubfoo>", "content</subsubfoo>", "</subfoo>", "<subfoo/>", "</foo>"

Is there a way to rewrite the Regex to accomplish this, i.e. to split the non-matching text into its own string?

Or, rephrased: Is it possible to split a string before AND after a Regex match?

2
  • WHY do you want to do this? What is the end goal? There are probably more efficient ways to do this. Commented Jul 10, 2012 at 18:50
  • I'm just trying to create a group containing each tag or element so I can format them and place them into a FlowDocument to load into a RichTextBox (WPF). This is just how I'm aiming to break it into parts so I can examine, format, and insert the pieces. Commented Jul 10, 2012 at 19:01

4 Answers

4

Use this regex: (<.*?>)|(.+?(?=<|$)) and collect the match values into a List<string>.
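A minimal sketch of how that suggestion might be wired up in C#. The class name TagTokenizer and the method name Tokenize are hypothetical, chosen only for illustration; the pattern itself is the one given in this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTokenizer
{
    // Hypothetical helper: returns each tag or text run as its own string.
    public static List<string> Tokenize(string text)
    {
        // Alternative 1 matches a single tag; alternative 2 matches a run of
        // text up to (but not including) the next '<' or the end of the input.
        var pattern = new Regex(@"(<.*?>)|(.+?(?=<|$))");
        return pattern.Matches(text)
                      .Cast<Match>()          // MatchCollection -> IEnumerable<Match>
                      .Select(m => m.Value)   // keep only the matched text
                      .ToList();
    }
}
```

Because this uses Matches rather than Split, every returned entry is a non-empty match. Note that . does not match a newline by default, so text content spanning several lines would probably need RegexOptions.Singleline or a different pattern.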

2 Comments

Thanks, that does the trick. Is there any way to remove the empty strings/not pick them up in the first place besides iterating through the list and removing empty ones?
You can strip the empty tags recursively, or use the regex (?<=>)([^<>]+?)(?=<) to get only the values between tags.
2

If you ignore the HTML specification, < and > carry no special meaning beyond delimiting tags.

So this can be done with a simple split on (?<=>)|(?=<).

This yields

<foo>
<subfoo>
<subsubfoo>
content
</subsubfoo>
</subfoo>
<subfoo/>
</foo>
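A hedged sketch of the same split in C#. The names TagSplitter and SplitOnTagBoundaries are hypothetical. One assumption worth stating: .NET's Regex.Split includes an empty string when a match occurs at the very start or end of the input, which happens here because the input begins with < and ends with >, so the sketch filters empty pieces out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Hypothetical helper: splits just after every '>' and just before every '<'.
    public static List<string> SplitOnTagBoundaries(string text)
    {
        // Both alternatives are zero-width, so no characters are consumed;
        // the input is only cut at the tag boundaries.
        return Regex.Split(text, @"(?<=>)|(?=<)")
                    .Where(piece => piece.Length > 0) // drop empty edge entries
                    .ToList();
    }
}
```

The filter also covers the earlier follow-up about avoiding empty strings without a separate pass over the list.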

1

XML is not a regular language (this can be proven with the pumping lemma), so XML cannot be parsed with regular expressions.

I suggest you find a good XML library and use it.

1 Comment

I'm really just trying to do very basic formatting for the user so it can catch if they don't include a closing tag, or leave an attribute open. A very basic version of NotePad++'s XML view if you will. Thus I don't care what the tag says, just that there is a tag. So the fact that the language isn't finite, and thus isn't Regular, isn't of real concern for my application. Otherwise you would be right. Thanks for your help, SchighSchagh.
1

You can do this via regex or XPath, depending on the complexity of the XML.

If you want to use regular expressions, you'd probably want to do something like this:

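// Requires: using System.Text.RegularExpressions;
public static string xml = "<foo><subfoo><subsubfoo>content</subsubfoo></subfoo><subfoo/></foo>";

// Group 1 captures the tag name, group 2 the content up to the matching closing tag.
public static Regex re = new Regex(@"\<([A-Za-z0-9]*)\b[^>]*\>(.*?)\</\1\>");

static string GetContentViaRegex()
{
    string content = xml;

    // Keep drilling into the first matched element until no element is left.
    for (Match match = re.Match(content); match.Success; match = re.Match(content))
    {
        content = match.Groups[2].Value;
    }
    return content;
}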

The regex searches for matching opening and closing tags (you don't want to match something like <foo>stuff here, possibly including more tags</bar>), and the loop keeps drilling into the matched element until it reaches the innermost content. The [^>]* part tolerates attributes on the opening tag, provided no attribute value contains a > character.

If you wanted to do this via XPath, you could do something like this:

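// Requires: using System.IO; using System.Xml.XPath;
static string GetContentViaXPath()
{
    var nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
    // "//text()" selects text nodes anywhere in the document; take the first one.
    return nav.SelectSingleNode("//text()").Value;
}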

This grabs the first text node it finds in the document. (You'd want to add error checking unless you're sure the input will always be valid.)

1 Comment

Nice regex for getting the whole XML element with its subtree. Very useful when you are working with XML fragments that are not well formed, where XmlDocument and XmlReader will throw exceptions.
