Split String by XML elements

Question

I have the following input string, which comes from a 10 MB text file:

string data = "0x52341\n0x52341<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>0x52341\n0x52341 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

Now I want to split this string by the element1 and element2 XML nodes.

The result in this case should be:

output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";

My effort:

I have tried a regular expression, but that is very slow on a file this big. I have also tried:

string[] output= input.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);

string.Split() is not practical here because it throws OutOfMemoryException, and the delimiter is removed from the result when splitting.

Question: is there an easy way to parse those XML elements out of my text file?

Update: I simplified my file because I couldn't post the whole 10 MB file on SO. Sometimes there are 0x1234 values between the XML elements, sometimes not.

I understand that you are using C#. There are plenty of tools for HTML parsing (Selenium, .NET HtmlAgilityPack, mshtml); why didn't you use them for this purpose?
– Leon Barkan, Commented Oct 16, 2015 at 12:03
If you are dealing with XML, use an XML parser like LINQ to XML.
– juharr, Commented Oct 16, 2015 at 12:08
@juharr - that doesn't work, I already tried that approach. I don't have one root, I have many roots (element1 and element2).
– Byyo, Commented Oct 16, 2015 at 12:11
@kenny - no, the whole file is not, but the element1 and element2 fragments are valid XML.
– Byyo, Commented Oct 16, 2015 at 13:00

Anton Gogolev · Accepted Answer · 2015-10-16 12:25:34Z

3

If you can guarantee that each <elementX></elementX> fragment is a well-formed XML node (so to speak), wrap the entire string in <elements> ... </elements> and deal with it using standard .NET approaches, be it XmlDocument, Linq to XML or whatever else fits you.
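A minimal sketch of the wrapping idea, under a few assumptions that are not part of the answer: the class and method names (`WrapAndParse`, `Extract`) are hypothetical, the text between fragments contains no stray `<` or `&`, and any embedded XML declaration is stripped first, because a declaration is only legal as the very first node of a document. The whole input is loaded into memory, so this may not suit much larger files.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class WrapAndParse
{
    // Hypothetical helper: returns the element1/element2 fragments as strings.
    public static string[] Extract(string data)
    {
        // Remove embedded XML declarations; they would be illegal inside the wrapper.
        string body = Regex.Replace(data, @"<\?xml\b[^?]*\?>", "");

        // Wrap everything in a single artificial root so the text parses as one document.
        XElement root = XElement.Parse("<elements>" + body + "</elements>");

        // Only direct children named element1 or element2 are returned;
        // the 0x... text nodes between them are ignored.
        return root.Elements()
            .Where(e => e.Name.LocalName == "element1" || e.Name.LocalName == "element2")
            .Select(e => e.ToString(SaveOptions.DisableFormatting))
            .ToArray();
    }
}
```

Calling `WrapAndParse.Extract(data)` on the sample string from the question should yield the four fragments in document order.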

answered Oct 16, 2015 at 12:25


6 Comments

Anton Gogolev Over a year ago

@Byyo Where and what exactly throws? Having text in between tags does not make XML ill-formed.

Rom Eh Over a year ago

@Byyo you can parse the text block by block (4096 characters), and delete these characters.

Byyo Over a year ago

XDocument xDoc = XDocument.Parse() throws "Unexpected XML declaration. The XML declaration must be the first node in the document".

Byyo Over a year ago

XmlDocument x = new XmlDocument(); x.LoadXml(file); throws exactly the same error - just tested

Anton Gogolev Over a year ago

@Byyo Paste here whatever there is in file, verbatim.


fixagon · Accepted Answer · 2015-10-16 13:29:41Z

1

EDIT: A faster alternative (it does not use Regex), which also leaves any 0x... fragments inside the elements' content untouched, would be the following:

string data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

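// Requires: using System.IO; using System.Xml; using System.Xml.Linq;
XmlReaderSettings xrs = new XmlReaderSettings();
// Fragment conformance allows several top-level elements with text between them.
xrs.ConformanceLevel = ConformanceLevel.Fragment;
XDocument doc = new XDocument(new XElement("root"));
XElement root = doc.Root;

// Copy the sample string into a stream so it can be read like a file.
using(var ms = new StreamWriter(new MemoryStream()))
{
    ms.Write(data);
    ms.Flush();
    ms.BaseStream.Position = 0;
    using (StreamReader fs = new StreamReader(ms.BaseStream))
    //using (StreamReader fs = new StreamReader("file.xml"))
    {
        using (XmlReader rdr = XmlReader.Create(fs, xrs))
        {
            while (rdr.Read())
            {
                // Top-level text such as 0x52341 is skipped; only elements are collected.
                if (rdr.NodeType == XmlNodeType.Element)
                {
                    // ReadSubtree covers the current element and all of its descendants.
                    root.Add(XElement.Load(rdr.ReadSubtree()));
                }
            }
        }
    }
}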

You could also read directly from the file by using the StreamReader constructor that takes a file path (see the commented-out line) and removing the StreamWriter part.
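To get the fragments back as strings, as the question asks, the same fragment-mode reader can be wrapped in an iterator. This is only a sketch: `FragmentReader` and `Extract` are hypothetical names, and it assumes that any XML declaration is the very first thing in the input, as in the sample string above. It uses `XNode.ReadFrom`, which leaves the reader positioned just after the element it consumed, so the loop must not call `Read()` again in that branch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class FragmentReader
{
    // Hypothetical helper: lazily yields each matching element as a string.
    public static IEnumerable<string> Extract(TextReader input, params string[] names)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && names.Contains(reader.Name))
                {
                    // ReadFrom consumes the whole element and advances past its end tag.
                    var element = (XElement)XNode.ReadFrom(reader);
                    yield return element.ToString(SaveOptions.DisableFormatting);
                }
                else
                {
                    // Skip text, declarations and anything else that is not a match.
                    reader.Read();
                }
            }
        }
    }
}
```

For a file, something like `FragmentReader.Extract(new StreamReader("file.xml"), "element1", "element2")` could be enumerated with `foreach`, so only one fragment is materialized at a time.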

edited Oct 16, 2015 at 13:29

answered Oct 16, 2015 at 12:44


7 Comments

fixagon Over a year ago

Check my edit: using no regex should be faster, and it won't replace the content within the elementX's.

fixagon Over a year ago

following exception: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.

fixagon Over a year ago

You have to remove the white space before the "<?xml", otherwise it's not valid XML. If it is in the file anyway, you have to remove it first.

Byyo Over a year ago

i think the error is because it's <root><?xml version=\"1.0\" encoding=\"UTF-8\"?> instead of <?xml version=\"1.0\" encoding=\"UTF-8\"?><root> - removing the first space doesn't help

fixagon Over a year ago

Use the second example on its own; don't use the code from before the edit (only its first line, which creates the example data). I edited the code, so now it should work.


Lorek · Accepted Answer · 2015-10-16 12:33:34Z

0

Here is a console app that will do it:

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        string source = "0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
        List<string> components = new List<string>();
        while (source.Length > 0)
        {
            // A component starts at the next '<' ...
            int start = source.IndexOf('<');
            if (-1 == start)
                break;
            // ... and ends right before the next "0x" marker.
            int next = source.IndexOf("0x", start, StringComparison.OrdinalIgnoreCase);
            if (-1 == next)
                break;
            components.Add(source.Substring(start, next - start));
            // Continue scanning from the marker onwards.
            source = source.Substring(next);
        }
        foreach (string s in components)
            Console.WriteLine(s);
        Console.ReadLine();
    }
}

Try that out.

answered Oct 16, 2015 at 12:33


3 Comments

Byyo Over a year ago

that doesn't work because i also have 0x values within my <elementx>

Lorek Over a year ago

Ah, okay. I didn't see that case in the example data. I just realized I also do not handle the case where you are leading one of the components with a space instead of a hex value. I'll give it some more thought.

Byyo Over a year ago

i simplified my file because i couldn't post the whole 10MB file in SO - sometimes there are 0x1234 values sometimes not, so i can't rely on them
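Since the 0x markers cannot be relied on, one possible variant of the index-based scan keys on the tag names instead. The sketch below is not Lorek's code; `TagScanner` and its methods are hypothetical, and it assumes that an element never contains a nested element of the same name and is never self-closing. The text between the tags is not parsed at all, so 0x values inside an element are kept as-is.

```csharp
using System;
using System.Collections.Generic;

static class TagScanner
{
    // Hypothetical helper: cuts out every <name ...>...</name> span for the given names.
    public static List<string> Extract(string source, params string[] names)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < source.Length)
        {
            // Find the earliest opening tag among all requested names.
            int start = -1;
            string name = null;
            foreach (string n in names)
            {
                int i = IndexOfOpenTag(source, n, pos);
                if (i != -1 && (start == -1 || i < start)) { start = i; name = n; }
            }
            if (start == -1) break;

            // The fragment ends with the first matching closing tag (no same-name nesting assumed).
            string close = "</" + name + ">";
            int end = source.IndexOf(close, start, StringComparison.Ordinal);
            if (end == -1) break;
            end += close.Length;

            result.Add(source.Substring(start, end - start));
            pos = end;
        }
        return result;
    }

    // Finds "<name" followed by '>' or white space, so "<element1" does not match "<element10".
    static int IndexOfOpenTag(string s, string name, int from)
    {
        string open = "<" + name;
        int i = from;
        while ((i = s.IndexOf(open, i, StringComparison.Ordinal)) != -1)
        {
            int after = i + open.Length;
            if (after < s.Length && (s[after] == '>' || char.IsWhiteSpace(s[after])))
                return i;
            i = after;
        }
        return -1;
    }
}
```

Because the fragments are returned as raw substrings, each one can still be handed to `XElement.Parse` afterwards if a parsed element is needed.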

Ondrej Svejdar · Accepted Answer · 2015-10-16 14:20:00Z

This processes the file as a stream: it looks for the opening and closing tags of element1 and element2 and parses only those elements, one at a time:

  // Requires: using System; using System.IO; using System.Text; using System.Xml.Linq;
  using (var stream = File.OpenRead("..."))
  {
    StringBuilder builder = null; // the tag currently being read, from '<' to '>'
    StringBuilder xml = null;     // the element currently being collected
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
      while (!reader.EndOfStream)
      {
        char c = (char)reader.Read();
        if (c == '<' && builder == null)
        {
          builder = new StringBuilder();
        }
        if (builder != null)
        {
          builder.Append(c);
        }
        if (xml != null)
        {
          xml.Append(c);
        }

        // A stray '>' outside of a tag (builder == null) is ignored.
        if (c == '>' && builder != null)
        {
          var token = builder.ToString();
          if (xml == null)
          {
            // Opening tag of a wanted element: start collecting.
            if (token.StartsWith("<element1", StringComparison.Ordinal) || token.StartsWith("<element2", StringComparison.Ordinal))
            {
              xml = new StringBuilder("<?xml version='1.0' encoding='utf-8' ?>");
              xml.Append(token);
            }
          }
          else
          {
            // Closing tag of a wanted element: parse what was collected.
            if (token.StartsWith("</element1", StringComparison.Ordinal) || token.StartsWith("</element2", StringComparison.Ordinal))
            {
              XElement element = XElement.Parse(xml.ToString());
              // do something with the element
              xml = null;
            }
          }
          builder = null;
        }
      }
    }
  }
