0

I have the following input string, which comes from a 10MB text file:

string data = "0x52341\n0x52341<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>0x52341\n0x52341 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

Now I want to split this string into its element1 and element2 XML nodes.

The result in this case should be:

output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";

My effort:

I have tried a regular expression, but that is very slow on a file this large. I have also tried:

string[] output= input.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);

string.Split() is unsuitable because it throws OutOfMemoryException and removes the delimiters while splitting.

Question: is there an easy way to parse those XML elements out of my text file?

Update: I simplified my file because I couldn't post the whole 10MB file on SO. Sometimes there are 0x1234 values between the XML elements, sometimes not.

4
  • I understand that you are using C#. There are plenty of tools for HTML parsing (Selenium, the .NET HtmlAgilityPack, mshtml); why didn't you use them for this purpose? Commented Oct 16, 2015 at 12:03
  • 1
    If you are dealing with XML use an XML parser like Linq-to-XML. Commented Oct 16, 2015 at 12:08
  • @juharr - that doesn't work, I already tried that approach. I don't have one root, I have many roots (element1 and element2). Commented Oct 16, 2015 at 12:11
  • @kenny - no, the whole file is not valid XML, but the element1 and element2 fragments are. Commented Oct 16, 2015 at 13:00

4 Answers

3

If you can guarantee that each <elementX></elementX> fragment is a well-formed XML node (so to speak), wrap the entire string in <elements> ... </elements> and deal with it using standard .NET approaches, be it XmlDocument, Linq to XML or whatever else fits you.
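A minimal sketch of this wrapping idea, not a definitive implementation. It assumes that any stray `<?xml ... ?>` declarations can simply be cut out first (a declaration in the middle of the text is what triggers the "Unexpected XML declaration" error), and that the non-XML text between the fragments never contains `<` or `&`. The names `WrapAndParse` and `ExtractElements` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class WrapAndParse
{
    // Hypothetical helper: returns each top-level element as a string.
    public static string[] ExtractElements(string data)
    {
        // Assumption: XML declarations carry no needed information,
        // so every "<?xml ... ?>" is removed before parsing.
        int start;
        while ((start = data.IndexOf("<?xml", StringComparison.Ordinal)) >= 0)
        {
            int end = data.IndexOf("?>", start, StringComparison.Ordinal);
            if (end < 0)
                break; // unterminated declaration: leave the rest untouched
            data = data.Remove(start, end + 2 - start);
        }

        // Wrap everything in one artificial root so the text is a single document.
        XElement root = XElement.Parse("<elements>" + data + "</elements>");

        // Elements() skips the text nodes (the 0x... values) between fragments.
        return root.Elements()
                   .Select(e => e.ToString(SaveOptions.DisableFormatting))
                   .ToArray();
    }
}
```

Note that this loads the whole input into memory more than once, so for a 10MB file it trades memory for simplicity.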


6 Comments

@Byyo Where and what exactly throws? Having text inbetween tags does not make XML ill-formed.
@Byyo you can parse the text block by block (4096 characters), and delete these characters.
XDocument xDoc = XDocument.Parse() throws "Unexpected XML declaration. The XML declaration must be the first node in the document".
XmlDocument x = new XmlDocument(); x.LoadXml(file); throws exactly the same error - just tested
@Byyo Paste here whatever there is in file, verbatim.
1

EDIT: A faster alternative (because it does not use Regex), which also does not replace 0x... fragments inside the content of the elements, is the following:

string data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

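// Requires: using System.IO; using System.Xml; using System.Xml.Linq;

// Fragment conformance lets the reader accept several top-level elements
// and stray text between them, instead of requiring a single root.
XmlReaderSettings xrs = new XmlReaderSettings();
xrs.ConformanceLevel = ConformanceLevel.Fragment;
// Every element found is collected under an artificial <root> element.
XDocument doc = new XDocument(new XElement("root"));
XElement root = doc.Root;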

using(var ms = new StreamWriter(new MemoryStream()))
{
    ms.Write(data);
    ms.Flush();
    ms.BaseStream.Position = 0;
    using (StreamReader fs = new StreamReader(ms.BaseStream))
    //using (StreamReader fs = new StreamReader("file.xml"))
    {
        using (XmlReader rdr = XmlReader.Create(fs, xrs))
        {
            while (rdr.Read())
            {
                if (rdr.NodeType == XmlNodeType.Element)
                {
                    root.Add(XElement.Load(rdr.ReadSubtree()));
                }
            }
        }
    }
}

You could also read directly from the file by using the StreamReader constructor that takes a file path (and removing the StreamWriter part).
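As a rough sketch of that direct-read variant: the reader below takes any TextReader, so a `new StreamReader("file.xml")` can be passed in without the MemoryStream detour. It swaps `ReadSubtree` for `XNode.ReadFrom`, which leaves the reader positioned after the element it consumed. The names `FragmentReader` and `ReadTopLevelElements` are hypothetical, and the sketch assumes the input contains no XML declaration after the first character, since that still fails in fragment mode.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class FragmentReader
{
    // Hypothetical helper: lazily yields every top-level element of the input.
    public static IEnumerable<XElement> ReadTopLevelElements(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            // Allow multiple top-level elements and text between them.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.Read(); // move to the first node, if any
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadFrom consumes the whole element and advances the
                    // reader to the node that follows it, so no extra Read().
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read(); // skip text such as the 0x... values
                }
            }
        }
    }
}
```

Usage could look like `foreach (XElement e in FragmentReader.ReadTopLevelElements(new StreamReader("file.xml"))) { ... }`, which keeps only one element in memory at a time.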

7 Comments

Check my edit: not using Regex should be faster, and it won't replace the content within the elementX fragments.
I get the following exception: Unexpected XML declaration. The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
You have to remove the white space before the "<?xml", otherwise it's not valid XML. If it is in the file anyway, you have to remove it first.
I think the error occurs because it's <root><?xml version=\"1.0\" encoding=\"UTF-8\"?> instead of <?xml version=\"1.0\" encoding=\"UTF-8\"?><root>. Removing the first space doesn't help.
Use the second example on its own; don't use the code from before the edit (only the first line, for creating the example data). I edited the code, so now it should work.
0

Here is a console app that will do it:

using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        string source = "0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
        List<string> components = new List<string>();
        while (source.Length > 0)
        {
            int start = source.IndexOf('<');
            if (-1 == start)
                break;
            int next = source.IndexOf("0x", start, StringComparison.OrdinalIgnoreCase);
            if (-1 == next)
                break;
            components.Add(source.Substring(start, next - start));
            source = source.Substring(next);
        }
        foreach (string s in components)
            Console.WriteLine(s);
        Console.ReadLine();
    }
}

Try that out.

3 Comments

That doesn't work because I also have 0x values within my <elementx>.
Ah, okay. I didn't see that case in the example data. I just realized I also do not handle the case where one of the components is preceded by a space instead of a hex value. I'll give it some more thought.
I simplified my file because I couldn't post the whole 10MB file on SO. Sometimes there are 0x1234 values, sometimes not, so I can't rely on them.
0

This processes the file as a stream: it looks for the opening and closing tags of element1 and element2 and parses only the text between them:

  using (var stream = File.OpenRead("..."))
  {
    StringBuilder builder = null;
    StringBuilder xml = null;
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
      while (!reader.EndOfStream)
      {
        char c = (char)reader.Read();
        if (c == '<' && builder == null)
        {
          builder = new StringBuilder();
        }
        if (builder != null)
        {
          builder.Append(c);
        }
        if (xml != null)
        {
          xml.Append(c);
        }

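        // A '>' that appears in plain text (outside any tag) leaves builder null,
        // so guard against it to avoid a NullReferenceException.
        if (c == '>' && builder != null)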
        {
          var token = builder.ToString();
          if (xml == null)
          {
            if (token.StartsWith("<element1", StringComparison.Ordinal) || token.StartsWith("<element2", StringComparison.Ordinal))
            {
              xml = new StringBuilder("<?xml version='1.0' encoding='utf-8' ?>");
              xml.Append(token);
            }
          }
          else
          {
            if (token.StartsWith("</element1", StringComparison.Ordinal) || token.StartsWith("</element2", StringComparison.Ordinal))
            {
              XElement element = XElement.Parse(xml.ToString());
              // do something with the element
              xml = null;
            }
          }
          builder = null;
        }
      }
    }
  }
