
I am trying to read a 3 GB XML file from a URL and store all the jobs in a DataSet. The XML looks like this:

    <?xml version="1.0"?>
    <feed total="1621473">
      <job>
        <title><![CDATA[Certified Medical Assistant]]></title>
        <date>2016-03-25 14:19:38</date>
        <referencenumber>2089677765</referencenumber>
        <url><![CDATA[http://www.jobs2careers.com/click.php?id=2089677765.1347]]></url>
        <company><![CDATA[Broadway Medical Clinic]]></company>
        <city>Portland</city>
        <state>OR</state>
        <zip>97213</zip>
     </job>
     <job>
        <title><![CDATA[Certified Medical Assistant]]></title>
        <date>2016-03-25 14:19:38</date>
        <referencenumber>2089677765</referencenumber>
        <url><![CDATA[http://www.jobs2careers.com/click.php?id=2089677765.1347]]></url>
        <company><![CDATA[Broadway Medical Clinic]]></company>
        <city>Portland</city>
        <state>OR</state>
        <zip>97213</zip>
     </job>
    </feed>

This is my code:

    XmlDocument doc = new XmlDocument();
    doc.Load(url);
    DataSet ds = new DataSet();
    XmlNodeReader xmlReader = new XmlNodeReader(doc);

    while (xmlReader.ReadToFollowing("job"))
    {
        ds.ReadXml(xmlReader);
    }

But I got an OutOfMemoryException. I searched on Google and found this:

    DataSet ds = new DataSet();
    FileStream filestream = File.OpenRead(url);
    BufferedStream buffered = new BufferedStream(filestream);
    ds.ReadXml(buffered);

Still the same exception. I also read about XmlTextReader, but I don't know how to make use of it in my case. I know why I am getting the exception; I just don't know how to overcome it. Thanks.

  • What are the exception details? I suspect it could be the XmlDocument that is throwing the OutOfMemoryException. The reason: I put together some code to generate a large XML file, and the XmlDocument object I build throws before I can even generate enough data. Maybe related to the internal collection of nodes ({System.Collections.ListDictionaryInternal.NodeKeyValueCollection}). Commented Mar 29, 2016 at 22:08
  • What output do you want? I don't understand "sore all the jobs". Commented Mar 29, 2016 at 22:31
  • @MichaelKay: My bad, edited. I want to store all the jobs in a DataSet so later I can store them all in a database table. Commented Mar 30, 2016 at 0:10
  • @Stringfellow: calling the Load method on an XmlDocument instance tries to load the whole file at once. The file is 3 GB, so the exception happens. Commented Mar 30, 2016 at 0:13

2 Answers


Instead of trying to load the entire file into the DataSet or another container, how about loading it in batches and writing each batch to the database, so whatever holds the batch can be cleared each time?

How to: Perform Streaming Transform of Large XML Documents https://msdn.microsoft.com/en-us/library/bb387013.aspx

        List<XElement> jobs = new List<XElement>();
        using (XmlReader reader = XmlReader.Create(filePath))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if ((reader.NodeType == XmlNodeType.Element) && (reader.Name == "job"))
                {
                    // ReadFrom advances the reader past the </job> end tag,
                    // so do not call Read() again here or a sibling may be skipped.
                    XElement job = (XElement)XNode.ReadFrom(reader);
                    jobs.Add(job);

                    if (jobs.Count >= 1000)
                    {
                        // TODO: write batch to database
                        jobs.Clear();
                    }
                }
                else
                {
                    reader.Read();
                }
            }

            if (jobs.Count > 0)
            {
                // TODO: write remainder to database
                jobs.Clear();
            }
        }
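For the batch-write TODOs above, something like SqlBulkCopy could work. A minimal sketch, assuming SQL Server, a hypothetical `dbo.Jobs` table whose columns match the DataTable built here, and a valid connection string (table and column names are illustrative only):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

static class JobWriter
{
    public static void WriteBatch(List<XElement> jobs, string connectionString)
    {
        // Flatten the buffered <job> elements into a DataTable for bulk copy.
        var table = new DataTable();
        table.Columns.Add("Title", typeof(string));
        table.Columns.Add("ReferenceNumber", typeof(string));
        table.Columns.Add("Company", typeof(string));
        table.Columns.Add("City", typeof(string));
        table.Columns.Add("State", typeof(string));

        foreach (XElement job in jobs)
        {
            table.Rows.Add(
                (string)job.Element("title"),
                (string)job.Element("referencenumber"),
                (string)job.Element("company"),
                (string)job.Element("city"),
                (string)job.Element("state"));
        }

        // Bulk insert is far faster than row-by-row INSERT for 1000-row batches.
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Jobs"; // hypothetical table name
            bulk.WriteToServer(table);
        }
    }
}
```

You would call `JobWriter.WriteBatch(jobs, connectionString)` at each TODO before `jobs.Clear()`.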

An alternative using a DataSet:

        DataSet ds = new DataSet();
        using (XmlReader reader = XmlReader.Create(filePath))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if ((reader.NodeType == XmlNodeType.Element) && (reader.Name == "job"))
                {
                    ds.ReadXml(reader);

                    DataTable dt = ds.Tables["job"];
                    if (dt.Rows.Count >= 1000)
                    {
                        // TODO: write batch to database
                        dt.Rows.Clear();
                    }
                }
            }

            if (ds.Tables["job"].Rows.Count > 0)
            {
                // TODO: write remainder to database
                ds.Tables["job"].Rows.Clear();
            }
        }

3 Comments

Thank you for your time. Using this code, how do I populate my DataSet?
I added an alternative. Is that what you meant about loading a DataSet? I don't know if you can load the entire 3 GB file into a DataSet without encountering the memory problem. Also, by batching you can enable a 'resume' scenario in case processing fails part way.
The DataSet gets populated with 2 rows, and after that the first if statement becomes false. Any idea why? Still working on it. Your solution sounds solid; I will let you know.
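The behavior in the last comment is consistent with DataSet.ReadXml consuming the reader past the next `<job>` start tag, so the element check fails on later passes. One hedged workaround (a sketch, not from the original answer) is to hand ReadXml an isolated subtree so the outer reader's position stays predictable:

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

static class SubtreeDemo
{
    public static int LoadJobs(string xml)
    {
        var ds = new DataSet();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "job")
                {
                    // ReadSubtree hands ReadXml only the current <job> element;
                    // when the subtree reader is disposed, the outer reader is
                    // left on </job>, so the loop's Read() moves to the next
                    // sibling instead of skipping it.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        ds.ReadXml(subtree);
                    }
                }
            }
        }
        return ds.Tables["job"].Rows.Count;
    }

    static void Main()
    {
        string xml = "<feed><job><title>A</title></job><job><title>B</title></job></feed>";
        Console.WriteLine(LoadJobs(xml));
    }
}
```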

doc.Load() is going to read the entire file and fail with that error, and XmlNodeReader will not really do anything for you here. Try this:

using System;
using System.Data;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        const string url = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            int count = 0;
            DataSet ds = new DataSet();
            XmlReader xmlReader = XmlReader.Create(url);
            xmlReader.MoveToContent();
            try
            {
                while (!xmlReader.EOF)
                {
                    xmlReader.ReadToFollowing("job");
                    if (!xmlReader.EOF)
                    {
                        count++;
                        ds.ReadXml(xmlReader);
                    }
                }
            }
            catch (Exception ex)
            {
                // Report how many job elements were read before the failure.
                Console.WriteLine(ex.Message);
                Console.WriteLine("Count : {0}", count);
                Console.ReadLine();
            }
        }
    }
}

4 Comments

I still get System.OutOfMemoryException on ds.ReadXml()
I updated the code to remove some typos. Not sure if it will fix the issue. Do you know how many job elements are read before the exception?
Thank you for your time. No, still the same exception. I tried to debug it, but it doesn't let me know how many rows are read. I guess there must be a way to either break the XML file into chunks and read them one by one, or read the file through a buffer so the whole file doesn't get loaded at once. I just don't know how to achieve it.
Add an exception handler to get the count. You may just be using more memory than your computer has.
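On the buffering point in the comment above: XmlReader.Create already streams, so a pass that never accumulates nodes runs in roughly constant memory regardless of file size. A minimal sketch (counting only, no DataSet, to confirm the reader itself is not the problem):

```csharp
using System;
using System.IO;
using System.Xml;

static class StreamCount
{
    public static int CountJobs(TextReader source)
    {
        int count = 0;
        // XmlReader.Create is forward-only; it keeps only a small internal
        // buffer, so memory use stays flat regardless of input size.
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "job")
                    count++;
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountJobs(new StringReader("<feed><job/><job/><job/></feed>"))); // prints 3
    }
}
```

Replacing the StringReader with a stream over the 3 GB file would count every job without holding the document in memory; the OOM only returns once something (a DataSet, a list) retains all the parsed rows at once.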
