
I am trying to read a 3 GB XML file from a URL and store all the jobs in a DataSet. The XML looks like this:

    <?xml version="1.0"?>
    <feed total="1621473">
      <job>
        <title><![CDATA[Certified Medical Assistant]]></title>
        <date>2016-03-25 14:19:38</date>
        <referencenumber>2089677765</referencenumber>
        <url><![CDATA[http://www.jobs2careers.com/click.php?id=2089677765.1347]]></url>
        <company><![CDATA[Broadway Medical Clinic]]></company>
        <city>Portland</city>
        <state>OR</state>
        <zip>97213</zip>
     </job>
     <job>
        <title><![CDATA[Certified Medical Assistant]]></title>
        <date>2016-03-25 14:19:38</date>
        <referencenumber>2089677765</referencenumber>
        <url><![CDATA[http://www.jobs2careers.com/click.php?id=2089677765.1347]]></url>
        <company><![CDATA[Broadway Medical Clinic]]></company>
        <city>Portland</city>
        <state>OR</state>
        <zip>97213</zip>
     </job>
    </feed>

This is my code:

    XmlDocument doc = new XmlDocument();
    doc.Load(url);
    DataSet ds = new DataSet();
    XmlNodeReader xmlReader = new XmlNodeReader(doc);

    while (xmlReader.ReadToFollowing("job"))
    {
        ds.ReadXml(xmlReader);
    }

But I got an OutOfMemoryException. I searched on Google and found this:

    DataSet ds = new DataSet();
    FileStream filestream = File.OpenRead(url);
    BufferedStream buffered = new BufferedStream(filestream);
    ds.ReadXml(buffered);

Still the same exception. I also read about XmlTextReader, but I don't know how to make use of it in my case. I know why I am getting the exception; I just don't know how to overcome it. Thanks.

  • What are the exception details? I suspect it could be the XmlDocument that is throwing the OutOfMemoryException. The reason: I put together some code to generate a large XML file, and the XmlDocument object I build throws before I can even generate enough data. Maybe related to the internal collection of nodes ({System.Collections.ListDictionaryInternal.NodeKeyValueCollection}). Commented Mar 29, 2016 at 22:08
  • What output do you want? I don't understand "sore all the jobs". Commented Mar 29, 2016 at 22:31
  • @MichaelKay: My bad, edited. I want to store all the jobs in a DataSet so later I can store them all in a database table. Commented Mar 30, 2016 at 0:10
  • @Stringfellow: calling the Load method on an XmlDocument instance tries to load the whole file at once. The file is 3 GB, so the exception happens. Commented Mar 30, 2016 at 0:13

2 Answers


Instead of trying to load the entire file into the DataSet or another container, how about loading it in batches and writing each batch to the database, so whatever holds the batch can be cleared each time?

How to: Perform Streaming Transform of Large XML Documents https://msdn.microsoft.com/en-us/library/bb387013.aspx

        List<XElement> jobs = new List<XElement>();
        using (XmlReader reader = XmlReader.Create(filePath))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if ((reader.NodeType == XmlNodeType.Element) && (reader.Name == "job"))
                {
                    // ReadFrom advances the reader past the </job> end tag,
                    // so do not call Read() again here or a sibling may be skipped.
                    XElement job = (XElement)XNode.ReadFrom(reader);
                    jobs.Add(job);

                    if (jobs.Count >= 1000)
                    {
                        // TODO: write batch to database
                        jobs.Clear();
                    }
                }
                else
                {
                    reader.Read();
                }
            }

            if (jobs.Count > 0)
            {
                // TODO: write remainder to database
                jobs.Clear();
            }
        }
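For the batch-write TODOs above, something like SqlBulkCopy could work. A minimal sketch, assuming SQL Server, a hypothetical `dbo.Jobs` table whose columns match the DataTable built here, and a valid connection string (table and column names are illustrative only):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Xml.Linq;

static class JobWriter
{
    public static void WriteBatch(List<XElement> jobs, string connectionString)
    {
        // Flatten the buffered <job> elements into a DataTable for bulk copy.
        var table = new DataTable();
        table.Columns.Add("Title", typeof(string));
        table.Columns.Add("ReferenceNumber", typeof(string));
        table.Columns.Add("Company", typeof(string));
        table.Columns.Add("City", typeof(string));
        table.Columns.Add("State", typeof(string));

        foreach (XElement job in jobs)
        {
            table.Rows.Add(
                (string)job.Element("title"),
                (string)job.Element("referencenumber"),
                (string)job.Element("company"),
                (string)job.Element("city"),
                (string)job.Element("state"));
        }

        // Bulk insert is far faster than row-by-row INSERT for 1000-row batches.
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.Jobs"; // hypothetical table name
            bulk.WriteToServer(table);
        }
    }
}
```

You would call `JobWriter.WriteBatch(jobs, connectionString)` at each TODO before `jobs.Clear()`.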

An alternative using a DataSet:

        DataSet ds = new DataSet();
        using (XmlReader reader = XmlReader.Create(filePath))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if ((reader.NodeType == XmlNodeType.Element) && (reader.Name == "job"))
                {
                    ds.ReadXml(reader);

                    DataTable dt = ds.Tables["job"];
                    if (dt.Rows.Count >= 1000)
                    {
                        // TODO: write batch to database
                        dt.Rows.Clear();
                    }
                }
            }

            if (ds.Tables["job"].Rows.Count > 0)
            {
                // TODO: write remainder to database
                ds.Tables["job"].Rows.Clear();
            }
        }

3 Comments

Thank you for your time. Using this code, how do I populate my DataSet?
I added an alternative. Is that what you meant about loading a DataSet? I don't know if you can load the entire 3 GB file into a DataSet without encountering the memory problem. Also, by batching you can enable a 'resume' scenario in case processing fails part way.
The DataSet gets populated with 2 rows, and after that the first if statement becomes false. Any idea why? Still working on it. Your solution sounds solid; I will let you know.
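The behavior in the last comment is consistent with DataSet.ReadXml consuming the reader past the next `<job>` start tag, so the element check fails on later passes. One hedged workaround (a sketch, not from the original answer) is to hand ReadXml an isolated subtree so the outer reader's position stays predictable:

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

static class SubtreeDemo
{
    public static int LoadJobs(string xml)
    {
        var ds = new DataSet();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "job")
                {
                    // ReadSubtree hands ReadXml only the current <job> element;
                    // when the subtree reader is disposed, the outer reader is
                    // left on </job>, so the loop's Read() moves to the next
                    // sibling instead of skipping it.
                    using (XmlReader subtree = reader.ReadSubtree())
                    {
                        ds.ReadXml(subtree);
                    }
                }
            }
        }
        return ds.Tables["job"].Rows.Count;
    }

    static void Main()
    {
        string xml = "<feed><job><title>A</title></job><job><title>B</title></job></feed>";
        Console.WriteLine(LoadJobs(xml));
    }
}
```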

doc.Load() is going to read the entire file and fail with that error, and XmlNodeReader will not really do anything for you here. Try this:

using System;
using System.Data;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        const string url = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            int count = 0;
            DataSet ds = new DataSet();
            XmlReader xmlReader = XmlReader.Create(url);
            xmlReader.MoveToContent();
            try
            {
                while (!xmlReader.EOF)
                {
                    xmlReader.ReadToFollowing("job");
                    if (!xmlReader.EOF)
                    {
                        count++;
                        ds.ReadXml(xmlReader);
                    }
                }
            }
            catch (Exception ex)
            {
                // Report how many job elements were read before the failure.
                Console.WriteLine(ex.Message);
                Console.WriteLine("Count : {0}", count);
                Console.ReadLine();
            }
        }
    }
}

4 Comments

I still get System.OutOfMemoryException on ds.ReadXml()
I updated the code to remove some typos. Not sure if it will fix the issue. Do you know how many job elements are read before the exception?
Thank you for your time. No, still the same exception. I tried to debug it, but it doesn't let me know how many rows are read. I guess there must be a way to either break the XML file into chunks and read them one by one, or read the file through a buffer so the whole file doesn't get loaded at once. I just don't know how to achieve it.
Add an exception handler to get the count. You may just be using more memory than your computer has.
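On the buffering point in the comment above: XmlReader.Create already streams, so a pass that never accumulates nodes runs in roughly constant memory regardless of file size. A minimal sketch (counting only, no DataSet, to confirm the reader itself is not the problem):

```csharp
using System;
using System.IO;
using System.Xml;

static class StreamCount
{
    public static int CountJobs(TextReader source)
    {
        int count = 0;
        // XmlReader.Create is forward-only; it keeps only a small internal
        // buffer, so memory use stays flat regardless of input size.
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "job")
                    count++;
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountJobs(new StringReader("<feed><job/><job/><job/></feed>"))); // prints 3
    }
}
```

Replacing the StringReader with a stream over the 3 GB file would count every job without holding the document in memory; the OOM only returns once something (a DataSet, a list) retains all the parsed rows at once.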
