
I'm trying to use sax js to process an XML file in chunks: accumulate items into a fixed-size array, await a DB update, then reset the array.

The problem is that it doesn't work asynchronously out of the box, and I haven't yet found a way to update the DB without losing data.

I'm using the LTS version of sax js.

const fs = require("fs");
const zlib = require("zlib");
const sax = require("sax");

const gunzip = zlib.createGunzip();
const xmlStream = fs.createReadStream(path).pipe(gunzip); // path points at the gzipped XML file
const saxStream = sax.createStream(true);
const BATCH_SIZE = 100;

let documents = [];
let currentElement = {};
let currentNode = null;
let isProcessing = false;


saxStream.on("opentag", function (node) {
    currentNode = node.name;
    if (node.name === "Item") {
      currentElement = {}; // Initialize a new empty object for each item
    }
  });

saxStream.on("text", (text) => {
    if (currentElement) {
      doSomthing(text)
    }
  });

saxStream.on("closetag", async function (name) {
    if (name === "Item" && currentElement && documents.length < BATCH_SIZE) {
      documents.push(currentElement );
      currentElement = {}; // Reset for the next item

      if (documents.length === BATCH_SIZE && !isProcessing) {
        isProcessing = true;
        console.log("1. Start process batch of size", documents.length);
        await insertDocuments(documents);
        console.log("1. End process batch of size", documents.length);
        documents = []; // Reset the documents array after processing
        isProcessing = false;
      }
    }

    currentNode = null; // Reset the current node name
  });


xmlStream.pipe(saxStream);

So, of course, it doesn't await insertDocuments and just keeps parsing.
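For context: the sax stream is a plain EventEmitter, and an emitter never waits for the promise an async listener returns. A minimal sketch of that behavior, independent of sax:

const { EventEmitter } = require("events");

const em = new EventEmitter();
em.on("tick", async () => {
  await new Promise((resolve) => setTimeout(resolve, 100));
  console.log("listener finished");
});

em.emit("tick");              // emit() returns immediately...
console.log("emit returned"); // ...so this prints before "listener finished"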

So I tried combining this with pausing the stream, calling xmlStream.pause() and xmlStream.resume() before and after the async call:

if (documents.length === BATCH_SIZE && !isProcessing) {
  isProcessing = true;
  xmlStream.pause();
  console.log("1. Start process batch of size", documents.length);
  await insertDocuments(documents);
  console.log("1. End process batch of size", documents.length);
  documents = []; // Reset the documents array after processing
  isProcessing = false;
  xmlStream.resume();
}

It's not a good solution either: we pause the xmlStream but not the sax stream, which keeps emitting events for data it has already buffered, so we still lose items.

I also tried using this._parser.close() and this._parser.resume(), but couldn't get a working solution.
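One way out, sketched below: skip sax's stream wrapper and feed chunks to the low-level parser from inside a Writable. This assumes insertDocuments(batch) returns a promise, and doSomething is the placeholder from the question. Because parser.write() fires all events for a chunk synchronously, every parsed <Item> is already batched by the time write() returns, so the Writable can hold back its callback until the DB inserts finish; that gives real backpressure with no dropped items.

const { Writable } = require("stream");

const parser = sax.parser(true); // low-level sax parser, no stream wrapper
let documents = [];
let currentElement = null;

parser.onopentag = (node) => {
  if (node.name === "Item") currentElement = {};
};
parser.ontext = (text) => {
  if (currentElement) doSomething(text);
};
parser.onclosetag = (name) => {
  if (name === "Item" && currentElement) {
    documents.push(currentElement);
    currentElement = null;
  }
};

const batcher = new Writable({
  async write(chunk, _encoding, callback) {
    try {
      parser.write(chunk.toString()); // parses synchronously, fills `documents`
      // One chunk can contain more than one batch, so drain in slices.
      while (documents.length >= BATCH_SIZE) {
        await insertDocuments(documents.splice(0, BATCH_SIZE));
      }
      callback(); // only now accept the next chunk
    } catch (err) {
      callback(err);
    }
  },
  async final(callback) {
    try {
      parser.close();
      if (documents.length) await insertDocuments(documents); // trailing partial batch
      callback();
    } catch (err) {
      callback(err);
    }
  },
});

fs.createReadStream(path).pipe(zlib.createGunzip()).pipe(batcher);

Two details matter here: the flush in final() handles the last, smaller batch, and there is no documents.length < BATCH_SIZE guard, so no item can be silently discarded while a batch is in flight.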

The log output of this code is:

  1. Start process batch of size 100
  processed: 100
  2. End process batch of size 100
  3. start process batch of size 17
  processed: 17
  4. End process batch of size 17
  XML parsing completed.

** In the test I used a small file with 133 items in total; as the logs show, only 100 + 17 = 117 items were processed, so 16 items were lost.

  • It might be useful to identify exactly which XML parser (and version) you are using; I believe there are several. Commented Aug 28, 2024 at 8:36
  • I am using the sax js LTS version, but I'm open to switching to another library that is well maintained and parses quickly. (Already updated in the body of the question.) Commented Aug 28, 2024 at 9:41
  • Why are you reading in chunks? Is it due to running out of memory, or are you trying to process while the data is being received? You can convert the following C# to PowerShell: stackoverflow.com/questions/2263852/… Commented Aug 28, 2024 at 11:46
  • @jdweng I'm processing in chunks in order to reduce memory usage. Commented Aug 28, 2024 at 20:21

1 Answer


Use PowerShell with XmlReader to reduce memory. See the sample code below:

using assembly System.Xml.Linq

$xmlFilename = 'c:\temp\test.xml'
$csvFilename = 'c:\temp\test.csv'

$writer = [System.IO.StreamWriter]::new($csvFilename)
$writer.WriteLine('time,linkID,CO,CO2Total')

$reader = [System.Xml.XmlReader]::Create($xmlFilename)
while (-not $reader.EOF)
{
    if ($reader.Name -ne 'event')
    {
        $reader.ReadToFollowing('event') | Out-Null
    }
    if (-not $reader.EOF)
    {
        # Materialize only the current <event> element, never the whole document
        $event = [System.Xml.Linq.XElement][System.Xml.Linq.XElement]::ReadFrom($reader)
        $type = $event.Attribute('type').Value
        if ($type -eq 'warmEmissionEvent')
        {
            $time = $event.Attribute('time').Value
            $linkId = $event.Attribute('linkId').Value
            $CO = $event.Attribute('CO').Value
            $CO2Total = $event.Attribute('CO2_TOTAL').Value
            $writer.WriteLine([string]::Join(',', @($time, $linkId, $CO, $CO2Total)))
        }
    }
}
$writer.Flush()
$writer.Close()

2 Comments

  • I need it in Node.js.
  • You can run PowerShell from inside JS: stackoverflow.com/questions/55002794/…
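For completeness, a minimal sketch of that suggestion, assuming the answer's script is saved as convert.ps1 (a hypothetical filename):

const { execFile } = require("child_process");

execFile(
  "powershell.exe",
  ["-NoProfile", "-ExecutionPolicy", "Bypass", "-File", "convert.ps1"],
  (err, stdout, stderr) => {
    if (err) {
      console.error(stderr || err);
      return;
    }
    console.log(stdout); // whatever the script writes to stdout
  }
);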
