
Parse large XML files in Node

Parsing large XML files (more than 500 MB) can be very tedious in Node.js. Many parsers cannot handle XML files of this size and fail with this error:

FATAL ERROR JS Allocation failed – process out of memory

The SAX XML parser does handle large XML files, but because of the complexity of wiring up events to capture a specific XML node and its data, we do not recommend that package either.

The code in this post was tested on Ubuntu 14.04. Because of some dependency issues, it may not run on Windows.

What’s our requirement?

We wanted an XML parser that can parse large XML files (ours is 635 megabytes) and either convert the content to JSON for further use, or let us extract only the data we want and traverse it easily.

xml-stream Parser:

After testing almost every highly reputed parser (reputation in terms of daily downloads), we found this parser, which works exactly the way we needed.

Install it in your project directory with the following command. (A global install with -g is not visible to require() by default.)

npm install xml-stream

How to use xml-stream:

xml-stream is simple and fast. To use it, require it in your project and pass a ReadStream object to its constructor, as shown here.

var fs        = require('fs');
var XmlStream = require('xml-stream');

// Pass the ReadStream object to xml-stream.
var stream = fs.createReadStream('file_name.xml');
var xml    = new XmlStream(stream);

// Further code goes here.

How it works:

xml-stream parses the XML content and emits each element as a JavaScript object. Here is an example.

Input XML

<item id="123" type="common">
  <title>Item Title</title>
  <description>Description of this item.</description>
  (text)
</item>

Parser Output:

{
  title: 'Item Title',
  description: 'Description of this item.',
  '$': {
    'id': '123',
    'type': 'common'
  },
  '$name': 'item',
  '$text': '(text)'
}
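
As a minimal sketch of how such an object might be consumed, the snippet below hard-codes the output shown above and reads its parts: attributes sit under '$', the element name under '$name', the element's own text under '$text', and child elements are ordinary properties. The summarize helper is a hypothetical name used for illustration only; it is not part of xml-stream.

```javascript
// This object literal mirrors the parser output shown above.
var item = {
  title: 'Item Title',
  description: 'Description of this item.',
  '$': { id: '123', type: 'common' },
  '$name': 'item',
  '$text': '(text)'
};

// Hypothetical helper (not part of xml-stream): flattens one element object.
function summarize(el) {
  var children = Object.keys(el).filter(function (key) {
    return key.charAt(0) !== '$'; // keys starting with '$' are metadata
  });
  return {
    name: el['$name'],
    attributes: el['$'] || {},
    text: el['$text'] || '',
    children: children
  };
}

var summary = summarize(item);
console.log(summary.name, summary.attributes.id, summary.children);
```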

Extract a specific XML node:

Here comes the interesting part. Suppose you have a large XML file, like I do, and you want to extract only the information enclosed in a specific XML node. With xml-stream you listen for the endElement event with a selector that names the node, and the preserve and collect methods control how the node's children are represented. See the example.

XML file Content

<?xml version="1.0" encoding="UTF-8"?>
<media mediaId="value" lastModified="date" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="270" height="360" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
<media mediaId="p10000020_b_v4_aa" lastModified="2013-06-14T00:00:00Z" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="540" height="720" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
</xml>

Now I want to extract only the <id> elements and print them. Here is the code to do so.

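var fs        = require('fs');
var XmlStream = require('xml-stream');

var stream = fs.createReadStream('tvbanners.xml');
var xml    = new XmlStream(stream);

// Keep each <id> element's child nodes in '$children'; the second argument asks to keep whitespace too.
xml.preserve('id', true);
// Collect repeated <subitem> children into an array. The sample above has no
// <subitem> elements, so this call has no visible effect here.
xml.collect('subitem');

// Called once for every closing </id> tag, with the parsed element.
xml.on('endElement: id', function (item) {
  console.log(item);
});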

Parser Output:

I ran the script and redirected its output to a text file like this:

node server.js > output.txt

Here is my output text file.

{ '$children': [ '10000020' ],
  '$': { type: 'rootId' },
  '$text': '10000020',
  '$name': 'id' }
{ '$children': [ '10000020' ],
  '$': { type: 'seriesId' },
  '$text': '10000020',
  '$name': 'id' }
{ '$children': [ 'SH017461480000' ],
  '$': { type: 'TMSId' },
  '$text': 'SH017461480000',
  '$name': 'id' }
   .
   .
   ....more content
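
Note that console.log prints objects in Node's inspection format, which is not valid JSON. If the goal is JSON for further use, one option is to serialize each object with JSON.stringify, one object per line. The sketch below assumes only the type attribute and the text are of interest; toJsonLine is a hypothetical helper name, and the sample object is copied from the output above.

```javascript
// Sample element copied from the output file above.
var sample = {
  '$children': ['10000020'],
  '$': { type: 'rootId' },
  '$text': '10000020',
  '$name': 'id'
};

// Hypothetical helper: keeps only the fields of interest and serializes them.
function toJsonLine(el) {
  return JSON.stringify({ type: el['$'].type, value: el['$text'] });
}

var line = toJsonLine(sample);
console.log(line); // prints {"type":"rootId","value":"10000020"}
```

Inside the endElement handler, console.log(toJsonLine(item)) could take the place of console.log(item).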

If you want to print only a specific part of the XML node, you can use

console.log(item['$text']);

Or

console.log(item['$']['type']);

to reach into the nested object.
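
Building on those two accessors, here is a small sketch that groups the id values by their type attribute. The items array is hard-coded from the output file above, and idsByType is a hypothetical helper name; in a real run the objects would come from the endElement handler instead.

```javascript
// Sample objects copied from the output file shown earlier.
var items = [
  { '$children': ['10000020'], '$': { type: 'rootId' }, '$text': '10000020', '$name': 'id' },
  { '$children': ['10000020'], '$': { type: 'seriesId' }, '$text': '10000020', '$name': 'id' },
  { '$children': ['SH017461480000'], '$': { type: 'TMSId' }, '$text': 'SH017461480000', '$name': 'id' }
];

// Hypothetical helper: maps each id's type attribute to its text content.
function idsByType(elements) {
  var result = {};
  elements.forEach(function (el) {
    result[el['$']['type']] = el['$text'];
  });
  return result;
}

var ids = idsByType(items);
console.log(ids.TMSId); // prints SH017461480000
```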

That's it for now. If you have any questions, ask in the comments.