Parse large XML files in Node

Parsing large XML files (more than 500 MB) can be very tedious in Node.js. Many parsers cannot handle files of this size and fail with this error:

FATAL ERROR: JS Allocation failed - process out of memory

The SAX XML parser handles large XML files, but capturing specific XML nodes and their data through its events is complex, so we do not recommend that package either.

This code was tested on Ubuntu 14.04. Because of dependency issues, it may not run on Windows.

What's our requirement?

We wanted an XML parser that can parse large XML files (ours is 635 MB) and either convert them into JSON for further use or simply let us extract only the data we want and traverse it easily.

xml-stream Parser:

After testing almost every well-regarded parser (judged by daily downloads), we found this parser, which works exactly the way we needed.

Install it in your project using the following command. A local install is needed so that require('xml-stream') can find the package.

npm install xml-stream

How to use xml-stream:

xml-stream is simple and fast. To use it, require it in your project and pass it a ReadStream object, as shown here.

var fs        = require('fs');
var XmlStream = require('xml-stream');

// Pass the ReadStream object to xml-stream.
var stream = fs.createReadStream('file_name.xml');
var xml = new XmlStream(stream);

// Further code.

How it works

xml-stream parses the XML content and emits each element as a JavaScript object. Here is an example.

Input XML

<item id="123" type="common">
  <title>Item Title</title>
  <description>Description of this item.</description>
  (text)
</item>

Parser Output:

{
  title: 'Item Title',
  description: 'Description of this item.',
  '$': {
    'id': '123',
    'type': 'common'
  },
  '$name': 'item',
  '$text': '(text)'
}
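To make that shape concrete, here is a minimal sketch that reads fields from an object like the one above. The object literal is copied from the sample output; the helper name summarizeItem is hypothetical and is not part of xml-stream. It assumes, as in the sample, that keys starting with "$" carry the element's own data and every other key is a child element.

```javascript
'use strict';

// Sample object copied from the parser output above.
const sampleItem = {
  title: 'Item Title',
  description: 'Description of this item.',
  '$': { id: '123', type: 'common' },
  '$name': 'item',
  '$text': '(text)'
};

// Hypothetical helper (not part of xml-stream): splits one parsed element
// into its name, attributes, text and child fields.
function summarizeItem(parsed) {
  const children = {};
  for (const key of Object.keys(parsed)) {
    // In the sample output, "$"-prefixed keys describe the element itself;
    // all other keys are child elements.
    if (!key.startsWith('$')) {
      children[key] = parsed[key];
    }
  }
  return {
    name: parsed['$name'],
    attributes: parsed['$'] || {},
    text: parsed['$text'] || '',
    children: children
  };
}

console.log(JSON.stringify(summarizeItem(sampleItem), null, 2));
```

Separating the "$" keys from the child keys this way makes it easier to pass the data on as plain JSON.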

Extract specific xml node:

Here comes the interesting part. Suppose you have a large XML file like mine and want to extract only the information enclosed in a specific XML node. xml-stream provides the 'preserve' and 'collect' functions for this. See the example.

XML file Content

<?xml version="1.0" encoding="UTF-8"?>
<media mediaId="value" lastModified="date" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="270" height="360" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
<media mediaId="p10000020_b_v4_aa" lastModified="2013-06-14T00:00:00Z" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="540" height="720" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
</xml>

Now I want to extract only the values of <id> and print them. Here is the code to do so.

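var fs        = require('fs');
var XmlStream = require('xml-stream');

var stream = fs.createReadStream('tvbanners.xml');
var xml = new XmlStream(stream);

// Preserve the content of each <id> element (shown as $children in the output).
xml.preserve('id', true);
// Collect repeated <subitem> children into an array (this sample file has none).
xml.collect('subitem');
// Runs each time a closing </id> tag is reached.
xml.on('endElement: id', function(item) {
  console.log(item);
});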

Parser Output:

I ran the script and redirected its output to a text file like this:

node server.js > output.txt

Here is my output text file.

{ '$children': [ '10000020' ],
  '$': { type: 'rootId' },
  '$text': '10000020',
  '$name': 'id' }
{ '$children': [ '10000020' ],
  '$': { type: 'seriesId' },
  '$text': '10000020',
  '$name': 'id' }
{ '$children': [ 'SH017461480000' ],
  '$': { type: 'TMSId' },
  '$text': 'SH017461480000',
  '$name': 'id' }
   .
   .
   ....more content
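The output above is the format console.log uses for objects, which is not valid JSON. If the goal is JSON for further use, one option is to serialize each item with JSON.stringify and write one object per line. This is only a sketch under that assumption: the sample items are copied from the output above, and toJsonLine is a hypothetical helper, not part of xml-stream.

```javascript
'use strict';

// Two items copied from the output above.
const sampleIds = [
  { '$children': ['10000020'], '$': { type: 'rootId' }, '$text': '10000020', '$name': 'id' },
  { '$children': ['SH017461480000'], '$': { type: 'TMSId' }, '$text': 'SH017461480000', '$name': 'id' }
];

// Hypothetical helper: turns one parsed element into a single line of JSON.
function toJsonLine(parsedElement) {
  return JSON.stringify(parsedElement) + '\n';
}

// Write one JSON document per line so the file can be read back line by line.
for (const parsedElement of sampleIds) {
  process.stdout.write(toJsonLine(parsedElement));
}
```

Inside the 'endElement: id' handler, calling process.stdout.write(toJsonLine(item)) in place of console.log(item) would produce a file in which every line can be read back with JSON.parse.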

If you want to print only part of an XML node, you can print its text using

console.log(item['$text']);

or one of its attributes using

console.log(item['$']['type']);

which reaches into the nested object that holds the attributes.
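Putting those two lookups together, here is a small sketch that builds a map from each <id> element's type attribute to its text. The input array mirrors the items in the output file shown earlier; idsByType is a hypothetical helper and not part of the xml-stream API.

```javascript
'use strict';

// Items shaped like the ones in the output file shown earlier.
const parsedIds = [
  { '$': { type: 'rootId' }, '$text': '10000020', '$name': 'id' },
  { '$': { type: 'seriesId' }, '$text': '10000020', '$name': 'id' },
  { '$': { type: 'TMSId' }, '$text': 'SH017461480000', '$name': 'id' }
];

// Hypothetical helper: maps the "type" attribute of each element to its text.
function idsByType(elements) {
  const result = {};
  for (const element of elements) {
    // Attributes live under the '$' key; fall back to {} if there are none.
    const attributes = element['$'] || {};
    if (attributes.type !== undefined) {
      result[attributes.type] = element['$text'];
    }
  }
  return result;
}

console.log(idsByType(parsedIds));
```

Elements without a type attribute are skipped, so the sketch does not fail on nodes that carry no attributes.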

That is it for now. If you have any questions, ask them in the comments.

Shahid (UnixRoot) Shaikh

Hey there, this is Shahid, an engineer and blogger from Bombay. I am also an author, and I wrote a programming book on Sails.js, an MVC framework for Node.js.

16 Comments

  1. Thank you for the article. I have been working to be able to handle the Facebook public feeds (their version of firehose) and I am able to process about 2100 records/s (80k lines of xml). Is this what I should expect from the library or are there ways to make it even faster? I am using native http module to reduce as much overhead as possible.

    1. Hi,

      I think that is a good speed, and you can measure how it changes as the file size increases. I used to parse a 600 MB file with lots of XML data (I didn't count the lines) and the speed was good.

      Basically, XML parsing here relies on a SAX parser, which is relatively slow compared to others but also more consistent.

      Hope it helps.

      -Shahid.

  2. Hi,

    Is there a way to collect and preserve everything within the selected nodes? Say I have 3 children inside my selected node, and let's say x, y, z are the children. Currently, with your explanation, in order to collect all three children I have to write something like
    xml.collect('x');
    xml.collect('y');
    xml.collect('z');

    Instead of the above approach is there a way to tell that all the nodes should be collected?
    I don’t know how many children my selected node contains.

  3. great post,

    if I want to render external xml file from another website, and then access the xml from my own server, is it possible?

    1. Yes.

      You need to use the HTTP Library or request npm module to download the file and then render it using the code shown in this tutorial.

  4. Hi Shahid,
    Thank you for great xml parser. How do I extract all the nested ports in this example (should end up as a json array):

    1
    11
    111

    2
    22
    222

    When I use code:
    var stream = fs.createReadStream(xml_filename);
    var xml = new XmlStream(stream);

    //xml.preserve('port', true);
    //xml.collect('subitem');
    xml.on('endElement: host', function(item) {
      console.log(item);
    });

    I only get last of 3 available ports stored to JSON:

    { port: { '$': { type: 'TMSId' }, '$text': '111' } }
    { port: { '$': { type: 'TMSId' }, '$text': '222' } }

    so the values '1', '11' as well as '2', '22' are lost :(

    thanks,
    Dmitry

  5. Shahid, sorry my XML didn’t go through properly.. Trying again:

    <media mediaId="value" lastModified="date" action="add">
      <host>
        <port type="rootId">1</port>
        <port type="seriesId">11</port>
        <port type="TMSId">111</port>
      </host>
      <host>
        <port type="rootId">2</port>
        <port type="seriesId">22</port>
        <port type="TMSId">222</port>
      </host>
    </media>

  6. ah, ignore me! Solved using xml.collect('port'); and now the output does not lose any of the ports:
    { port:
    [ { '$': [Object], '$text': '1' },
    { '$': [Object], '$text': '11' },
    { '$': [Object], '$text': '111' } ] }
    { port:
    [ { '$': [Object], '$text': '2' },
    { '$': [Object], '$text': '22' },
    { '$': [Object], '$text': '222' } ] }

    thanks!

  7. Hi, this tutorial works like a charm for a large file but now I need to deal with the namespace. Can you help me with this?
