Parsing large XML files (more than 500 MB) can be very tedious in Node.js. Many parsers out there do not handle large XML files and throw this error:
FATAL ERROR JS Allocation failed – process out of memory
A SAX XML parser handles large XML files, but due to the complexity of handling its events to capture a specific XML node and its data, we do not recommend that package either.
This code was tested on Ubuntu 14.04. Due to some dependency issues it may not run on Windows.
What’s our Requirement?
We wanted an XML parser that parses large XML files (ours is 635 MB) and either lets us convert them into JSON for further use, or simply lets us extract only the data we want and traverse it easily.
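The key point is memory: reading a huge file in one go loads it all into RAM, while a stream hands it over in small chunks, so memory stays flat regardless of file size. A minimal stdlib-only sketch (the file name and sample content are hypothetical, just to make it self-contained):

var fs = require('fs');

// Write a small sample file so the sketch runs on its own.
// In practice this would be your multi-hundred-megabyte XML file.
fs.writeFileSync('sample.xml',
  '<root>' + new Array(1000).join('<item>x</item>') + '</root>');

var chunks = 0;
var bytes = 0;

// createReadStream reads the file chunk by chunk (64 KB by default),
// so only one chunk needs to be in memory at a time.
fs.createReadStream('sample.xml')
  .on('data', function (chunk) {
    chunks += 1;
    bytes += chunk.length;
  })
  .on('end', function () {
    console.log('read ' + bytes + ' bytes in ' + chunks + ' chunk(s)');
  });

A streaming XML parser builds on exactly this: it consumes such chunks and emits elements as it finds them, instead of waiting for the whole document.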
xml-stream Parser:
After testing almost every well-reputed parser (reputation in terms of daily downloads), we found this awesome parser, which works exactly the way we need.
Install it from npm using the following command: `npm install xml-stream`
How to use xml-stream:
xml-stream is simple and fast. To use it, require it in your project and pass a ReadStream object to initialize it. See how to initialize it:
var fs = require('fs');
var XmlStream = require('xml-stream');

/*
 * Pass the ReadStream object to xml-stream.
 */
var stream = fs.createReadStream('file_name.xml');
var xml = new XmlStream(stream);
/*
 * Further code.
 */
How it works
xml-stream parses the XML content and emits each element as a JavaScript object. Here is an example.
Input XML
<item id="123" type="common">
<title>Item Title</title>
<description>Description of this item.</description>
(text)
</item>
Parser Output:
{
title: 'Item Title',
description: 'Description of this item.',
'$': {
'id': '123',
'type': 'common'
},
'$name': 'item',
'$text': '(text)'
}
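Once you have an element object of that shape, reading it is plain JavaScript: child elements become properties, attributes sit under the `$` key, the element name under `$name`, and trailing text under `$text`. A small sketch using a hand-built object with the same shape as the output above:

// An object with the same shape xml-stream emits for the <item> above.
var item = {
  title: 'Item Title',
  description: 'Description of this item.',
  '$': { id: '123', type: 'common' },
  '$name': 'item',
  '$text': '(text)'
};

console.log(item.title);   // → Item Title
console.log(item.$.id);    // → 123
console.log(item.$.type);  // → common
console.log(item.$text);   // → (text)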
Extract a specific XML node:
Here comes the interesting part: suppose you have a large XML file, like I do, and you want to extract only the information enclosed in a specific XML node. xml-stream provides the ‘preserve’ and ‘collect’ functions to do so. See the example.
XML file Content
<xml>
<media mediaId="value" lastModified="date" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="270" height="360" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
<media mediaId="p10000020_b_v4_aa" lastModified="2013-06-14T00:00:00Z" action="add">
<title size="140" type="full" lang="en">Some title</title>
<ids>
<id type="rootId">10000020</id>
<id type="seriesId">10000020</id>
<id type="TMSId">SH017461480000</id>
</ids>
<image type="image/jpg" width="540" height="720" primary="true" category="Banner">
<URI>Some URL</URI>
<caption lang="en">Some title</caption>
</image>
</media>
</xml>
Now I want to extract only the values of <id> and print them. Here is the code to do so.
var fs = require('fs');
var XmlStream = require('xml-stream');

var stream = fs.createReadStream('tvbanners.xml');
var xml = new XmlStream(stream);

// Keep <id> elements intact, including their children and text.
xml.preserve('id', true);
xml.collect('subitem');

// Fires once for each closing </id> tag.
xml.on('endElement: id', function (item) {
  console.log(item);
});
Parser Output:
I ran the script and redirected its output to a text file. Here is my output:
{ '$children': [ '10000020' ],
'$': { type: 'rootId' },
'$text': '10000020',
'$name': 'id' }
{ '$children': [ '10000020' ],
'$': { type: 'seriesId' },
'$text': '10000020',
'$name': 'id' }
{ '$children': [ 'SH017461480000' ],
'$': { type: 'TMSId' },
'$text': 'SH017461480000',
'$name': 'id' }
...more entries in the same format
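From objects of that shape it is a one-liner to build plain JSON. The sketch below hard-codes objects mirroring the parser output above (the `byType` name is my own, just for illustration):

// Objects mirroring the parser output shown above.
var ids = [
  { '$children': [ '10000020' ], '$': { type: 'rootId' }, '$text': '10000020', '$name': 'id' },
  { '$children': [ '10000020' ], '$': { type: 'seriesId' }, '$text': '10000020', '$name': 'id' },
  { '$children': [ 'SH017461480000' ], '$': { type: 'TMSId' }, '$text': 'SH017461480000', '$name': 'id' }
];

// Build a { type: value } map, dropping the $-prefixed metadata.
var byType = {};
ids.forEach(function (id) {
  byType[id.$.type] = id.$text;
});

console.log(JSON.stringify(byType));
// → {"rootId":"10000020","seriesId":"10000020","TMSId":"SH017461480000"}

In the real script you would do the same inside the `endElement: id` handler, one element at a time.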
If you want to print only a specific part of the node, you can access its fields directly, e.g. `item.$text` for the text value, or `item.$children` to go inside the nested arrays.
That’s it for now. If you have any doubts, ask in the comments.