stream bz xml file #62

leojoubert · 2016-04-28T12:52:54Z

Hi,

I need to parse Wikipedia dumps for research purposes. The dumps are very big (much more than 1 TB). These are the "page-meta-history.xml" files from http://dumps.wikimedia.org
So I can't unzip the file first. Can I work with the bz file directly in the xml-stream API?

Thank you

jbielick · 2016-05-21T22:24:55Z

You can pipe the read stream into an unzip transformer and then pass the stream to XmlStream.

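Try this:

const request = require('request');
const zlib = require('zlib');
const XmlStream = require('xml-stream');

// Note: createGunzip() handles gzip only; a .bz2 dump needs a bzip2 transform in its place.
const readStream = request('http://dumps.wikimedia.org/some-dump.xml').pipe(zlib.createGunzip());
const parser = new XmlStream(readStream);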
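
To show the piping idea in isolation, here is a minimal sketch that uses only Node's standard library: it gzips a small XML string in memory and streams it back through `zlib.createGunzip()`. The helper name `gunzipToString` and the sample XML are made up for illustration. Keep in mind that `zlib` understands gzip/deflate but not bzip2.

```javascript
const { Readable } = require('stream');
const zlib = require('zlib');

// Hypothetical helper: pipes a gzip-compressed readable stream through
// gunzip and resolves with the decompressed text.
function gunzipToString(compressedStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    const gunzip = zlib.createGunzip();
    // Errors do not propagate across .pipe(), so listen on both streams.
    compressedStream.on('error', reject);
    gunzip.on('error', reject);
    gunzip.on('data', (chunk) => chunks.push(chunk));
    gunzip.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    compressedStream.pipe(gunzip);
  });
}

// Made-up sample data standing in for a real dump.
const sampleXml = '<mediawiki><page><title>A</title></page></mediawiki>';
const compressed = Readable.from([zlib.gzipSync(sampleXml)]);

gunzipToString(compressed).then((xml) => console.log(xml));
```

Collecting the output into a string is only for the demonstration. With a real dump you would hand the decompressed stream straight to the parser so that nothing large is ever held in memory.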

cigolpl · 2016-06-08T09:45:44Z

@cafeine05 you can try https://github.com/spencermountain/wikipedia-to-mongodb, which does the whole job.

In short, decompressing bz2 in a stream works like this:

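  var fs = require('fs');
  var bz2 = require('unbzip2-stream');
  var XmlStream = require('xml-stream');

  // `file` is the path to the .bz2 dump; decompress it on the fly and hand the result to the parser.
  var stream = fs.createReadStream(file).pipe(bz2());
  var xml = new XmlStream(stream);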
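
Since this thread mentions both gzip and bzip2, one possible way to keep the rest of the pipeline identical is to pick the decompressor from the file extension. This is only a sketch: `decompressorFor` is a hypothetical helper, and it assumes the third-party `unbzip2-stream` package is installed whenever a `.bz2` file is passed in.

```javascript
const zlib = require('zlib');
const { PassThrough } = require('stream');

// Hypothetical helper: returns a transform stream suited to the file name.
function decompressorFor(fileName) {
  if (/\.bz2$/i.test(fileName)) {
    // Third-party package, loaded lazily so it is only needed for .bz2 input.
    return require('unbzip2-stream')();
  }
  if (/\.gz$/i.test(fileName)) {
    return zlib.createGunzip();
  }
  // Uncompressed input: pass the bytes through unchanged.
  return new PassThrough();
}
```

The result can then be dropped into the same chain as above, for example `fs.createReadStream(file).pipe(decompressorFor(file))`, before passing the stream to `XmlStream`.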
