0.2.2 • Published 8 years ago

training-data-compressor v0.2.2

Weekly downloads
2
License
ISC
Repository
github
Last release
8 years ago

Training Data Compressor

This module will compress/decompress documents in a folder by extracting their text and gzipping them all individually. Afterwards the folder is tarred and gzipped. The module exposes functions with which you can:

  • Compress documents (pdf, docx, etc.) by extracting the raw text of them and gzipping them.
  • Decompress the same gzipped raw text documents, so they can be used by your machine learning algorithm

Installation

npm install --save training-data-compressor

Motive

Training data can take up alot of space, especially if your data depends on Word and PDF files. If you only need the raw text from these documents to train and not the styling/markup, you can preprocess your documents. By preprocessing these files through compression and text extraction, you can save alot of space.

Formats that can be extracted and compressed

This module uses textract under the hood, and therefore can only compress files that textract supports. View this link to see which formats you can use. Also note that certain filetypes have certain requirements that need to be installed on the computer.

Compress documents in folder

Note that the following scripts are memory intensive and will take a while to finish if you work with large amounts of files. This will extract the contents of documents and compresses them into .gz files. The outputDir will be tarred and gzipped.

var TrainingDataCompressor = require('training-data-compressor');
var inputDir = './training-data';
var outputDir = './training-data-compressed';
var saveFilesAs = '.txt';
var filesToCompress = ['.docx', '.pdf'];

TrainingDataCompressor.compressFilesToPath(inputDir, outputDir, saveFilesAs, filesToCompress, function (err) {
    if (err) console.log(err);

    console.log('done compressing files, files are compressed at: ', outputDir + '.tar.gz');
});

Decompress documents in folder

The files that have been compressed, as seen in the example above, can also be decompressed.

var TrainingDataCompressor = require('training-data-compressor');
var zippedDir = './training-data-compressed.tar.gz';
var outputDir = './training-data-decompressed';

TrainingDataCompressor.decompressFilesFromPath(zippedDir, outputDir, function (err) {
    if (err) console.log(err);

    console.log('done decompressing files at: ', outputDir);
});
0.2.2

8 years ago

0.2.1

8 years ago

0.2.0

8 years ago

0.1.0

8 years ago

0.0.2

8 years ago

0.0.1

8 years ago