0.1.2 • Published 8 years ago
@datagica/read-document
Extract plain text from any kind of document. Based on textract.
Current issues
read-document is not thread safe (it relies on textract, which apparently is
not thread safe either), so you have to wait for each promise to complete before
converting another document, for instance by chaining promises like this:
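    const read = require('@datagica/read-document');

    // `files` and `anotherAsyncPromise` are placeholders for your own
    // list of inputs and your own per-document handler.
    // Each read() starts only after the previous promise has resolved.
    const sequentialPromise = files.reduce(
      (p, file) =>
        p.then(() => read({ file: file }).then(doc => anotherAsyncPromise(doc))),
      Promise.resolve(0)
    );

Prerequisites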
- PDF extraction requires pdftotext to be installed.
- DOC and RTF extraction requires catdoc to be installed, unless on OSX, in which case textutil (installed by default) is used.
- PNG, JPG and GIF extraction requires tesseract to be available. Images need to be pretty clear, high-DPI and made almost entirely of text for tesseract to extract the text accurately.
- DXF extraction requires drawingtotext to be available.
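
Since a missing external tool only shows up when a document of the matching type is converted, it can help to check for the tools listed above at startup. The following is a minimal sketch, not part of read-document: `isOnPath`, `requiredTools` and `missingTools` are hypothetical names, and the check assumes each tool is a plain executable found in one of the directories of the `PATH` environment variable.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of read-document): returns true if `command`
// is an executable file in one of the directories listed in `envPath`.
function isOnPath(command, envPath = process.env.PATH || '') {
  // On Windows, executables usually carry an extension.
  const extensions =
    process.platform === 'win32' ? ['', '.exe', '.cmd', '.bat'] : [''];
  return envPath
    .split(path.delimiter)
    .filter(Boolean)
    .some(dir =>
      extensions.some(ext => {
        const candidate = path.join(dir, command + ext);
        try {
          fs.accessSync(candidate, fs.constants.X_OK);
          return fs.statSync(candidate).isFile();
        } catch (err) {
          return false; // missing or not executable
        }
      })
    );
}

// Tools named in the prerequisites, keyed by the formats that need them.
const requiredTools = {
  pdf: 'pdftotext',
  'doc/rtf': process.platform === 'darwin' ? 'textutil' : 'catdoc',
  'png/jpg/gif': 'tesseract',
  dxf: 'drawingtotext'
};

// Returns one { format, command } entry per tool that could not be found.
function missingTools(tools = requiredTools, envPath = process.env.PATH || '') {
  return Object.entries(tools)
    .filter(([, command]) => !isOnPath(command, envPath))
    .map(([format, command]) => ({ format, command }));
}
```

Calling `missingTools()` with no arguments returns an empty array when every tool was found. Otherwise, each entry names a format and the command that would have to be installed before documents of that format can be converted.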