@datagica/read-document NPM

Extract plain text from any kind of document. Based on textract.

Current issues

read-document is not thread safe (it relies on textract, which apparently is not thread safe either), so you have to wait for each promise to complete before converting another document, for instance by chaining promises like this:

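const read = require('@datagica/read-document');

// `files` is your list of inputs; `anotherAsyncPromise` stands for your own
// asynchronous processing step. Each read() starts only after the previous
// link in the chain has resolved.
const sequentialPromise = files.reduce((p, file) =>
  p.then(() =>
    read({ file: file }).then(doc => anotherAsyncPromise(doc))
  ),
  Promise.resolve()
);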
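The same sequential pattern can also be written with async/await, which some readers may find easier to follow than the reduce chain. This is only a sketch: `readSequentially` is a hypothetical helper name, not part of this package, and `read` is passed in as a parameter on the assumption that it behaves like the function exported by this module (accepts `{ file }` and returns a promise).

```javascript
// Sketch: convert documents strictly one at a time with async/await.
// `readSequentially` is a hypothetical helper, not part of the package.
// `read` is assumed to take `{ file }` and return a promise, as in the
// example above.
async function readSequentially(files, read) {
  const docs = [];
  for (const file of files) {
    // Awaiting inside the loop ensures the previous conversion has
    // finished before the next one starts.
    docs.push(await read({ file: file }));
  }
  return docs;
}
```

It could then be called as readSequentially(files, require('@datagica/read-document')), which resolves to the array of results in input order.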

Prerequisites

PDF extraction requires pdftotext to be installed.
DOC and RTF extraction requires catdoc to be installed, except on OS X, where textutil (installed by default) is used instead.
PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high-DPI and made up almost entirely of text for tesseract to extract the text accurately.
DXF extraction requires drawingtotext to be available.
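
Before converting documents, it can help to check which of these tools are actually installed. The following is a minimal sketch, not something provided by this package: `isOnPath` is a hypothetical helper, and it assumes a Unix-like system where executables are found by scanning the directories listed in PATH.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns true if an executable called `name`
// exists in one of the directories of `pathVar` (defaults to PATH).
function isOnPath(name, pathVar = process.env.PATH || '') {
  return pathVar.split(path.delimiter).filter(Boolean).some(dir => {
    try {
      // X_OK checks that the current user may execute the file.
      fs.accessSync(path.join(dir, name), fs.constants.X_OK);
      return true;
    } catch (err) {
      return false;
    }
  });
}

// Tool names taken from the prerequisites listed above.
const tools = ['pdftotext', 'catdoc', 'tesseract', 'drawingtotext'];
const missing = tools.filter(tool => !isOnPath(tool));
console.log(missing.length ? 'Missing: ' + missing.join(', ') : 'All tools found');
```

On OS X, catdoc can be dropped from the list, since textutil is used there.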
