@the-grid/nikita

Nikita: Content extraction from documents

TODO

Fix hardcoded temporary directory, clean up after upload
Test some PDF files
Add a wrapper graph that takes an object in, enriches it, then sends it out again
Integrate with AMQP, add as worker in thegrid-apis

Later

Test how much faster the Tika Java API is at XHTML + image extraction than the CLI tools
Avoid temporary files for images+html output if/when passing to NoFlo

Setup

Configuration is passed as environment variables:

AMAZON_API_ID: Amazon S3 API identifier
AMAZON_API_TOKEN: Amazon S3 API token/secret
AMAZON_API_REGION: Amazon S3 region, e.g. 'us-west-2'
AMAZON_API_BUCKET: Amazon S3 bucket for uploaded files, e.g. 'thegrid-user-content'
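
As a rough illustration of how a worker might consume these variables, here is a minimal TypeScript sketch. The `S3Config` shape and the `loadS3Config` helper are hypothetical names introduced for this example and are not part of Nikita; only the four variable names come from the list above.

```typescript
// Names of the environment variables listed in the Setup section.
const REQUIRED_KEYS = [
  "AMAZON_API_ID",
  "AMAZON_API_TOKEN",
  "AMAZON_API_REGION",
  "AMAZON_API_BUCKET",
] as const;

// Hypothetical shape for the parsed configuration (an assumption, not Nikita's API).
interface S3Config {
  id: string;
  token: string;
  region: string;
  bucket: string;
}

// Hypothetical helper: reads the four variables from an env-like object and
// throws a single error naming every variable that is missing or empty.
function loadS3Config(env: Record<string, string | undefined>): S3Config {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    id: env["AMAZON_API_ID"] as string,
    token: env["AMAZON_API_TOKEN"] as string,
    region: env["AMAZON_API_REGION"] as string,
    bucket: env["AMAZON_API_BUCKET"] as string,
  };
}
```

In a Node process this would typically be called once at startup as `loadS3Config(process.env)`, so that a misconfigured worker fails immediately rather than at the first upload.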

Design

Separate Heroku worker, integrated into TheGrid APIs.

Inputs:

URL to an S3-backed document (Word, PDF)

Outputs:

Extracted HTML with img src referring to S3 backend
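
The output step can be pictured with a small TypeScript sketch that rewrites `img src` values to S3 URLs once the images have been uploaded. This is an assumption about how the rewrite might look, not Nikita's actual graph: `s3Url` and `rewriteImageSources` are hypothetical helpers, the map from extracted image name to S3 object key is assumed to come from the upload step, and the regular expression only handles double-quoted `src` attributes (an HTML parser would be more robust).

```typescript
// Builds a virtual-hosted-style S3 URL for an object key.
function s3Url(bucket: string, region: string, key: string): string {
  // Encode each path segment but keep the "/" separators of the key.
  const encodedKey = key
    .split("/")
    .map((part) => encodeURIComponent(part))
    .join("/");
  return `https://${bucket}.s3.${region}.amazonaws.com/${encodedKey}`;
}

// Hypothetical helper: for every <img> whose src appears in `uploaded`
// (extracted image name -> S3 object key), swap the src for the S3 URL.
// Images that were not uploaded are left untouched.
function rewriteImageSources(
  html: string,
  uploaded: Map<string, string>,
  bucket: string,
  region: string
): string {
  const imgSrc = /(<img\b[^>]*?\bsrc=")([^"]*)(")/gi;
  return html.replace(
    imgSrc,
    (match: string, before: string, src: string, after: string) => {
      const key = uploaded.get(src);
      return key === undefined
        ? match
        : `${before}${s3Url(bucket, region, key)}${after}`;
    }
  );
}
```

The bucket and region arguments would be the AMAZON_API_BUCKET and AMAZON_API_REGION values from the configuration.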

Notes:

Tika provides a full XHTML document, whereas Embed.ly gives only (and we expect)

