1.2.3 • Published 8 years ago

@the-grid/nikita v1.2.3

License
proprietary
Repository
github

Nikita: Content extraction from documents

TODO

  • Fix hardcoded temporary directory, clean up after upload
  • Test some PDF files
  • Add a wrapper graph that takes an object in, enriches it, then sends it out again
  • Integrate with AMQP, add as worker in thegrid-apis

Later

  • Test how much faster the Tika Java API is at XHTML + image extraction than the CLI tools
  • Avoid temporary files for images+html output if/when passing to NoFlo

Setup

Configuration is passed as environment variables:

AMAZON_API_ID: Amazon S3 API identifier
AMAZON_API_TOKEN: Amazon S3 API token/secret
AMAZON_API_REGION: Amazon S3 region, e.g. 'us-west-2'
AMAZON_API_BUCKET: Amazon S3 bucket for uploaded files, e.g. 'thegrid-user-content'
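
As an illustration of how these four variables might be read and validated at startup, here is a minimal sketch. It is written in Python purely for illustration and is not taken from the package itself; the names `S3Config`, `REQUIRED_VARS` and `load_s3_config` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The four variable names come from the list above; everything else here
# (class name, function name, error type) is an assumption for illustration.
REQUIRED_VARS = (
    "AMAZON_API_ID",
    "AMAZON_API_TOKEN",
    "AMAZON_API_REGION",
    "AMAZON_API_BUCKET",
)


@dataclass(frozen=True)
class S3Config:
    api_id: str
    api_token: str
    region: str
    bucket: str


def load_s3_config(env: Optional[Mapping[str, str]] = None) -> S3Config:
    """Build an S3Config from environment variables, failing early if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return S3Config(
        api_id=env["AMAZON_API_ID"],
        api_token=env["AMAZON_API_TOKEN"],
        region=env["AMAZON_API_REGION"],
        bucket=env["AMAZON_API_BUCKET"],
    )
```

Failing at startup with the full list of missing names tends to be easier to debug on a worker than a later error from the first upload attempt.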

Design

Separate Heroku worker, integrated into TheGrid APIs.

Inputs:

  • URL to an S3-backed document (Word, PDF)

Outputs:

  • Extracted HTML with img src attributes referring to the S3 backend
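
The output step described above (image sources pointing at S3) could look roughly like the following sketch. It is an illustration in Python, not the package's actual implementation: the function names, the `uploaded` mapping from original src values to S3 object keys, and the assumption that src attributes are double-quoted are all hypothetical. The URL shape is the standard virtual-hosted-style S3 address.

```python
import re
from typing import Mapping
from urllib.parse import quote

# Matches the src attribute of an <img> tag; assumes double-quoted attribute
# values, which is an assumption about the extracted markup.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)


def s3_url(bucket: str, region: str, key: str) -> str:
    """Virtual-hosted-style S3 URL for an object key."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def rewrite_img_sources(
    html: str, uploaded: Mapping[str, str], bucket: str, region: str
) -> str:
    """Replace each known img src with the S3 URL of its uploaded copy.

    `uploaded` maps the src value found in the extracted HTML to the S3 key
    the image was stored under (hypothetical bookkeeping done by the caller).
    Sources that are not in the mapping are left untouched.
    """

    def replace(match: "re.Match[str]") -> str:
        key = uploaded.get(match.group(2))
        if key is None:
            return match.group(0)
        return match.group(1) + s3_url(bucket, region, key) + match.group(3)

    return IMG_SRC.sub(replace, html)
```

A regular expression keeps the sketch short; a real worker would more likely walk the parsed XHTML tree so that unusual quoting or attribute order cannot slip through.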

Notes:

  • Tika provides a full XHTML document, whereas Embed.ly gives only (and we expect)