text-extractors

Extracts text from HTML, PDF, image, and other files.

Currently supported types:

HTML, using html-to-text
PDF, using pdfjs-dist
Image (PNG, JPEG, GIF, BMP, TIFF, ICO, SVG), using tesseract.js for OCR
More types to come
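
The mapping from file type to backing library can be pictured as a small dispatch on MIME type. The following is only a sketch of that idea, not the package's actual internals: the MIME strings, the `BACKENDS` table, and the `pickBackend` function are all hypothetical names introduced for illustration.

```javascript
// Hypothetical dispatch table: MIME type -> name of the backing library.
// The pairing follows the supported-types list; the MIME strings and the
// function name are assumptions, not the package's real internals.
const BACKENDS = new Map([
  ['text/html', 'html-to-text'],
  ['application/pdf', 'pdfjs-dist'],
]);

function pickBackend(mimeType) {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const type = mimeType.split(';')[0].trim().toLowerCase();
  if (BACKENDS.has(type)) return BACKENDS.get(type);
  // In this sketch every image/* type is routed to OCR.
  if (type.startsWith('image/')) return 'tesseract.js';
  throw new Error(`Unsupported MIME type: ${type}`);
}

console.log(pickBackend('text/html; charset=utf-8')); // html-to-text
console.log(pickBackend('image/png'));                // tesseract.js
```

Throwing on an unknown type, rather than returning an empty string, keeps "no text found" distinguishable from "type not supported".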

Installation

npm install text-extractors

Usage

CommonJS

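const { fromUrl, fromBufferWithMimeType, fromBuffer } = require('text-extractors');

// CommonJS has no top-level await, so the calls run inside an async function.
// `buffer` is a Buffer holding the file contents (for example, from fs.readFile).
async function main(buffer) {
  // fromUrl: extract text from a file at a URL
  const textFromUrl = await fromUrl('https://www.digital.go.jp/assets/contents/node/basic_page/field_ref_resources/d6cfdcdd-75e4-460c-9ec0-af4f952e03d5/20210906_meeting_promoting_01.pdf');

  // fromBufferWithMimeType: extract text from a buffer whose MIME type is known
  const textFromPng = await fromBufferWithMimeType(buffer, 'image/png');

  // fromBuffer: extract text from a buffer without supplying a MIME type
  const textFromBuffer = await fromBuffer(buffer);
}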

ES6

import { fromUrl, fromBufferWithMimeType, fromBuffer } from 'text-extractors';
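
Because fromBuffer takes no MIME type, the type presumably has to be inferred from the bytes themselves. One common way to do this is to compare the first few bytes against known file signatures. The sketch below illustrates that technique under stated assumptions: `SIGNATURES` and `sniffMimeType` are hypothetical names, the table covers only a few formats, and the package's real detection may work differently.

```javascript
// Hypothetical signature table (well-known leading bytes of each format).
// This is an illustration only; the package's real detection may differ.
const SIGNATURES = [
  { mime: 'application/pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { mime: 'image/png', bytes: [0x89, 0x50, 0x4e, 0x47] },       // "\x89PNG"
  { mime: 'image/jpeg', bytes: [0xff, 0xd8, 0xff] },
  { mime: 'image/gif', bytes: [0x47, 0x49, 0x46, 0x38] },       // "GIF8"
  { mime: 'image/bmp', bytes: [0x42, 0x4d] },                   // "BM"
];

function sniffMimeType(buffer) {
  for (const { mime, bytes } of SIGNATURES) {
    // A match requires every signature byte to equal the buffer's prefix.
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return mime;
    }
  }
  return null; // unknown: the caller would need to supply a MIME type
}

console.log(sniffMimeType(Buffer.from('%PDF-1.7')));               // application/pdf
console.log(sniffMimeType(Buffer.from([0x89, 0x50, 0x4e, 0x47]))); // image/png
console.log(sniffMimeType(Buffer.from('hello')));                  // null
```

When sniffing returns null, or when the caller already knows the type, fromBufferWithMimeType is the more direct entry point.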

Roadmap

Add support for more file types
Add support for options passed to the underlying libraries
