0.1.1 • Published 11 months ago

@arbs.io/extract-text-content v0.1.1

License
MIT
Repository
github

This npm package, @arbs.io/extract-text-content, offers a straightforward method to extract text content from various binary and text file formats. The package comes with a pre-built configuration that works out-of-the-box, requiring no additional setup. It is designed for use in Node.js environments, including Visual Studio Code extensions.


Supported MIME Types

The current version of the package supports extraction from the following MIME types:

  • PDF: application/pdf
  • DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • Markdown: text/markdown
  • CSV: text/csv
  • HTML: text/html
  • Plain Text: text/plain

Requesting Additional File Support

If you would like to request support for additional file formats, please submit an enhancement issue on the project's repository. We appreciate your feedback and contributions to improve this package for developers.

Feel free to explore the documentation for more details on how to use this package effectively in your projects. Happy coding!

Install

npm install @arbs.io/extract-text-content

If you use it with Webpack, make sure you are on the latest Webpack version and that it is configured correctly for ESM.

Usage

Node.js

Extract text from a binary file. For binary formats, the MIME type is verified using file-type.

import { extractTextFromFile } from '@arbs.io/extract-text-content'

const pdfPath = './data/microservices.pdf'
extractTextFromFile({
  filepath: pdfPath,
}).then((results) => {
  console.log(`pdf (${pdfPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})

Extract text from a text-format file, specifying the MIME type to use.

const htmlType = 'text/html'
const htmlPath = './data/microservices.htm'
extractTextFromFile({
  filepath: htmlPath,
  filetype: htmlType,
}).then((results) => {
  console.log(`html (${htmlPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})
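
Text formats have no magic number to detect, so the caller supplies the hint. One way to derive it is from the file extension. The sketch below is illustrative only: `mimeTypeForPath` and the extension table are hypothetical helpers, not part of the package, and the extension-to-type mappings are assumptions based on common conventions.

```typescript
// Hypothetical helper (not part of @arbs.io/extract-text-content):
// maps a file extension to one of the MIME types listed under "Supported MIME Types".
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  docx: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  md: 'text/markdown',
  csv: 'text/csv',
  htm: 'text/html',
  html: 'text/html',
  txt: 'text/plain',
}

function mimeTypeForPath(filepath: string): string | undefined {
  // Look only at the last path segment so dots in directory names are ignored.
  const name = filepath.split(/[\\/]/).pop() ?? ''
  const dot = name.lastIndexOf('.')
  if (dot <= 0) return undefined // no extension, or a dotfile such as ".env"
  const ext = name.slice(dot + 1).toLowerCase()
  // Own-property check so names like "constructor" do not resolve via the prototype chain.
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)
    ? MIME_BY_EXTENSION[ext]
    : undefined
}
```

The result can then be passed as the hint, for example `extractTextFromFile({ filepath: htmlPath, filetype: mimeTypeForPath(htmlPath) })`.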

API

Response type

The TextExtract object provides the following properties:

  • mimeType: The MIME type of the data sent to the function.
  • content: The raw text extracted from the file.

interface TextExtract {
  mimeType: string
  content: string
}
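
The usage examples read a fixed slice with `substring(2500, 2540)`, which yields an empty string for documents shorter than 2500 characters. A minimal sketch of a more defensive preview follows; `previewExtract` is a hypothetical helper, not part of the package, and the interface is repeated only so the snippet stands alone.

```typescript
// Repeated from the documentation so the snippet is self-contained.
interface TextExtract {
  mimeType: string
  content: string
}

// Hypothetical helper: returns up to `length` characters starting at `start`,
// falling back to the beginning of the text when the document is too short.
function previewExtract(extract: TextExtract, start = 2500, length = 40): string {
  const text = extract.content
  const from = start < text.length ? Math.max(0, start) : 0
  return text.substring(from, from + length)
}
```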

extractTextFromFile

This package also offers a convenient function, extractTextFromFile, which extracts text content from various file formats using the provided file path or URL. Below is a detailed explanation of the parameters accepted by this function:

extractTextFromFile({ filepath, filetype }): Promise<TextExtract>

Parameters

  • filepath (Required): A string representing the path or URL to the file from which you want to extract text content. This parameter must be provided for the function to locate and process the input file.

  • filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.

function extractTextFromFile({
  filepath,
  filetype,
}: {
  filepath: string
  filetype?: string
}): Promise<TextExtract>

By using these parameters with the extractTextFromFile function, you can easily extract text content from supported file formats in your projects by providing a file path or URL.
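
To illustrate what validating a hint against a magic number means, the sketch below checks the leading bytes of a buffer. It is not the package's implementation (the package relies on file-type for this), and `sniffBinaryType` and `hintMatches` are hypothetical names. The signatures used are the well-known `%PDF` header and the ZIP local-file header `PK\x03\x04`. A DOCX file is a ZIP container, so the ZIP signature alone cannot tell it apart from other ZIP-based formats.

```typescript
type BinaryKind = 'pdf' | 'zip' | 'unknown'

// True when `bytes` begins with every value in `signature`.
function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false
  return signature.every((value, index) => bytes[index] === value)
}

// Illustrative magic-number check covering only the two binary formats above.
function sniffBinaryType(bytes: Uint8Array): BinaryKind {
  if (startsWith(bytes, [0x25, 0x50, 0x44, 0x46])) return 'pdf' // "%PDF"
  if (startsWith(bytes, [0x50, 0x4b, 0x03, 0x04])) return 'zip' // "PK\x03\x04"
  return 'unknown'
}

const DOCX = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

// Hypothetical validation: does the caller's hint agree with the leading bytes?
function hintMatches(bytes: Uint8Array, filetype: string): boolean {
  const kind = sniffBinaryType(bytes)
  if (filetype === 'application/pdf') return kind === 'pdf'
  if (filetype === DOCX) return kind === 'zip' // container check only
  // Text formats have no magic number, so this sketch accepts the hint as-is.
  return true
}
```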

extractTextFromBuffer

This package offers a primary function, extractTextFromBuffer, which is used to extract text content from various file formats. Below is a detailed explanation of the parameters accepted by this function:

extractTextFromBuffer({ bufferArray, filetype }): Promise<TextExtract>

Parameters

  • bufferArray (Required): A Uint8Array representation of the data blob. This parameter must be provided for the function to process and extract text content from the input file.

  • filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.

function extractTextFromBuffer({
  bufferArray,
  filetype,
}: {
  bufferArray: Uint8Array
  filetype?: string
}): Promise<TextExtract>

By using these parameters with the extractTextFromBuffer function, you can easily extract text content from supported file formats in your projects.
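
A minimal sketch of calling the function with in-memory data, for example bytes received over the network. To keep the snippet runnable without the package installed, the extractor is passed in as a parameter typed after the parameter descriptions above. `BufferExtractor`, `ExtractResult`, and `tryExtract` are hypothetical names, and whether the real function rejects on unsupported input is an assumption.

```typescript
interface TextExtract {
  mimeType: string
  content: string
}

// Shape of extractTextFromBuffer as described above (filetype treated as optional).
type BufferExtractor = (args: {
  bufferArray: Uint8Array
  filetype?: string
}) => Promise<TextExtract>

type ExtractResult =
  | { ok: true; value: TextExtract }
  | { ok: false; error: unknown }

// Hypothetical wrapper: converts a rejected promise into a value the caller can branch on.
async function tryExtract(
  extract: BufferExtractor,
  bufferArray: Uint8Array,
  filetype?: string,
): Promise<ExtractResult> {
  try {
    return { ok: true, value: await extract({ bufferArray, filetype }) }
  } catch (error) {
    return { ok: false, error }
  }
}
```

With the package installed, `extractTextFromBuffer` would be passed as the first argument, for example `tryExtract(extractTextFromBuffer, bytes, 'application/pdf')`, assuming its declared types accept an optional `filetype`.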

Dependencies

The library uses the following packages (many thanks to the authors)
