@arbs.io/extract-text-content NPM

This npm package, @arbs.io/extract-text-content, offers a straightforward method to extract text content from various binary and text file formats. The package comes with a pre-built configuration that works out-of-the-box, requiring no additional setup. It is designed for use in Node.js environments, including Visual Studio Code extensions.

Supported MIME Types

The current version of the package supports extraction from the following MIME types:

PDF: application/pdf
DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Markdown: text/markdown
CSV text/csv
HTML text/html
Plain Text text/plain

Requesting Additional File Support

If you would like to request support for additional file formats, please submit an enhancement issue on the project's repository. We appreciate your feedback and contributions to improve this package for developers.

Feel free to explore the documentation for more details on how to use this package effectively in your projects. Happy coding!

Install

npm install @arbs.io/extract-text-content

If you use it with Webpack, you need the latest Webpack version and ensure you configure it correctly for ESM.

Usage

Node.js

Extract text from file using binary format. If the file type is binary the mime-type is verified using file-type.

import { extractTextFromFile } from '@arbs.io/extract-text-content'

const pdfPath = './data/microservices.pdf'
extractTextFromFile({
  filepath: pdfPath,
}).then((results) => {
  console.log(`pdf (${pdfPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})

Extract text from file using text format specifiying the mime-type to be used.

const htmlType = 'text/html'
const htmlPath = './data/microservices.htm'
extractTextFromFile({
  filepath: htmlPath,
  filetype: htmlType,
}).then((results) => {
  console.log(`html (${htmlPath})`)
  console.log(`\t- mime-type: ${results.mimeType}`)
  console.log(`\t- char-count: ${results.content.length}`)
  console.log(`\t- random-read: ${results.content.substring(2500, 2540)}`)
})

API

Response type

The TextExtract object provides the following properties

mimeType: The mime-type is set to the format of the data send to the function.
content: The raw text from the files

interface TextExtract {
  mimeType: string
  content: string
}

extractTextFromFile

This package also offers a convenient function, extractTextFromFile, which extracts text content from various file formats using the provided file path or URL. Below is a detailed explanation of the parameters accepted by this function:

extractTextFromFile(filepath: string, filetype?: string): Promise Parameters

filepath (Required): A string representing the path or URL to the file from which you want to extract text content. This parameter must be provided for the function to locate and process the input file.
filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.

function extractTextFromFile({
  filepath,
  filetype,
}: {
  filepath: string
  filetype?: string
}): Promise<TextExtract>

By using these parameters with the extractTextFromFile function, you can easily extract text content from supported file formats in your projects by providing a file path or URL.

extractTextFromBuffer

This package offers a primary function, extractTextFromBuffer, which is used to extract text content from various file formats. Below is a detailed explanation of the parameters accepted by this function:

extractTextFromBuffer(bufferArray: Uint8Array, filetype?: string): Promise

Parameters

bufferArray (Required): A Uint8Array representation of the data blob. This parameter must be provided for the function to process and extract text content from the input file.
filetype (Optional): A string that serves as a hint for the file format being loaded. For binary formats, this hint will be validated based on the binary format's magic number. If not provided, the function will attempt to determine the file type automatically.

function extractTextFromBuffer({
  bufferArray,
  filetype,
}: {
  bufferArray: Uint8Array
  filetype: string
}): Promise<TextExtract>

By using these parameters with the extractTextFromBuffer function, you can easily extract text content from supported file formats in your projects.