@bj.dev/parsely NPM

Parsely

Parsely is a lightweight JavaScript library for parsing various types of documents and web pages. It provides easy-to-use parsers for extracting and processing data from PDFs, Word documents, Excel spreadsheets, and web pages.

Features

Parse PDF documents and extract text, tables, and metadata.
Parse Word (DOCX) documents to extract raw text.
Parse Excel (XLSX) files to extract data from sheets.
Parse Web Pages to extract HTML content and metadata.
Lightweight and modular design, allowing you to use only the parsers you need.

Installation

Install Parsely via npm:

npm install "@bj.dev/parsely"

Usage

Importing Parsers

import { PDFParser, DocxParser, XlsxParser, WebParser } from "@bj.dev/parsely";

Parsing a PDF Document

const pdfParser = new PDFParser("path/to/document.pdf");

(async () => {
  const result = await pdfParser.parse();
  console.log(result);
})();

Parsing a Word Document

const docxParser = new DocxParser("path/to/document.docx");

(async () => {
  const text = await docxParser.parse();
  console.log(text);
})();

Parsing an Excel Spreadsheet

const xlsxParser = new XlsxParser("path/to/spreadsheet.xlsx");

(async () => {
  const data = await xlsxParser.parse();
  console.log(data);
})();

Parsing a Web Page

const webParser = new WebParser("https://example.com");

(async () => {
  const html = await webParser.parse();
  console.log(html);
})();

API Reference

PDFParser

Constructor: new PDFParser(filePath)
- filePath (String): Path to the PDF file.
Method: parse()
- Returns a Promise resolving to an object with metadata and text content.

DocxParser

Constructor: new DocxParser(filePath)
- filePath (String): Path to the DOCX file.
Method: parse()
- Returns a Promise resolving to the raw text of the document.

XlsxParser

Constructor: new XlsxParser(filePath)
- filePath (String): Path to the XLSX file.
Method: parse()
- Returns a Promise resolving to an array of sheet data.

WebParser

Constructor: new WebParser(webUrl)
- webUrl (String): URL of the web page to parse.
Method: parse()
- Returns a Promise resolving to the HTML content of the page.

Contributing

Contributions are welcome! If you encounter bugs or have feature requests, please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Created by Bolaji Bolajoko.

document parser document parser parse word documents parse worksheets parse web pages parse pdf

cheerio mammoth pdf-lib pdfjs-dist tesseract.js xlsx

1 year ago

1 year ago

1 year ago

1 year ago