1.3.2 • Published 8 months ago

uniparser v1.3.2


📜 UniParser: Universal File Parsing for Node.js

UniParser is a lightweight Node.js library that parses multiple file formats (PDF, DOCX, TXT, HTML, and Markdown) and converts them to plain text.

🚀 Say goodbye to file format limitations! UniParser extracts text content from all these formats, providing a consistent text output for your applications.


✨ Features

  • 🔍 PDF Parsing: Extracts plain text from PDF documents.
  • 📝 DOCX Parsing: Reads and extracts text from Microsoft Word .docx files.
  • 📄 TXT Parsing: Handles plain text files with no special formatting.
  • 🌐 HTML Parsing: Extracts text from the body of HTML documents.
  • 🎨 Markdown Parsing: Converts Markdown files to plain text, stripping out all formatting syntax.
  • 🔄 Auto-detection: The autoParse function detects the file format and applies the matching parser.
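
The README does not say how autoParse decides which parser to use. As a rough sketch, assuming detection is based on the file extension (an assumption, not confirmed by the source), the dispatch could look like the following. The `PARSER_BY_EXTENSION` table and the `parserNameFor` function are hypothetical names for illustration; only the parser names themselves come from UniParser's documented exports.

```javascript
const path = require('path');

// Hypothetical lookup table: file extension -> name of the UniParser export
// that handles it. The export names come from the README; the table does not.
const PARSER_BY_EXTENSION = {
    '.pdf': 'parsePDF',
    '.docx': 'parseDOCX',
    '.txt': 'parseTXT',
    '.html': 'parseHTML',
    '.md': 'parseMarkdown',
};

// Returns the parser name for a path, or null when the extension is not listed.
function parserNameFor(filePath) {
    const ext = path.extname(filePath).toLowerCase();
    return PARSER_BY_EXTENSION[ext] ?? null;
}

console.log(parserNameFor('./docs/report.PDF'));
console.log(parserNameFor('./notes/todo.md'));
console.log(parserNameFor('./image.png'));
```

Lower-casing the extension keeps `report.PDF` and `report.pdf` on the same path, and returning null lets the caller decide how to handle unsupported files.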

📦 Installation

To install UniParser, simply run:

npm install uniparser

🛠️ Usage

CommonJS (CJS) Example

If you're working in a Node.js environment with CommonJS (CJS), use require() to import UniParser:

const { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } = require('uniparser');

// Example: Automatically detect and parse a file
(async () => {
    const parsedText = await autoParse('./path/to/sample-file.pdf');
    console.log(parsedText);
})();

// Example: Parse specific file types.
// CommonJS has no top-level await, so the async parsers run inside an async function.
(async () => {
    const pdfText = await parsePDF('./path/to/sample-file.pdf');
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    const txtText = parseTXT('./path/to/sample-file.txt');
    const htmlText = parseHTML('./path/to/sample-file.html');
    const markdownText = parseMarkdown('./path/to/sample-file.md');
})();
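
The README does not describe what happens when a file is missing or cannot be parsed. Here is a minimal sketch, assuming the parsers throw or reject their Promise on failure (an assumption): a wrapper that returns a result object instead of throwing. `safeParse` is a hypothetical helper, and `fakeParser` is a stand-in so the snippet runs without the library. In real use you would pass autoParse or one of the specific parsers.

```javascript
// Hypothetical helper: runs a parser and never throws.
// parseFn is any function that takes a path and returns text or a Promise of text.
async function safeParse(parseFn, filePath) {
    try {
        const text = await parseFn(filePath);
        return { ok: true, text };
    } catch (error) {
        return { ok: false, error };
    }
}

// Stand-in parser for illustration only; replace with autoParse from 'uniparser'.
async function fakeParser(filePath) {
    if (!filePath.endsWith('.pdf')) {
        throw new Error(`Unsupported file: ${filePath}`);
    }
    return 'extracted text';
}

(async () => {
    const good = await safeParse(fakeParser, './sample-file.pdf');
    console.log(good.ok, good.text);

    const bad = await safeParse(fakeParser, './sample-file.xyz');
    console.log(bad.ok, bad.error.message);
})();
```

Because the call sits inside the try block and is awaited, the wrapper catches both synchronous throws and rejected Promises.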

ES Modules (ESM) Example

If you're working in an ES Module environment (modern JavaScript), use import to load the functions:

import { autoParse, parsePDF, parseDOCX, parseTXT, parseHTML, parseMarkdown } from 'uniparser';

// Example: Automatically detect and parse a file
(async () => {
    const parsedText = await autoParse('./path/to/sample-file.pdf');
    console.log(parsedText);
})();

// Example: Parse specific file types
const pdfText = await parsePDF('./path/to/sample-file.pdf');
const docxText = await parseDOCX('./path/to/sample-file.docx');
const txtText = parseTXT('./path/to/sample-file.txt');
const htmlText = parseHTML('./path/to/sample-file.html');
const markdownText = parseMarkdown('./path/to/sample-file.md');

⚑ Synchronous Usage (for small files)

parseTXT and parseMarkdown can be called without await, as the examples below show. A synchronous call blocks the event loop until the file has been read, so reserve this style for small files.

CommonJS (CJS):

const { parseTXT, parseMarkdown } = require('uniparser');

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);

ES Modules (ESM):

import { parseTXT, parseMarkdown } from 'uniparser';

// Synchronously read small text files
const txtContent = parseTXT('./path/to/sample-file.txt');
console.log(txtContent);

const markdownContent = parseMarkdown('./path/to/sample-file.md');
console.log(markdownContent);
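
The README does not define "small". One way to make the rule explicit is to check the file size before choosing the synchronous path. This is a sketch under stated assumptions: the 1 MiB threshold and the `isSmallEnoughForSync` helper are hypothetical, not part of UniParser.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical threshold: treat anything up to 1 MiB as "small".
const MAX_SYNC_BYTES = 1024 * 1024;

// Returns true when the file is small enough that a blocking read is acceptable.
function isSmallEnoughForSync(filePath, maxBytes = MAX_SYNC_BYTES) {
    return fs.statSync(filePath).size <= maxBytes;
}

// Demo on a temporary 5-byte file so the snippet runs on its own.
const demoPath = path.join(os.tmpdir(), `uniparser-demo-${process.pid}.txt`);
fs.writeFileSync(demoPath, 'hello');
console.log(isSmallEnoughForSync(demoPath));    // true: 5 bytes is under 1 MiB
console.log(isSmallEnoughForSync(demoPath, 4)); // false: 5 bytes exceeds the 4-byte limit
fs.unlinkSync(demoPath);
```

A caller could use the result to pick between a direct parseTXT call and an asynchronous route such as autoParse.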

🔗 Supported File Formats

  • 📄 PDF (.pdf): Converts PDF documents to plain text.
  • 📝 DOCX (.docx): Extracts text from Microsoft Word .docx files.
  • 🖋️ TXT (.txt): Reads plain text from simple text files.
  • 🌐 HTML (.html): Strips HTML tags and returns the text content.
  • ✍️ Markdown (.md): Converts Markdown files to plain text, removing all formatting.
  • 🔄 Auto-detection: Detects file types automatically via autoParse and processes them accordingly.
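
To give a feel for what "removing all formatting" means for Markdown, here is a deliberately naive, regex-based sketch. It is not UniParser's implementation, which the README does not describe. `stripMarkdown` is a hypothetical function that covers only headings, images, links, bold, italic, and inline code.

```javascript
// Naive illustration only: strips a handful of common Markdown constructs.
function stripMarkdown(markdown) {
    return markdown
        .replace(/^#{1,6}\s+/gm, '')               // heading markers at line start
        .replace(/!\[([^\]]*)\]\([^)]*\)/g, '$1')  // images -> alt text
        .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')   // links -> link text
        .replace(/(\*\*|__)(.+?)\1/g, '$2')        // bold -> inner text
        .replace(/(\*|_)(.+?)\1/g, '$2')           // italic -> inner text
        .replace(/`([^`]+)`/g, '$1');              // inline code -> inner text
}

const sample = '# Hello **world**, see [docs](https://example.com) and `npm`';
console.log(stripMarkdown(sample));
// prints "Hello world, see docs and npm"
```

The regex approach breaks on nested or unusual syntax (for example, underscores inside identifiers), which is one reason to leave the job to a dedicated parser.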

🎯 Example

Here's a quick example to get you started with DOCX parsing:

CommonJS (CJS):

const { parseDOCX } = require('uniparser');

(async () => {
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    console.log(docxText);
})();

ES Modules (ESM):

import { parseDOCX } from 'uniparser';

(async () => {
    const docxText = await parseDOCX('./path/to/sample-file.docx');
    console.log(docxText);
})();

🔑 License

This project is licensed under the MIT License. See the LICENSE file for more information.


🤝 Contributing

Contributions are welcome! If you'd like to improve UniParser, feel free to fork the repository and submit a pull request. We appreciate your feedback and contributions!


💡 UniParser makes it easier than ever to extract content from a wide range of file formats. Try it now and streamline your file processing tasks! 🌟

1.3.2 • 8 months ago
1.3.1 • 8 months ago
1.3.0 • 8 months ago
1.2.0 • 9 months ago
1.1.0 • 9 months ago
1.0.0 • 9 months ago