1.0.1 • Published 6 months ago
content-grabber-alvamind v1.0.1
content-grabber-alvamind
A Node.js library to extract text content from various file types.
Features
- Versatile File Support: Extracts text from .txt, .pdf, .docx, .csv, and .xlsx files.
- Local & Remote Files: Works with both local file paths and URLs.
- Intelligent Content Type Handling: Automatically detects content types from headers and file extensions.
- PDF Text Extraction: Extracts text from PDF files, with optional OCR support.
- Configurable OCR: Control OCR behavior (scale, languages).
- Customizable Logging: Supports a custom logger for info, error and debug messages.
- Error Handling: Provides descriptive error messages.
- Easy to Use: Simple API for quick integration into your projects.
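The content type handling described above can be illustrated with a small sketch. This is not the library's internal code: `detectKind`, `SupportedKind` and `MIME_TO_KIND` are hypothetical names, and the order of checks (header first, then file extension) is an assumption about how such detection is commonly done.

```typescript
// Hypothetical helper, not part of content-grabber-alvamind's public API.
type SupportedKind = 'txt' | 'pdf' | 'docx' | 'csv' | 'xlsx';

// Standard media types for the five supported formats.
const MIME_TO_KIND: Record<string, SupportedKind> = {
  'text/plain': 'txt',
  'application/pdf': 'pdf',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document': 'docx',
  'text/csv': 'csv',
  'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': 'xlsx',
};

const EXTENSIONS: readonly SupportedKind[] = ['txt', 'pdf', 'docx', 'csv', 'xlsx'];

function detectKind(source: string, contentTypeHeader?: string): SupportedKind | undefined {
  // Assumed order: trust a recognised Content-Type header first,
  // after dropping parameters such as "; charset=utf-8".
  if (contentTypeHeader) {
    const mime = (contentTypeHeader.split(';')[0] ?? '').trim().toLowerCase();
    const fromHeader = MIME_TO_KIND[mime];
    if (fromHeader) return fromHeader;
  }
  // Otherwise fall back to the file extension, ignoring any query string or fragment.
  const path = source.split(/[?#]/)[0] ?? '';
  const dot = path.lastIndexOf('.');
  if (dot === -1) return undefined;
  const ext = path.slice(dot + 1).toLowerCase();
  return EXTENSIONS.find((kind) => kind === ext);
}
```

A generic header such as application/octet-stream is not in the map, so the sketch falls through to the extension in that case.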
Benefits
- Simplify Data Extraction: Quickly grab text from different file types.
- Save Time: No need to handle file formats manually.
- Improve Productivity: Focus on processing text rather than parsing files.
- Reliable: Robust and well-tested.
Installation
npm install content-grabber-alvamind
Usage
Basic Example
import { fetchFileContent } from 'content-grabber-alvamind';

async function main() {
  try {
    const fileUrl = 'path/to/your/document.pdf'; // Replace with your file URL/path
    const extractedContent = await fetchFileContent(fileUrl);
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
PDF with OCR
import { fetchFileContent } from 'content-grabber-alvamind';

async function main() {
  try {
    const fileUrl = 'path/to/your/scanned_document.pdf';
    const extractedContent = await fetchFileContent(fileUrl, {
      pdfOptions: {
        ocrEnabled: true, // Enable OCR for scanned PDFs
        languages: ['eng', 'spa'], // Specify OCR languages
        scale: 2.5 // Increase scale for better OCR quality
      }
    });
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
Custom Logger Example
import { fetchFileContent, FileContentExtractionOptions } from 'content-grabber-alvamind';

// A logger only needs info, error and debug methods.
class CustomLogger {
  info(message: string, ...args: any[]): void {
    console.log(`[CUSTOM INFO] ${message}`, ...args);
  }
  error(message: string, ...args: any[]): void {
    console.error(`[CUSTOM ERROR] ${message}`, ...args);
  }
  debug(message: string, ...args: any[]): void {
    console.debug(`[CUSTOM DEBUG] ${message}`, ...args);
  }
}

async function main() {
  try {
    const fileUrl = 'path/to/your/document.txt';
    const options: FileContentExtractionOptions = {
      logger: new CustomLogger()
    };
    const extractedContent = await fetchFileContent(fileUrl, options);
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
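A logger does not have to write to the console. The following is a minimal sketch of one that buffers messages in memory, which can be handy in tests. It assumes the library only ever calls info, error and debug with a message plus optional extra arguments, as in the example above; `MemoryLogger`, `LogEntry` and `messagesAt` are hypothetical names.

```typescript
// Hypothetical in-memory logger; only the info/error/debug method names
// come from the README, everything else is illustrative.
interface LogEntry {
  level: 'info' | 'error' | 'debug';
  message: string;
  args: unknown[];
}

class MemoryLogger {
  readonly entries: LogEntry[] = [];

  info(message: string, ...args: unknown[]): void {
    this.entries.push({ level: 'info', message, args });
  }

  error(message: string, ...args: unknown[]): void {
    this.entries.push({ level: 'error', message, args });
  }

  debug(message: string, ...args: unknown[]): void {
    this.entries.push({ level: 'debug', message, args });
  }

  // Return only the messages recorded at the given level, in call order.
  messagesAt(level: LogEntry['level']): string[] {
    return this.entries.filter((entry) => entry.level === level).map((entry) => entry.message);
  }
}
```

An instance could then be passed as the logger option in place of CustomLogger and inspected after extraction finishes.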
DOCX Extraction
import { fetchFileContent } from 'content-grabber-alvamind';

async function main() {
  try {
    const fileUrl = 'path/to/your/document.docx'; // Replace with your file URL/path
    const extractedContent = await fetchFileContent(fileUrl);
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
CSV Extraction
import { fetchFileContent } from 'content-grabber-alvamind';

async function main() {
  try {
    const fileUrl = 'path/to/your/data.csv'; // Replace with your file URL/path
    const extractedContent = await fetchFileContent(fileUrl);
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
Excel Extraction
import { fetchFileContent } from 'content-grabber-alvamind';

async function main() {
  try {
    const fileUrl = 'path/to/your/data.xlsx'; // Replace with your file URL/path
    const extractedContent = await fetchFileContent(fileUrl);
    console.log(extractedContent);
  } catch (error) {
    console.error('Error:', error);
  }
}

main();
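The examples above each handle a single file. When several documents need processing, one failing file should not abort the rest. Here is a minimal sketch using Promise.allSettled; `extractAll`, `Extractor` and `BatchResult` are hypothetical names, and the extractor is passed in as a parameter so that fetchFileContent (or a stub) can be supplied by the caller.

```typescript
// Hypothetical batch helper; not provided by content-grabber-alvamind.
type Extractor = (source: string) => Promise<string>;

interface BatchResult {
  source: string;
  text?: string;  // set when extraction succeeded
  error?: string; // set when extraction failed
}

async function extractAll(sources: string[], extract: Extractor): Promise<BatchResult[]> {
  // Start every extraction at once and wait for all of them,
  // whether they succeed or fail.
  const settled = await Promise.allSettled(sources.map((source) => extract(source)));

  // allSettled preserves input order, so index i maps back to sources[i].
  return settled.map((outcome, i): BatchResult => {
    const source = sources[i] ?? '';
    if (outcome.status === 'fulfilled') {
      return { source, text: outcome.value };
    }
    const reason: unknown = outcome.reason;
    return { source, error: reason instanceof Error ? reason.message : String(reason) };
  });
}
```

With the library installed, a call might look like extractAll(['a.pdf', 'b.docx'], (source) => fetchFileContent(source)). All extractions run concurrently in this sketch, so a large batch of OCR-heavy PDFs may need a concurrency limit.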
API
fetchFileContent(fileUrl: string, options?: FileContentExtractionOptions): Promise<string>
- fileUrl (string): The URL or local file path of the document to extract text from.
- options (object, optional): An object containing optional configurations:
  - pdfOptions (object, optional): Configuration for PDF extraction:
    - ocrEnabled (boolean, optional): Enable OCR extraction. Default: true.
    - scale (number, optional): Scale factor for the OCR image. Default: 2.0.
    - languages (string[], optional): Array of OCR languages (e.g., ['eng', 'spa']). Default: ['eng'].
    - minTextLength (number, optional): Minimum length of normally extracted text, used when deciding whether to use OCR. Default: 50.
  - logger (object, optional): Custom logger object that implements info, error and debug methods.
- Returns: A Promise that resolves with the extracted text content, or rejects with an error.
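To make the documented defaults concrete, here is a minimal sketch of how PDF options could be resolved and how minTextLength might gate OCR. The option names and defaults come from the list above; `PdfOptionsSketch`, `resolvePdfOptions` and `shouldRunOcr` are hypothetical, and the rule "run OCR only when the normally extracted text is shorter than minTextLength" is an assumed reading of that option, not confirmed library behavior.

```typescript
// Shape inferred from the option list above; the library's real type
// declarations may differ.
interface PdfOptionsSketch {
  ocrEnabled?: boolean;
  scale?: number;
  languages?: string[];
  minTextLength?: number;
}

// Defaults as documented in the API section.
const PDF_DEFAULTS: Required<PdfOptionsSketch> = {
  ocrEnabled: true,
  scale: 2.0,
  languages: ['eng'],
  minTextLength: 50,
};

// Fill in any option the caller left out (or set to undefined).
function resolvePdfOptions(overrides: PdfOptionsSketch = {}): Required<PdfOptionsSketch> {
  return {
    ocrEnabled: overrides.ocrEnabled ?? PDF_DEFAULTS.ocrEnabled,
    scale: overrides.scale ?? PDF_DEFAULTS.scale,
    languages: [...(overrides.languages ?? PDF_DEFAULTS.languages)],
    minTextLength: overrides.minTextLength ?? PDF_DEFAULTS.minTextLength,
  };
}

// Assumed rule: OCR is attempted only when it is enabled and the text layer
// yielded fewer characters than minTextLength (typical of scanned PDFs).
function shouldRunOcr(extractedText: string, options: Required<PdfOptionsSketch>): boolean {
  return options.ocrEnabled && extractedText.trim().length < options.minTextLength;
}
```

Under this reading, a scanned PDF with an empty text layer would trigger OCR with the defaults, while a text-based PDF with a full page of content would not.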
Roadmap
- Support for more file types (e.g., .odt, .rtf).
- Improved OCR accuracy and performance.
- Configurable text extraction strategies.
- Add unit tests.
- More advanced logging options.
Contributing
Contributions are welcome! Feel free to submit issues, feature requests, and pull requests on GitHub.
Here's how you can help:
- Report bugs.
- Suggest new features.
- Improve documentation.
- Submit code changes.
Support the Project
If you find this project useful, consider supporting its development! You can contribute through:
- GitHub Sponsors: Link to GitHub Sponsors
- Donations: Link to Donation Platform
Your support keeps this project going!
License