ozymandias_osiris v1.0.15
Osiris - Document Ingestion Pipeline
Osiris is a document ingestion pipeline that converts content into vector embeddings and stores them in a Qdrant vector database. It is built to work with the Ibis chat application.
Prerequisites
- Node.js (v18 or higher)
- npm or yarn
- A Qdrant instance (local or cloud)
- OpenAI API key
Installation
Clone the repository:
git clone <repository-url>
cd osiris
Install dependencies:
npm install
Create a .env file in the root directory:
# OpenAI API key for embeddings generation
OPENAI_API_KEY=your_openai_api_key
# Qdrant settings
# URL of your Qdrant instance, e.g., http://localhost:6333 or your cloud URL
QDRANT_URL=your_qdrant_url
# Optional for local, required for cloud
QDRANT_API_KEY=your_qdrant_api_key
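To catch a missing or malformed setting before any embeddings are requested, the three variables above can be checked once at startup. The following is a minimal sketch, not the Osiris implementation: the `OsirisEnv` interface and the `readEnv` function are hypothetical names introduced here for illustration, and only the variable names come from the .env file above.

```typescript
// Hypothetical config shape; the field names are illustrative and not taken
// from the Osiris source.
interface OsirisEnv {
  openaiApiKey: string;
  qdrantUrl: string;
  qdrantApiKey: string | undefined; // may be absent for a local Qdrant instance
}

// Reads the variables from an env-like object and fails fast on missing or
// malformed values.
function readEnv(env: Record<string, string | undefined>): OsirisEnv {
  const openaiApiKey = env["OPENAI_API_KEY"]?.trim();
  if (!openaiApiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }

  const qdrantUrl = env["QDRANT_URL"]?.trim();
  if (!qdrantUrl) {
    throw new Error("QDRANT_URL is not set");
  }
  // new URL() throws on strings that are not URLs at all; the protocol check
  // rejects values such as "localhost:6333" that parse but lack http(s).
  const parsed = new URL(qdrantUrl);
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`QDRANT_URL must start with http:// or https://, got "${qdrantUrl}"`);
  }

  // Treat an empty string the same as an unset key.
  const qdrantApiKey = env["QDRANT_API_KEY"]?.trim() || undefined;

  return { openaiApiKey, qdrantUrl, qdrantApiKey };
}
```

Calling `readEnv(process.env)` once, after the .env file has been loaded by whatever mechanism the project uses, would surface configuration problems before any document is processed.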
Preparing Your Data
Create JSON files containing your documents. Each JSON file should follow this structure:
{
"title": "Document Title",
"content": "Document content goes here...",
"metadata": {
"source": "optional source information",
"author": "optional author information",
"date": "optional date information"
}
}
Store your JSON files in a directory (e.g., ./data).
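The structure above can be checked before ingestion with a small validator. This is a sketch under stated assumptions rather than the validation Osiris actually performs: `InputDocument` and `parseDocument` are hypothetical names, and the rules that `title` and `content` must be non-empty strings, that `metadata` may be omitted entirely, and that metadata values are strings are assumptions inferred from the example.

```typescript
// Hypothetical type mirroring the JSON example above.
interface InputDocument {
  title: string;
  content: string;
  metadata: Record<string, string>; // empty when the file has no metadata
}

// Parses one JSON file's text and checks the assumed structure.
function parseDocument(raw: string): InputDocument {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on invalid JSON
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("document must be a JSON object");
  }
  const record = data as Record<string, unknown>;

  const title = record["title"];
  if (typeof title !== "string" || title.trim() === "") {
    throw new Error('"title" must be a non-empty string');
  }
  const content = record["content"];
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error('"content" must be a non-empty string');
  }

  // Assumption: metadata is optional, and every value in it is a string.
  const metadata: Record<string, string> = {};
  const rawMetadata = record["metadata"];
  if (rawMetadata !== undefined) {
    if (typeof rawMetadata !== "object" || rawMetadata === null || Array.isArray(rawMetadata)) {
      throw new Error('"metadata" must be an object');
    }
    for (const [key, value] of Object.entries(rawMetadata)) {
      if (typeof value !== "string") {
        throw new Error(`metadata.${key} must be a string`);
      }
      metadata[key] = value;
    }
  }

  return { title, content, metadata };
}
```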
Usage
Build the project:
npm run build
Run the ingestion pipeline:
node dist/index.js ingest <directory> [options]
Options:
- --collection - Collection name (default: 'website_content')
- --batch-size - Batch size for processing (default: 100)
- --max-retries - Max retries for failed operations (default: 3)
- --max-concurrent - Max concurrent operations (default: 5)
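For readers who want to see how these flags and their defaults fit together, here is a minimal sketch using `parseArgs` from Node's built-in `node:util` module (available from Node 18.3). It is not how Osiris parses its arguments; `IngestOptions`, `toInt`, and `parseIngestArgs` are hypothetical names, and the sketch assumes it receives only the arguments that follow the `ingest` command.

```typescript
import { parseArgs } from "node:util";

// Hypothetical options shape; defaults are the ones listed above.
interface IngestOptions {
  directory: string;
  collection: string;
  batchSize: number;
  maxRetries: number;
  maxConcurrent: number;
}

// Converts a flag value to an integer >= min, or returns the fallback when
// the flag was not given.
function toInt(value: string | undefined, fallback: number, flag: string, min: number): number {
  if (value === undefined) return fallback;
  const parsed = Number(value);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${flag} must be an integer >= ${min}, got "${value}"`);
  }
  return parsed;
}

// args: everything after the "ingest" command, e.g. ["./data", "--batch-size", "50"].
function parseIngestArgs(args: string[]): IngestOptions {
  const { values, positionals } = parseArgs({
    args,
    allowPositionals: true,
    options: {
      collection: { type: "string" },
      "batch-size": { type: "string" },
      "max-retries": { type: "string" },
      "max-concurrent": { type: "string" },
    },
  });

  const directory = positionals[0];
  if (directory === undefined) {
    throw new Error("missing <directory> argument");
  }

  return {
    directory,
    collection: values.collection ?? "website_content",
    batchSize: toInt(values["batch-size"], 100, "--batch-size", 1),
    maxRetries: toInt(values["max-retries"], 3, "--max-retries", 0),
    maxConcurrent: toInt(values["max-concurrent"], 5, "--max-concurrent", 1),
  };
}
```

Because `parseArgs` runs in strict mode by default, an unknown flag raises an error instead of being silently ignored.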
Example
node dist/index.js ingest ./data --collection documents --batch-size 50
Alternatively, add this function to your ~/.zshrc:
# Osiris data ingestion function
osiris() {
  # Show help if no arguments are provided or help is requested
  if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: osiris <command> [options]"
    echo ""
    echo "Commands:"
    echo "  ingest <directory> -c <collection> -g <group-id>   # Ingest content"
    echo "  health                                             # Check system health"
    echo "  clean <collection>                                 # Clean collection"
    echo "  delete-by-group -c <collection> -g <group-id>      # Delete by group"
    echo ""
    echo "Examples:"
    echo "  osiris ingest ./content -c my-collection -g client1"
    echo "  osiris health"
    echo "  osiris clean my-collection"
    echo "  osiris delete-by-group -c my-collection -g client1"
    return 1
  fi
  # Path to your local Osiris checkout (adjust for your machine)
  local OSIRIS_PATH="/users/ivan/sites/ozymandias/osiris"
  # If the first argument is not a known command, treat it as a path and insert 'ingest'
  if [[ "$1" != "ingest" && "$1" != "health" && "$1" != "clean" && "$1" != "delete-by-group" ]]; then
    set -- "ingest" "$@"
  fi
  # Run the command from the Osiris checkout using tsx instead of node.
  # The subshell leaves your current shell's working directory unchanged;
  # relative paths such as ./content are resolved against OSIRIS_PATH.
  (cd "$OSIRIS_PATH" && npx tsx src/index.ts "$@")
}
Then run:
# Ingest content
osiris ./content -c collection-name -g client1
# Check health
osiris health
# Clean collection
osiris clean my-collection
# Delete by group
osiris delete-by-group -c my-collection -g client1
Features
- Content Validation: Validates JSON files and their content structure
- Text Chunking: Intelligently splits documents into appropriate chunks
- Embedding Generation: Generates embeddings using OpenAI's API
- Vector Storage: Stores embeddings in Qdrant vector database
- Progress Tracking: Shows real-time progress and statistics
- Error Handling: Robust error handling with retries
- Concurrent Processing: Efficient parallel processing of documents
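To make the chunking and batching steps concrete, the sketch below splits text with a fixed-size sliding window and then groups the chunks into batches. It is only an illustration: the README does not describe the chunking strategy Osiris uses, so the window approach, the `chunkSize` and `overlap` parameters and their default values, and the function names `chunkText` and `toBatches` are all assumptions.

```typescript
// Splits text into windows of chunkSize characters, each overlapping the
// previous one by `overlap` characters. A plain sliding window, swapped in
// here as a stand-in for whatever strategy Osiris applies.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be an integer in [0, chunkSize)");
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text, so no trailing
    // chunk made purely of overlap is emitted.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Groups items into arrays of at most batchSize elements, in order.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With a batch size taken from --batch-size, `toBatches(chunkText(content), batchSize)` would yield the groups that are sent for embedding one request at a time.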
Monitoring
The ingestion process provides real-time feedback:
- Progress of file processing
- Number of chunks generated
- Embedding generation progress
- Success/failure statistics
Error Handling
Errors are logged with detailed information. Failed operations are automatically retried based on the --max-retries setting.
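One common way to implement this kind of retry is a wrapper with exponential backoff. The sketch below shows the idea only; the README does not say how Osiris spaces its retries, so the backoff schedule, the `baseMs` and `capMs` values, and the names `backoffDelayMs` and `withRetries` are assumptions. It also assumes that a --max-retries value of 3 means one initial attempt plus up to three retries.

```typescript
// Delay before retry number `attempt` (0-based): baseMs, 2*baseMs, 4*baseMs,
// ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Default sleep based on setTimeout; injectable so callers can skip real waits.
const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying up to maxRetries times after a failure.
// Rethrows the last error once all attempts are exhausted.
async function withRetries<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

An embedding request or a Qdrant upsert could be passed as `operation`, so that a transient network failure does not abort the whole run.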
Integration with Ibis
Osiris is designed to work with the Ibis chat application. Make sure to:
- Use the same Qdrant instance in both applications
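- Pass --collection with the name Ibis is configured to use (Ibis default: 'documents'). The Osiris default, 'website_content', is different, so omitting the flag leaves the two applications pointing at different collections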
Development
Run tests:
npm run test
Watch mode:
npm run test:watch
Generate coverage report:
npm run coverage