1.0.15 • Published 10 months ago

ozymandias_osiris v1.0.15

License
ISC

Osiris - Document Ingestion Pipeline

Osiris is a document ingestion pipeline that converts content into vector embeddings and stores them in a Qdrant vector database. It is designed to work with the Ibis chat application.

Prerequisites

  • Node.js (v18 or higher)
  • npm or yarn
  • A Qdrant instance (local or cloud)
  • OpenAI API key

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd osiris
  2. Install dependencies:

    npm install
  3. Create a .env file in the root directory:

    # OpenAI API key for embeddings generation
    OPENAI_API_KEY=your_openai_api_key

    # Qdrant settings
    # e.g., http://localhost:6333 or your cloud URL
    QDRANT_URL=your_qdrant_url
    # Optional for local, required for cloud
    QDRANT_API_KEY=your_qdrant_api_key
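The settings above could be checked at startup so that a missing key fails early rather than halfway through an ingestion run. The following is a minimal sketch, not Osiris's actual code: `OsirisConfig` and `loadConfig` are hypothetical names, and the function assumes the environment has already been loaded into a plain object (for example `process.env` after a loader such as dotenv has run).

```typescript
// Hypothetical config shape; fields mirror the .env keys described above.
interface OsirisConfig {
  openaiApiKey: string;
  qdrantUrl: string;
  qdrantApiKey: string | undefined; // optional for a local Qdrant instance
}

// Validate an environment-like object and return a typed config.
// Throws one error naming every missing required variable.
function loadConfig(env: Record<string, string | undefined>): OsirisConfig {
  const required = ["OPENAI_API_KEY", "QDRANT_URL"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`,
    );
  }
  return {
    openaiApiKey: env["OPENAI_API_KEY"] as string,
    qdrantUrl: env["QDRANT_URL"] as string,
    // Treat an empty string the same as an unset key.
    qdrantApiKey: env["QDRANT_API_KEY"] || undefined,
  };
}
```

In a Node.js entry point this would typically be called as `loadConfig(process.env)`.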

Preparing Your Data

Create JSON files containing your documents. Each JSON file should follow this structure:

{
  "title": "Document Title",
  "content": "Document content goes here...",
  "metadata": {
    "source": "optional source information",
    "author": "optional author information",
    "date": "optional date information"
  }
}

Store your JSON files in a directory (e.g., ./data).
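The structure above can be expressed as a type plus a small validator. This is a sketch under assumptions: `InputDocument` and `parseDocument` are hypothetical names, and it assumes `title` and `content` are required while `metadata` and its fields are optional (the source only marks the metadata fields as optional).

```typescript
// Metadata fields, all optional per the structure above.
interface DocumentMetadata {
  source?: string;
  author?: string;
  date?: string;
}

// One input file. Treating `metadata` itself as optional is an assumption.
interface InputDocument {
  title: string;
  content: string;
  metadata?: DocumentMetadata;
}

// Parse raw JSON text and check the expected fields.
// Returns the typed document or throws a descriptive error.
function parseDocument(raw: string): InputDocument {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("Document must be a JSON object");
  }
  const doc = data as Record<string, unknown>;
  for (const field of ["title", "content"]) {
    const value = doc[field];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`"${field}" must be a non-empty string`);
    }
  }
  const metadata = doc["metadata"];
  if (
    metadata !== undefined &&
    (typeof metadata !== "object" || metadata === null || Array.isArray(metadata))
  ) {
    throw new Error('"metadata" must be an object when present');
  }
  return doc as unknown as InputDocument;
}
```

Running a check like this over every file in ./data before ingestion surfaces malformed files without spending any embedding calls.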

Usage

  1. Build the project:

    npm run build
  2. Run the ingestion pipeline:

    node dist/index.js ingest <directory> [options]

Options:

  • --collection - Collection name (default: 'website_content')
  • --batch-size - Batch size for processing (default: 100)
  • --max-retries - Max retries for failed operations (default: 3)
  • --max-concurrent - Max concurrent operations (default: 5)
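To illustrate what --batch-size and --max-concurrent typically control, here is a minimal sketch of batching and bounded concurrency. The helper names `toBatches` and `mapWithConcurrency` are hypothetical, and this is one common way to implement such limits, not necessarily how Osiris does it.

```typescript
// Split a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run `worker` over every item with at most `maxConcurrent` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrent: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each runner claims the next unprocessed index until none remain.
  const runner = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  };
  const runnerCount = Math.max(1, Math.min(maxConcurrent, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

With the defaults listed above, 250 chunks would form three batches (100, 100, 50), and no more than five worker calls would run at once.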

Example

node dist/index.js ingest ./data --collection documents --batch-size 50

Alternatively, add this function to your .zshrc:

# Osiris data ingestion function
osiris() {
  # Show help if no arguments provided
  if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: osiris <command> [options]"
    echo ""
    echo "Commands:"
    echo "  ingest <directory> -c <collection> -g <group-id>  # Ingest content"
    echo "  health                                           # Check system health"
    echo "  clean <collection>                               # Clean collection"
    echo "  delete-by-group -c <collection> -g <group-id>    # Delete by group"
    echo ""
    echo "Examples:"
    echo "  osiris ingest ./content -c my-collection -g client1"
    echo "  osiris health"
    echo "  osiris clean my-collection"
    echo "  osiris delete-by-group -c my-collection -g client1"
    return 1
  fi

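  # Path to the local Osiris checkout (adjust for your machine)
  local OSIRIS_PATH="/users/ivan/sites/ozymandias/osiris"

  # If the first argument is not a known command, treat it as a path and insert 'ingest'
  if [[ "$1" != "ingest" && "$1" != "health" && "$1" != "clean" && "$1" != "delete-by-group" ]]; then
    set -- "ingest" "$@"
  fi

  # Run the command using tsx instead of node, in a subshell so the
  # caller's working directory is left unchanged. Relative paths are
  # resolved from $OSIRIS_PATH, so pass absolute paths from elsewhere.
  (cd "$OSIRIS_PATH" && npx tsx src/index.ts "$@")
}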

Then run:

# Ingest content
osiris ./content -c collection-name -g client1

# Check health
osiris health

# Clean collection
osiris clean my-collection

# Delete by group
osiris delete-by-group -c my-collection -g client1

Features

  • Content Validation: Validates JSON files and their content structure
  • Text Chunking: Splits documents into chunks sized for embedding
  • Embedding Generation: Generates embeddings using OpenAI's API
  • Vector Storage: Stores embeddings in Qdrant vector database
  • Progress Tracking: Shows real-time progress and statistics
  • Error Handling: Robust error handling with retries
  • Concurrent Processing: Efficient parallel processing of documents
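As an illustration of the text chunking step, the sketch below splits text into fixed-size chunks with overlap, breaking at whitespace where possible. The chunking strategy Osiris actually uses is not documented here, so the function name `chunkText`, the character-based sizing, and the default values are all assumptions.

```typescript
// Split text into chunks of at most `maxChars` characters, preferring to
// break at a space, with `overlap` characters repeated between chunks.
function chunkText(text: string, maxChars = 1000, overlap = 100): string[] {
  if (maxChars < 1 || overlap < 0 || overlap >= maxChars) {
    throw new Error("Require maxChars >= 1 and 0 <= overlap < maxChars");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      // Back up to the last space in the window, but only if the chunk
      // still extends past the overlap so the loop keeps moving forward.
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start + overlap) {
        end = lastSpace;
      }
    }
    const chunk = text.slice(start, end).trim();
    if (chunk) {
      chunks.push(chunk);
    }
    if (end >= text.length) {
      break;
    }
    // Step back by `overlap` so neighbouring chunks share some context.
    start = end - overlap;
  }
  return chunks;
}
```

Overlap trades a little extra storage for context that would otherwise be cut at chunk boundaries; production pipelines often size chunks in tokens rather than characters.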

Monitoring

The ingestion process provides real-time feedback:

  • Progress of file processing
  • Number of chunks generated
  • Embedding generation progress
  • Success/failure statistics

Error Handling

Errors are logged with detailed information. Failed operations are automatically retried based on the --max-retries setting.
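A retry policy of this kind can be sketched as a small wrapper around any async operation. The helper name `withRetries` is hypothetical, and the exponential backoff between attempts is an assumption: the text above only states that failed operations are retried according to --max-retries.

```typescript
// Run `operation`, retrying up to `maxRetries` times after the first failure.
// Waits baseDelayMs * 2^attempt between tries (exponential backoff).
async function withRetries<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) {
        break; // out of retries; rethrow below
      }
      const delayMs = baseDelayMs * Math.pow(2, attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With `maxRetries = 3`, an operation is attempted at most four times in total before the last error is rethrown to the caller.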

Integration with Ibis

Osiris is designed to work with the Ibis chat application. Make sure to:

  • Use the same Qdrant instance in both applications
  • Set the collection name to match Ibis's configuration (Ibis default: 'documents'). Osiris defaults to 'website_content', so pass --collection explicitly.

Development

Run tests:

npm run test

Watch mode:

npm run test:watch

Generate coverage report:

npm run coverage