1.0.15 • Published 6 months ago

ozymandias_osiris v1.0.15

License
ISC

Osiris - Document Ingestion Pipeline

Osiris is a document ingestion pipeline that converts content into vector embeddings and stores them in a Qdrant vector database. It is built to work with the Ibis chat application.

Prerequisites

  • Node.js (v18 or higher)
  • npm or yarn
  • A Qdrant instance (local or cloud)
  • OpenAI API key

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd osiris
  2. Install dependencies:

    npm install
  3. Create a .env file in the root directory:

    # OpenAI API key for embeddings generation
    OPENAI_API_KEY=your_openai_api_key

    # Qdrant settings
    QDRANT_URL=your_qdrant_url # e.g., http://localhost:6333 or your cloud URL
    QDRANT_API_KEY=your_qdrant_api_key # Optional for local, required for cloud
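Checking these variables at startup turns a missing key into an immediate, readable error instead of a failed API call later. The following is a minimal sketch: `OsirisConfig` and `loadConfig` are hypothetical names used for illustration, not part of Osiris's documented API. Only the variable names and the optional status of QDRANT_API_KEY come from the settings above.

```typescript
// Hypothetical config shape; the field names are illustrative only.
interface OsirisConfig {
  openaiApiKey: string;
  qdrantUrl: string;
  qdrantApiKey?: string; // optional for a local Qdrant instance
}

// Reads the three variables from an env-like object and fails fast
// when a required one is missing or blank.
function loadConfig(env: Record<string, string | undefined>): OsirisConfig {
  const required = (name: string): string => {
    const value = env[name]?.trim();
    if (!value) {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  };

  const apiKey = env["QDRANT_API_KEY"]?.trim();
  return {
    openaiApiKey: required("OPENAI_API_KEY"),
    qdrantUrl: required("QDRANT_URL"),
    // Only set the key when one was actually provided.
    ...(apiKey ? { qdrantApiKey: apiKey } : {}),
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)` once the .env file has been loaded, for example with the dotenv package.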

Preparing Your Data

Create JSON files containing your documents. Each JSON file should follow this structure:

{
  "title": "Document Title",
  "content": "Document content goes here...",
  "metadata": {
    "source": "optional source information",
    "author": "optional author information",
    "date": "optional date information"
  }
}

Store your JSON files in a directory (e.g., ./data).
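To check a file against this structure before ingesting it, a validator along the following lines could be used. This is a sketch under stated assumptions, not Osiris's actual validation logic: it assumes that "title" and "content" are required non-empty strings and that "metadata" may be omitted entirely. `validateDocument` is a hypothetical name.

```typescript
// Returns a list of problems; an empty list means the document looks usable.
// Assumption: title and content are required, metadata is optional.
function validateDocument(value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["document must be a JSON object"];
  }
  const errors: string[] = [];
  const doc = value as Record<string, unknown>;

  for (const field of ["title", "content"]) {
    const fieldValue = doc[field];
    if (typeof fieldValue !== "string" || fieldValue.trim() === "") {
      errors.push(`"${field}" must be a non-empty string`);
    }
  }

  const metadata = doc["metadata"];
  if (
    metadata !== undefined &&
    (typeof metadata !== "object" || metadata === null || Array.isArray(metadata))
  ) {
    errors.push('"metadata" must be an object when present');
  }
  return errors;
}
```

A file would typically be read, passed through `JSON.parse`, and then handed to `validateDocument`; any returned messages can be reported with the file name.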

Usage

  1. Build the project:

    npm run build
  2. Run the ingestion pipeline:

    node dist/index.js ingest <directory> [options]

Options:

  • --collection - Collection name (default: 'website_content')
  • --batch-size - Batch size for processing (default: 100)
  • --max-retries - Max retries for failed operations (default: 3)
  • --max-concurrent - Max concurrent operations (default: 5)
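The defaults above can be summarized in code. The sketch below is a hypothetical parser, not the one Osiris ships: `IngestOptions` and `parseIngestOptions` are illustrative names, and the rule that numeric options must be positive integers is an assumption. Only the flag names and default values are taken from the list above.

```typescript
// Option names and defaults copied from the list above.
interface IngestOptions {
  collection: string;
  batchSize: number;
  maxRetries: number;
  maxConcurrent: number;
}

const DEFAULTS: IngestOptions = {
  collection: "website_content",
  batchSize: 100,
  maxRetries: 3,
  maxConcurrent: 5,
};

// Hypothetical parser: walks "--flag value" pairs and overrides the defaults.
function parseIngestOptions(args: string[]): IngestOptions {
  const options: IngestOptions = { ...DEFAULTS };
  const positiveInt = (flag: string, raw: string | undefined): number => {
    const n = Number(raw);
    if (!Number.isInteger(n) || n <= 0) {
      throw new Error(`${flag} expects a positive integer, got "${raw}"`);
    }
    return n;
  };

  for (let i = 0; i < args.length; i += 2) {
    const flag = args[i];
    const value = args[i + 1];
    switch (flag) {
      case "--collection":
        if (!value) throw new Error("--collection expects a name");
        options.collection = value;
        break;
      case "--batch-size":
        options.batchSize = positiveInt(flag, value);
        break;
      case "--max-retries":
        options.maxRetries = positiveInt(flag, value);
        break;
      case "--max-concurrent":
        options.maxConcurrent = positiveInt(flag, value);
        break;
      default:
        throw new Error(`Unknown option: ${flag}`);
    }
  }
  return options;
}
```

Options that are not passed keep their defaults, so `parseIngestOptions(["--batch-size", "50"])` changes only the batch size.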

Example

node dist/index.js ingest ./data --collection documents --batch-size 50

Alternatively, add this function to your .zshrc:

# Osiris data ingestion function
osiris() {
  # Show help if no arguments provided
  if [ -z "$1" ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
    echo "Usage: osiris <command> [options]"
    echo ""
    echo "Commands:"
    echo "  ingest <directory> -c <collection> -g <group-id>  # Ingest content"
    echo "  health                                           # Check system health"
    echo "  clean <collection>                               # Clean collection"
    echo "  delete-by-group -c <collection> -g <group-id>    # Delete by group"
    echo ""
    echo "Examples:"
    echo "  osiris ingest ./content -c my-collection -g client1"
    echo "  osiris health"
    echo "  osiris clean my-collection"
    echo "  osiris delete-by-group -c my-collection -g client1"
    return 1
  fi

  # Path to your local Osiris checkout (adjust for your machine)
  local OSIRIS_PATH="/users/ivan/sites/ozymandias/osiris"

  # If the first argument is not a known command, treat it as a path and insert 'ingest'
  if [[ "$1" != "ingest" && "$1" != "health" && "$1" != "clean" && "$1" != "delete-by-group" ]]; then
    set -- "ingest" "$@"
  fi

  # Run the command using tsx instead of node, in a subshell so the caller's
  # working directory is unchanged. Relative paths are resolved from $OSIRIS_PATH.
  (cd "$OSIRIS_PATH" && npx tsx src/index.ts "$@")
}

Then run:

# Ingest content
osiris ./content -c collection-name -g client1

# Check health
osiris health

# Clean collection
osiris clean my-collection

# Delete by group
osiris delete-by-group -c my-collection -g client1

Features

  • Content Validation: Validates JSON files and their content structure
  • Text Chunking: Splits documents into chunks before embedding
  • Embedding Generation: Generates embeddings using OpenAI's API
  • Vector Storage: Stores embeddings in Qdrant vector database
  • Progress Tracking: Shows real-time progress and statistics
  • Error Handling: Retries failed operations (see --max-retries)
  • Concurrent Processing: Processes documents in parallel (see --max-concurrent)
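This README does not describe how the text chunking step decides where to split. As an illustration only, the sketch below uses a common alternative: fixed-size character windows with overlap, so that text cut at a boundary still appears whole in a neighbouring chunk. `chunkText` and its default sizes are assumptions, not Osiris's actual strategy.

```typescript
// Splits text into chunks of at most `size` characters. Each chunk starts
// `size - overlap` characters after the previous one, so consecutive
// chunks share `overlap` characters.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

For example, `chunkText("abcdefghij", 4, 1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.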

Monitoring

The ingestion process provides real-time feedback:

  • Progress of file processing
  • Number of chunks generated
  • Embedding generation progress
  • Success/failure statistics

Error Handling

Errors are logged with detailed information. Failed operations are automatically retried based on the --max-retries setting.
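The retry behaviour can be pictured with a small wrapper like the one below. It is a sketch, not Osiris's implementation: `withRetries` is a hypothetical helper, and the exponential backoff between attempts is an assumption, since the README only states that failed operations are retried up to --max-retries times.

```typescript
// Runs `operation` once, then retries up to `maxRetries` more times on failure.
// The wait doubles after each failed attempt (assumed backoff policy).
async function withRetries<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // retries exhausted
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

With the default of 3 retries, an operation is attempted at most four times before the last error is rethrown to the caller.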

Integration with Ibis

Osiris is designed to work with the Ibis chat application. Make sure to:

  • Use the same Qdrant instance in both applications
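  • Set the collection name to match Ibis's configuration (default: 'documents'). Osiris's own default is 'website_content', so pass --collection documents explicitly when ingesting for a default Ibis setup.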

Development

Run tests:

npm run test

Watch mode:

npm run test:watch

Generate coverage report:

npm run coverage