4.0.1 • Published 6 months ago

@kessler/embedding v4.0.1

Weekly downloads
-
License
Apache-2.0
Repository
github
Last release
6 months ago

@kessler/embedding (WIP)

This module is built to allow progressive advancement from simple json based embedding database to more advanced solutions like chroma or redis.

quick start

import { loadProviders, Collection } from '@kessler/embedding'

async function main() {
  const { storage, embedders } = await loadProviders({ 
    fs: { directory: '/some/directory' },
    openai: { apiKey: 'openai key here' } 
  })

  const { fs } = storage
  const { openai } = embedders

  await fs.init()
  await openai.init()

  const collection = new Collection('test', openai, fs)
  
  await collection.add('hello world', { created: Date.now() })
  console.log(await collection.query('hello'))

  await fs.shutdown()
  await openai.shutdown()
}

main()

Collections

Collections are the highest abstraction layer. They group together documents, their embedding data and some optional metadata.

class Collection {
  constructor(name, embeddingService, storage) {}
  async query(text, { maxResults = Infinity, threshold = 0.8 }) {}
  async add(text, metadata) {}
  async delete(id) {}
  async get(id) {}
}

Providers

There are two categories for providers: embedding and storage. Embedding providers expose embedding services through a unified interface and storage providers do the same, just for storing and querying documents.

Providers can be loaded and created manually by importing their classes and instantiating them or they can be loaded through loadProviders (see below)

Once a provider is loaded you should call it's init method, regardless of wether you loaded it manually or through load providers. (TODO: i might want to change this behavior)

embedding provider

class Embedder {
  constructor(underlyingProvider, config) {}
  async exec(text, metadata) {}
  async init() {}
  async shutdown() {}
}

TODO: once a document is embedded with one service and stored, the embedding provider cannot be changed, if the embedding scheme is different in the new provider. This must be addressed some how in the design.

storage provider

class MyStorage {
  constructor(underlyingProvider, config) {}
  async query(collectionName, embedding, { maxResults, threshold }) {}
  async add(collectionName, content, embedding, metadata) {}
  async delete(collectionName, id) {}
  async get(collectionName, id) {}
  async init() {}
  async shutdown() {}
  async collections() {}
}

loading automatically

the intent of loadProviders is to load and instatiate any provider that can be loaded, meaning that their peer dependencies exist.

import { loadProviders } from '@kessler/embedding'

async function main() {
  const { storage, embedders } = await loadProviders({ /* ...providers config */ })
  const { pg } = storage
  const { openai } = embedders

  await pg.init()
  await openai.init()
}

main()

loading manually

TBD

embedding providers

openai embedder

Currently the only supported embedding service.

run npm install openai

import { loadProviders } from '@kessler/embedding'

async function main() {
  const { embedders, storage } = await loadProviders({ 
    openai: { apiKey: 'your-api-key' } 
  })

  const { openai } = embedders
  await openai.init()

  // do stuff
  await openai.shutdown()
}

main()

storage providers

File System storage

The simplest non optimized solution, collections are saved on the file system in json files.

Embedding is matched by going through all the existing documents, so not very scalable.

I have plans to implement a better algorithm in the future.

import { loadProviders } from '@kessler/embedding'

async function main() {
  const { embedders, storage } = await loadProviders({ 
    fs: { directory: '/some/path/to/embedding-db' },
  })

  const { fs } = storage
  await fs.init()

  // do stuff
  await fs.shutdown()
}

main()

Postgresql storage

Uses postgresql database with pgvector extension installed.

run npm install pg pgvector (mind the peer dependency versions)

import { loadProviders } from './index.mjs'

async function main() {

  const { embedders, storage } = await loadProviders({ 
    // there are defaults though, database "embedding", localhost, root and no password
    pg: {
      databaseConfig: {
        database: 'embedding',
        user: 'root',
        password: 'shhhhhhhhhhh'
      }
    }
  })
  
  const { pg } = storage
  await pg.init()
  
  // do stuff
  await pg.shutdown()
}

main()

Redis storage

TBD

Chroma storage

TBD

resources

4.0.1

6 months ago

4.0.0

6 months ago

3.0.2

8 months ago

3.0.1

8 months ago

3.0.0

8 months ago

2.0.1

9 months ago

2.0.0

9 months ago

1.1.0

9 months ago