2.48.0 • Published 4 months ago

@memberjunction/ai-vector-dupe v2.48.0

License: ISC

@memberjunction/ai-vector-dupe

A MemberJunction package for identifying and managing duplicate records using AI-powered vector similarity search. This package generates vector representations of records and uses similarity scoring to detect potential duplicates, with options for automatic merging.

Overview

The AI Vector Dupe package detects duplicate records by:

  • Converting records into vector embeddings using AI models
  • Performing similarity searches in vector databases
  • Tracking duplicate detection runs and results
  • Optionally merging duplicates based on configurable thresholds

Installation

npm install @memberjunction/ai-vector-dupe

Prerequisites

  1. MemberJunction Framework: A properly configured MemberJunction database with the core schema
  2. AI Model Provider: API key for embedding models (OpenAI, Mistral, or other supported providers)
  3. Vector Database: A Pinecone index and API credentials (Pinecone is currently the only supported vector database)
  4. Entity Documents: Configured entity documents with templates for the entities you want to analyze

Core Components

DuplicateRecordDetector

The main class that handles duplicate detection operations.

import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
import { PotentialDuplicateRequest, UserInfo } from '@memberjunction/core';

const detector = new DuplicateRecordDetector();

VectorSyncBase

Abstract base class providing utilities for vector synchronization operations.

import { VectorSyncBase } from '@memberjunction/ai-vector-dupe';

EntitySyncConfig

Type definition for entity synchronization configuration.

import { EntitySyncConfig } from '@memberjunction/ai-vector-dupe';

const config: EntitySyncConfig = {
    EntityDocumentID: 'entity-doc-id',
    Interval: 3600,
    RunViewParams: { /* RunView parameters */ },
    IncludeInSync: true,
    LastRunDate: 'January 1, 2024 00:00:00',
    VectorIndexID: 1,
    VectorID: 1
};

Usage

Basic Duplicate Detection

import { DuplicateRecordDetector } from '@memberjunction/ai-vector-dupe';
import { PotentialDuplicateRequest, UserInfo } from '@memberjunction/core';

// Initialize the detector
const detector = new DuplicateRecordDetector();

// Define the request parameters
const request: PotentialDuplicateRequest = {
    ListID: 'your-list-id',           // ID of the list containing records to check
    EntityID: 'your-entity-id',        // ID of the entity type
    EntityDocumentID: 'doc-id',        // ID of the entity document with template
    Options: {
        DuplicateRunID: 'run-id'       // Optional: existing duplicate run to continue
    }
};

// Execute duplicate detection.
// `currentUser` is a UserInfo for the calling user, obtained elsewhere in your application.
const response = await detector.getDuplicateRecords(request, currentUser);

if (response.Status === 'Success') {
    console.log(`Found ${response.PotentialDuplicateResult.length} records with potential duplicates`);
    
    for (const result of response.PotentialDuplicateResult) {
        console.log(`Record ${result.RecordCompositeKey.ToString()}:`);
        for (const duplicate of result.Duplicates) {
            console.log(`  - Potential duplicate: ${duplicate.ToString()} (${(duplicate.ProbabilityScore * 100).toFixed(1)}% match)`);
        }
    }
}

Advanced Configuration

// Configure thresholds via Entity Document settings
// PotentialMatchThreshold: Minimum score to consider as potential duplicate (e.g., 0.8)
// AbsoluteMatchThreshold: Score at which automatic merging occurs (e.g., 0.95)
//
// `vectorizer` and `entityDocumentID` are not defined in this snippet. They are assumed to be
// an entity vectorizer instance (see @memberjunction/ai-vector-sync) and the ID of the
// Entity Document to update.
const entityDocument = await vectorizer.GetEntityDocument(entityDocumentID);
entityDocument.PotentialMatchThreshold = 0.8;  // 80% similarity
entityDocument.AbsoluteMatchThreshold = 0.95;  // 95% for auto-merge
await entityDocument.Save();
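
The two thresholds split similarity scores into three bands: no match, potential duplicate, and automatic merge. The sketch below illustrates that split. `classifyScore`, `Thresholds`, and `MatchKind` are hypothetical names that are not part of this package, and the inclusive `>=` comparison is an assumption.

```typescript
// Hypothetical helper, not part of @memberjunction/ai-vector-dupe.
type MatchKind = "none" | "potential" | "absolute";

interface Thresholds {
    potentialMatchThreshold: number; // e.g. 0.8
    absoluteMatchThreshold: number;  // e.g. 0.95
}

// Map a similarity score to a band. Inclusive comparisons are an assumption.
function classifyScore(score: number, t: Thresholds): MatchKind {
    if (t.potentialMatchThreshold > t.absoluteMatchThreshold) {
        throw new RangeError("potentialMatchThreshold must not exceed absoluteMatchThreshold");
    }
    if (score >= t.absoluteMatchThreshold) return "absolute";
    if (score >= t.potentialMatchThreshold) return "potential";
    return "none";
}

const thresholds: Thresholds = { potentialMatchThreshold: 0.8, absoluteMatchThreshold: 0.95 };
console.log(classifyScore(0.97, thresholds)); // "absolute"
console.log(classifyScore(0.85, thresholds)); // "potential"
console.log(classifyScore(0.5, thresholds));  // "none"
```

Keeping the potential threshold at or below the absolute threshold ensures every auto-merge candidate is also reported as a potential duplicate.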

API Reference

DuplicateRecordDetector

getDuplicateRecords(params: PotentialDuplicateRequest, contextUser?: UserInfo): Promise<PotentialDuplicateResponse>

Performs duplicate detection on records in a list.

Parameters:

  • params: Request parameters including:
    • ListID: ID of the list containing records to analyze
    • EntityID: ID of the entity type
    • EntityDocumentID: ID of the entity document configuration
    • Options: Optional configuration including DuplicateRunID
  • contextUser: Optional user context for permissions

Returns: PotentialDuplicateResponse containing:

  • Status: 'Success' or 'Error'
  • ErrorMessage: Error details if failed
  • PotentialDuplicateResult[]: Array of results for each analyzed record
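
For reporting, it can help to flatten the nested result array into a single ranked list of record/duplicate pairs. The following is a minimal sketch. The `KeyLike`, `DuplicateLike`, and `ResultLike` interfaces are stand-ins that cover only the fields used in this README (the real types come from `@memberjunction/core`), and `flattenResults` is a hypothetical helper.

```typescript
// Stand-in shapes covering only the fields shown in this README (assumption).
interface KeyLike { ToString(): string; }
interface DuplicateLike extends KeyLike { ProbabilityScore: number; }
interface ResultLike { RecordCompositeKey: KeyLike; Duplicates: DuplicateLike[]; }

interface DuplicatePair { record: string; duplicate: string; score: number; }

// Flatten per-record results into pairs, highest score first.
function flattenResults(results: ResultLike[], minScore = 0): DuplicatePair[] {
    const pairs: DuplicatePair[] = [];
    for (const result of results) {
        for (const dup of result.Duplicates) {
            if (dup.ProbabilityScore >= minScore) {
                pairs.push({
                    record: result.RecordCompositeKey.ToString(),
                    duplicate: dup.ToString(),
                    score: dup.ProbabilityScore,
                });
            }
        }
    }
    return pairs.sort((a, b) => b.score - a.score);
}

// Sample data built by hand for illustration.
const key = (text: string): KeyLike => ({ ToString: () => text });
const dup = (text: string, score: number): DuplicateLike => ({ ToString: () => text, ProbabilityScore: score });

const sampleResults: ResultLike[] = [
    { RecordCompositeKey: key("ID=1"), Duplicates: [dup("ID=7", 0.82), dup("ID=9", 0.97)] },
    { RecordCompositeKey: key("ID=2"), Duplicates: [] },
];
const ranked = flattenResults(sampleResults);
console.log(ranked.map((p) => `${p.record} ~ ${p.duplicate}`)); // [ 'ID=1 ~ ID=9', 'ID=1 ~ ID=7' ]
```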

VectorSyncBase

Base class providing utility methods:

  • parseStringTemplate(str: string, obj: any): string - Parse template strings
  • timer(ms: number): Promise<unknown> - Async delay utility
  • start() / end() / timeDiff() - Timing utilities
  • saveJSONData(data: any, path: string) - JSON file operations

Workflow Details

The duplicate detection process follows these steps:

  1. Vectorization: Records are converted to vector embeddings using the configured AI model
  2. Similarity Search: Each vector is compared against others in the vector database
  3. Threshold Filtering: Results are filtered based on the potential match threshold
  4. Result Tracking: All operations are logged in duplicate run tables
  5. Optional Merging: Records exceeding the absolute match threshold are automatically merged
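
Steps 2 and 3 can be pictured with a small in-memory sketch. In the package itself the search runs inside the vector database, so this is illustrative only. Cosine similarity is an assumption here (the actual metric depends on how the index is configured), and `cosineSimilarity`, `findCandidates`, and the sample vectors are hypothetical.

```typescript
// Illustrative only: the package delegates similarity search to the vector database.
interface StoredVector { id: string; values: number[]; }
interface Candidate { id: string; score: number; }

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error("Vectors must have the same length");
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA === 0 || normB === 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare the query against every other stored vector and keep scores >= minScore.
function findCandidates(query: StoredVector, stored: StoredVector[], minScore: number): Candidate[] {
    return stored
        .filter((s) => s.id !== query.id)
        .map((s) => ({ id: s.id, score: cosineSimilarity(query.values, s.values) }))
        .filter((c) => c.score >= minScore)
        .sort((x, y) => y.score - x.score);
}

const stored: StoredVector[] = [
    { id: "a", values: [1, 0, 0] },
    { id: "b", values: [0.9, 0.1, 0] },
    { id: "c", values: [0, 1, 0] },
];
const matches = findCandidates(stored[0], stored, 0.8);
console.log(matches.map((m) => m.id)); // [ 'b' ]
```

A brute-force scan like this is quadratic across a whole dataset, which is why a vector database with an approximate nearest-neighbor index is used in practice.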

Database Schema Integration

The package integrates with these MemberJunction entities:

  • Duplicate Runs: Master record for each duplicate detection execution
  • Duplicate Run Details: Individual record analysis results
  • Duplicate Run Detail Matches: Specific duplicate matches found
  • Lists: Source lists containing records to analyze
  • List Details: Individual records within lists
  • Entity Documents: Configuration for entity vectorization

Configuration

Environment Variables

Create a .env file with:

# AI Model Configuration
OPENAI_API_KEY=your-openai-key
MISTRAL_API_KEY=your-mistral-key

# Vector Database
PINECONE_API_KEY=your-pinecone-key
PINECONE_HOST=your-pinecone-host
PINECONE_DEFAULT_INDEX=your-index-name

# Database Connection
DB_HOST=your-sql-server
DB_PORT=1433
DB_USERNAME=your-username
DB_PASSWORD=your-password
DB_DATABASE=your-database

# User Context
CURRENT_USER_EMAIL=user@example.com
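
A missing variable tends to surface later as a confusing connection error, so it can be worth checking the environment at startup. The sketch below is one way to do that. `findMissingVars` is a hypothetical helper, and which variables are strictly required is an assumption; adjust the list to your setup.

```typescript
// Hypothetical startup check, not part of the package.
type EnvLike = { [name: string]: string | undefined };

// Return the names in `required` that are unset or blank in `env`.
function findMissingVars(env: EnvLike, required: string[]): string[] {
    return required.filter((name) => {
        const value = env[name];
        return value === undefined || value.trim() === "";
    });
}

// Sample environment for illustration; in an application, pass process.env instead.
const sampleEnv: EnvLike = { PINECONE_API_KEY: "abc", PINECONE_HOST: "", DB_HOST: "localhost" };
const missing = findMissingVars(sampleEnv, [
    "PINECONE_API_KEY",
    "PINECONE_HOST",
    "PINECONE_DEFAULT_INDEX",
    "DB_HOST",
]);
console.log(missing); // [ 'PINECONE_HOST', 'PINECONE_DEFAULT_INDEX' ]
```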

Entity Document Templates

Entity documents use template syntax to define how records are converted to text for vectorization:

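// Example template. This is a plain string, not a JavaScript template literal:
// the ${...} placeholders are filled in from each record's fields during vectorization.
const template = "${FirstName} ${LastName} works at ${Company} as ${Title}";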
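
To show how such a template turns a record into text, here is a minimal sketch of placeholder substitution. `renderTemplate` is a hypothetical stand-in, and the package's own `parseStringTemplate` may behave differently (for example, in how it treats missing fields).

```typescript
// Hypothetical stand-in for illustration; not the package's implementation.
function renderTemplate(template: string, record: { [field: string]: unknown }): string {
    // Replace each ${Field} with the record's value; missing fields become "".
    return template.replace(/\$\{(\w+)\}/g, (_match: string, field: string) => {
        const value = record[field];
        return value === undefined || value === null ? "" : String(value);
    });
}

const exampleTemplate = "${FirstName} ${LastName} works at ${Company} as ${Title}";
const text = renderTemplate(exampleTemplate, {
    FirstName: "Ada",
    LastName: "Lovelace",
    Company: "Acme",
    Title: "Engineer",
});
console.log(text); // "Ada Lovelace works at Acme as Engineer"
```

Because the rendered text is what gets embedded, fields left out of the template cannot influence the similarity score.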

Dependencies

  • @memberjunction/ai: AI model abstractions
  • @memberjunction/ai-vectordb: Vector database interfaces
  • @memberjunction/ai-vectors: Vector operations
  • @memberjunction/ai-vectors-pinecone: Pinecone implementation
  • @memberjunction/ai-vector-sync: Entity vectorization
  • @memberjunction/core: Core MJ functionality
  • @memberjunction/core-entities: Entity definitions

Best Practices

  1. Batch Processing: For large datasets, process records in batches to avoid timeouts
  2. Threshold Tuning: Start with conservative thresholds and adjust based on results
  3. Template Design: Create comprehensive templates that capture all relevant fields
  4. Regular Sync: Keep vector databases synchronized with source data
  5. Monitor Performance: Track processing times and optimize for large datasets
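
As a rough illustration of the batch-processing advice, the sketch below splits a list of items into fixed-size batches and processes them one batch at a time. `chunk` and `processInBatches` are hypothetical helpers, not part of this package, and the handler is a placeholder for whatever work is done per batch.

```typescript
// Hypothetical helpers for illustration only.

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
    if (size < 1 || Math.floor(size) !== size) {
        throw new RangeError("size must be a positive integer");
    }
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Run the handler on each batch sequentially, so only one batch is in flight at a time.
async function processInBatches<T, R>(
    items: T[],
    batchSize: number,
    handler: (batch: T[]) => Promise<R>
): Promise<R[]> {
    const results: R[] = [];
    for (const batch of chunk(items, batchSize)) {
        results.push(await handler(batch));
    }
    return results;
}

console.log(chunk([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

Sequential processing trades throughput for predictable load; a small concurrency limit is a reasonable next step if the provider's rate limits allow it.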

Error Handling

Failures are reported in two ways: as a response whose Status is 'Error' with details in ErrorMessage, or as a thrown exception. Handle both:

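try {
    const response = await detector.getDuplicateRecords(request, user);
    if (response.Status === 'Error') {
        console.error('Duplicate detection failed:', response.ErrorMessage);
    }
} catch (error) {
    // `error` is typed as unknown under strict TypeScript settings, so narrow it before reading .message
    const message = error instanceof Error ? error.message : String(error);
    console.error('Unexpected error:', message);
}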
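
If several call sites need the same handling, both failure paths can be folded into a single outcome value. The following is a sketch under stated assumptions. `ResponseLike` is a stand-in covering only the `Status` and `ErrorMessage` fields described above, and `runSafely` and `Outcome` are hypothetical names.

```typescript
// Stand-in for the response fields described in this README (assumption).
interface ResponseLike { Status: "Success" | "Error"; ErrorMessage?: string; }

type Outcome<T> = { ok: true; value: T } | { ok: false; message: string };

// Run a call and normalize both an 'Error' status and a thrown exception into one shape.
async function runSafely<T extends ResponseLike>(call: () => Promise<T>): Promise<Outcome<T>> {
    try {
        const response = await call();
        if (response.Status === "Error") {
            return { ok: false, message: response.ErrorMessage ?? "Unknown error" };
        }
        return { ok: true, value: response };
    } catch (error) {
        return { ok: false, message: error instanceof Error ? error.message : String(error) };
    }
}

// Demo with a fake call that reports an error status.
runSafely(async () => ({ Status: "Error" as const, ErrorMessage: "bad list" })).then((outcome) => {
    console.log(outcome); // { ok: false, message: 'bad list' }
});
```

With the real detector, the call would be wrapped as `runSafely(() => detector.getDuplicateRecords(request, user))`.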

Limitations

  • Currently supports duplicate detection within a single entity type only
  • Requires pre-configured entity documents with templates
  • Vector database support limited to Pinecone
  • Performance depends on vector database query capabilities

Future Enhancements

  • Cross-entity duplicate detection
  • Additional vector database providers
  • Batch processing improvements
  • Real-time duplicate prevention
  • Advanced merge strategies

Support

For issues, questions, or contributions, please refer to the MemberJunction documentation or contact the development team.
