Duplicate Record Checker
An overview of the AI Vector Dupe package. A visual of this package can be found here
AI Vector Dupe Documentation
**AI Vector Dupe** is a package designed to identify duplicate records in a database by generating vector representations and finding similar vectors. Users can then take actions, such as merging or deleting the detected duplicates.
Prerequisites
Before using the package, ensure the following requirements are met:
- **SQL Server with MemberJunction Framework**
  See the MemberJunction Documentation.
- **Embedding Model API Key**
  Supported embedding models include OpenAI, Mistral, and others supported by MemberJunction.
- **Vector Database API Key**
  Currently, only Pinecone is supported for vector storage.
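Before wiring the package up, it can help to fail fast when a key is missing. The sketch below assumes the keys are supplied through environment variables; the names `EMBEDDING_API_KEY` and `PINECONE_API_KEY` and the helper `findMissingKeys` are hypothetical placeholders, not names defined by this package or by MemberJunction.

```javascript
// Hypothetical environment variable names; replace with whatever your deployment uses.
const REQUIRED_KEYS = ['EMBEDDING_API_KEY', 'PINECONE_API_KEY'];

// Returns the names of required keys that are missing or blank in `env`.
function findMissingKeys(env, required = REQUIRED_KEYS) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

const missing = findMissingKeys(process.env);
if (missing.length > 0) {
  console.warn(`Missing API keys: ${missing.join(', ')}`);
}
```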
How to Run the Package
Follow these steps to use the AI Vector Dupe package:
- **Load Required Packages**
  Ensure this package, along with your embedding and vector database packages, is loaded into your application. Verify they are not tree-shaken out.
- **Prepare Records**
  Create a list of records to search for duplicates. Note: Currently, this package supports finding duplicates within the same entity. Support for cross-entity duplicate checks is planned for future updates.
- **Call the `getDuplicateRecords` Function**
  Create an instance of the `DuplicateRecordDetector` class and call the `getDuplicateRecords` function with the following parameters:

  | Parameter | Type | Description |
  | --- | --- | --- |
  | `listID` | `string` | The ID of the list containing the records to analyze. |
  | `entityID` | `string` | The ID of the entity the records belong to. |
  | `probabilityScore` | `number` (optional) | The minimum similarity score to consider a record as a potential duplicate. |

  Return: A `Promise` that resolves after processing. For large datasets, it is recommended not to `await` the result.
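The parameters described above can be checked before a run is started. The following is a sketch, not part of the package: `validateDuplicateParams` is a hypothetical helper, and the assumption that `probabilityScore` lies between 0 and 1 is inferred from the example values in this document rather than stated by the package.

```javascript
// Hypothetical helper: validates the three documented parameters.
// Returns a list of human-readable problems; an empty list means the input looks valid.
function validateDuplicateParams(params) {
  const problems = [];
  if (typeof params.listID !== 'string' || params.listID.length === 0) {
    problems.push('listID must be a non-empty string');
  }
  if (typeof params.entityID !== 'string' || params.entityID.length === 0) {
    problems.push('entityID must be a non-empty string');
  }
  // probabilityScore is optional; the 0..1 range is an assumption.
  if (params.probabilityScore !== undefined) {
    const score = params.probabilityScore;
    if (typeof score !== 'number' || Number.isNaN(score) || score < 0 || score > 1) {
      problems.push('probabilityScore must be a number between 0 and 1');
    }
  }
  return problems;
}

console.log(validateDuplicateParams({ listID: 'example-list-id', entityID: 'example-entity-id' }));
console.log(validateDuplicateParams({ listID: '', entityID: 'example-entity-id', probabilityScore: 1.5 }));
```

Returning a list of problems instead of throwing on the first one lets the caller report every issue at once.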
Workflow: `getDuplicateRecords` Function
The `getDuplicateRecords` function performs the following steps:
- **Fetch Records**
  Fetches the list by `listID` and retrieves all records contained within it.
- **Generate or Fetch Vectors**
  - If configured, generates new vectors for all records associated with the specified `entityID` and upserts them into the vector database.
  - If not configured to upsert new vectors, it queries the vector database to fetch existing vectors for the records.
- **Search for Similar Vectors**
  For each vector, queries the vector database to find n similar vectors (where n is user-specified).
- **Fetch Related Records**
  Fetches database records corresponding to the similar vectors retrieved.
- **Merge Duplicates (Optional)**
  If configured, merges records marked as duplicates into the source record based on a similarity probability threshold.
  - Example: If the similarity score exceeds `0.95`, the record is merged.
- **Track Results**
  Records are created in the database to log:
  - The duplicate record search run.
  - Which records were analyzed.
  - Which records were marked as potential duplicates.
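The similarity search and threshold steps can be illustrated with a self-contained sketch. It uses an in-memory array in place of the vector database and cosine similarity as the score; both are assumptions made for illustration, since the package delegates the search to the vector database and this document does not name the similarity metric. The function names `cosineSimilarity` and `findPotentialDuplicates` are hypothetical.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// For one source record, take the `topK` most similar other records, then keep
// only those whose score meets `threshold`, ordered from most to least similar.
function findPotentialDuplicates(source, candidates, topK, threshold) {
  return candidates
    .filter((c) => c.id !== source.id) // a record is not its own duplicate
    .map((c) => ({ id: c.id, score: cosineSimilarity(source.vector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .filter((match) => match.score >= threshold);
}

// Toy vectors standing in for embeddings stored in the vector database.
const records = [
  { id: 'r1', vector: [1, 0, 0] },
  { id: 'r2', vector: [0.99, 0.1, 0] },
  { id: 'r3', vector: [0, 1, 0] },
];

const matches = findPotentialDuplicates(records[0], records, 2, 0.9);
console.log(matches.map((m) => m.id)); // prints [ 'r2' ]
```

Here `topK` plays the role of the user-specified n, and `threshold` plays the role of `probabilityScore`.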
Example Usage
Here is an example of how to use the package:
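const { DuplicateRecordDetector } = require('ai-vector-dupe');
// Create an instance of the DuplicateRecordDetector
const detector = new DuplicateRecordDetector();
// Call getDuplicateRecords without awaiting it (recommended for large datasets),
// but still handle a possible rejection so failures are not silently lost.
detector
  .getDuplicateRecords({
    listID: 'example-list-id',     // ID of the list containing the records to analyze
    entityID: 'example-entity-id', // ID of the entity the records belong to
    probabilityScore: 0.9          // optional minimum similarity score
  })
  .catch((err) => console.error('Duplicate detection failed:', err));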
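Because the function returns a `Promise`, the caller decides whether to wait for it. The sketch below uses a stub class, `StubDetector`, as a hypothetical stand-in for `DuplicateRecordDetector` so that it runs without a database; the resolved value and the `startInBackground` helper are assumptions for illustration only.

```javascript
// Hypothetical stand-in for DuplicateRecordDetector; it only simulates a short delay.
class StubDetector {
  getDuplicateRecords(params) {
    return new Promise((resolve, reject) => {
      if (!params.listID || !params.entityID) {
        reject(new Error('listID and entityID are required'));
        return;
      }
      setTimeout(() => resolve({ status: 'done', listID: params.listID }), 10);
    });
  }
}

const events = [];

// Fire-and-forget: start the run, attach handlers, and return immediately.
function startInBackground(detector, params) {
  return detector
    .getDuplicateRecords(params)
    .then((result) => events.push(`finished:${result.status}`))
    .catch((err) => events.push(`failed:${err.message}`));
}

const pending = startInBackground(new StubDetector(), {
  listID: 'example-list-id',
  entityID: 'example-entity-id',
});
events.push('caller continues'); // runs before the stubbed detection finishes

pending.then(() => console.log(events)); // prints [ 'caller continues', 'finished:done' ]
```

Attaching `.then` and `.catch` up front keeps the call non-blocking while still recording the outcome of the run.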