Similarity & Duplicate

Claro maintains a similarity graph across records using a combination of embedding similarity and rule-based matching. Duplicate detection, supplier match, variant grouping, and entity resolution all read from this graph.

What the graph contains

Every record in a catalogue is embedded across the attributes you select. The graph stores:

Pairwise similarity scores — between records in the same catalogue, and across catalogues you’ve connected.
Cluster memberships — high-confidence groups of records likely to be the same entity or close variants.
Match decisions — every approve/reject decision is remembered and used as positive or negative training signal for future matching.

The graph is updated continuously as records change, are added, or are removed.
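One way to picture how pairwise scores turn into cluster memberships is connected components over high-scoring pairs. The sketch below is illustrative only: the function name cluster_pairs, the min_score cut-off, and the record IDs are assumptions, and Claro's actual clustering method is not described here.

```python
from collections import defaultdict


def cluster_pairs(scores, min_score):
    """Group record IDs into clusters by linking pairs scored >= min_score.

    `scores` maps (record_a, record_b) tuples to a similarity in [0, 1].
    Uses a small union-find; this is a stand-in technique, not Claro's own.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= min_score:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for record in parent:
        groups[find(record)].add(record)
    # Singletons are not clusters, so drop them.
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical pairwise scores for six records.
pairwise = {
    ("r1", "r2"): 0.97,
    ("r2", "r3"): 0.95,
    ("r3", "r4"): 0.40,
    ("r5", "r6"): 0.99,
}
clusters = cluster_pairs(pairwise, min_score=0.9)
```

Here r1, r2, and r3 end up in one cluster because they are linked transitively, r5 and r6 form a second, and r4 stays unclustered because its only pair scores below the cut-off.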

The Similarity & Duplicate surface

The page is a single workspace for inspecting and acting on the graph.

Cluster browser — proposed clusters of likely duplicates or variants, sortable by size, average similarity, and confidence.
Pair view — for any two records, a side-by-side diff with attribute-level similarity scores.
Threshold controls — set the score above which proposals auto-merge, and the score below which they’re discarded.
Field weights — configure which attributes drive similarity (e.g. weight gtin and mpn higher than description).
Reasoning — for every proposed merge, the contributing fields and their individual scores.
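To make field weights and per-field reasoning concrete, here is a minimal sketch of a weighted score that also returns each field's individual contribution. It assumes a simple weighted average and uses Python's difflib string ratio as a stand-in for Claro's embedding and rule-based similarity. The function names, weights, and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in per-field similarity in [0, 1]; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(rec_a, rec_b, weights):
    """Return (overall score, per-field scores) as a weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    per_field = {
        field: field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field in weights
    }
    overall = sum(weights[f] * per_field[f] for f in weights) / total_weight
    return overall, per_field


# Hypothetical weights: identifiers count for more than free text.
weights = {"gtin": 5.0, "mpn": 3.0, "description": 1.0}
rec_a = {"gtin": "00012345678905", "mpn": "AB-100",
         "description": "Blue widget, 10 pack"}
rec_b = {"gtin": "00012345678905", "mpn": "AB-100",
         "description": "Widget blue (pack of 10)"}

overall, per_field = score_pair(rec_a, rec_b, weights)
for field, value in per_field.items():
    print(f"{field}: {value:.2f}")
print(f"overall: {overall:.2f}")
```

Because gtin and mpn match exactly and carry most of the weight, the differing descriptions pull the overall score down only slightly.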

How operations use the graph

Find Duplicates — surfaces clusters of likely duplicate records and produces merge proposals you can approve in bulk.
Find Similarities — surfaces broader similarity clusters useful for variant grouping, supplier match, and cross-catalogue linking.
Data Source Mapping — when an uploaded file lands as a Data Source, the graph is queried to match each row to existing records before merge.
Bulk Enrichment can pull values from similar records when filling gaps.

Merging and rejecting

When you approve a merge proposal:

The records are combined into a single record.
Conflicting attribute values resolve via configured rules (most-recent, highest-confidence, source priority, manual choice).
All references and history from both records are preserved on the survivor.
The merge is reversible from the record’s history.
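The conflict rules named above can be sketched as a small resolver. This is an illustration under assumptions: the AttrValue shape, the rule strings, and the source names are hypothetical, and the product may track or apply these rules differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class AttrValue:
    """Hypothetical attribute value with the metadata the rules need."""
    value: str
    updated_at: datetime
    confidence: float
    source: str


def resolve(a: AttrValue, b: AttrValue, rule: str,
            source_priority: Sequence[str] = ()) -> Optional[AttrValue]:
    """Pick the surviving value, or None when a person must choose."""
    if a.value == b.value:
        return a  # no conflict
    if rule == "most-recent":
        return max((a, b), key=lambda v: v.updated_at)
    if rule == "highest-confidence":
        return max((a, b), key=lambda v: v.confidence)
    if rule == "source-priority":
        # Earlier in the list wins; unlisted sources rank last.
        rank = {name: i for i, name in enumerate(source_priority)}
        return min((a, b), key=lambda v: rank.get(v.source, len(rank)))
    if rule == "manual-choice":
        return None
    raise ValueError(f"unknown rule: {rule}")


old = AttrValue("Acme Ltd", datetime(2023, 1, 5), 0.90, "erp")
new = AttrValue("ACME Limited", datetime(2024, 6, 1), 0.60, "supplier_feed")

by_recency = resolve(old, new, "most-recent")
by_confidence = resolve(old, new, "highest-confidence")
by_source = resolve(old, new, "source-priority", ["erp", "supplier_feed"])
```

The same conflict resolves differently depending on the rule: most-recent keeps the newer supplier value, while highest-confidence and the erp-first source priority both keep the older one.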

When you reject a pair:

The pair is added to a negative-examples set used to suppress similar proposals in the future.
The negative-examples set is workspace-scoped and improves match precision over time.
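As a rough sketch of how a negative-examples set could suppress proposals, the class below remembers rejected pairs regardless of order and filters them out of later proposal lists. It only blocks the exact pair that was rejected. Suppressing merely similar proposals, as described above, would need a learned or rule-based step that is not shown. The class and method names are hypothetical.

```python
class NegativeExamples:
    """Hypothetical workspace-level store of rejected record pairs."""

    def __init__(self):
        self._rejected = set()

    def reject(self, a, b):
        # frozenset makes (a, b) and (b, a) the same key.
        self._rejected.add(frozenset((a, b)))

    def is_rejected(self, a, b):
        return frozenset((a, b)) in self._rejected

    def filter(self, proposals):
        """Drop proposals whose pair was rejected earlier."""
        return [(a, b, score) for a, b, score in proposals
                if not self.is_rejected(a, b)]


negatives = NegativeExamples()
negatives.reject("r1", "r7")

# Proposals are (record_a, record_b, score) tuples.
proposals = [("r7", "r1", 0.91), ("r2", "r3", 0.88)]
remaining = negatives.filter(proposals)
```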

Tuning matching

You’ll typically tune three things per catalogue:

Field weights — which attributes matter most for identity. gtin should dominate description.
Auto-merge threshold — high enough that auto-applied merges are virtually always correct.
Review threshold — the floor for proposals that queue in Notifications. Below this, proposals are discarded.

Defaults are conservative. Adjust them once you’ve reviewed a few hundred merges and have a sense of where the score distribution lands for your data.
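The two thresholds split proposals into three outcomes, which a short sketch can make explicit. The threshold values 0.95 and 0.70 are placeholders rather than Claro's defaults, and treating a score exactly equal to a threshold as passing it is an assumption.

```python
from collections import Counter


def triage(score, auto_merge=0.95, review=0.70):
    """Classify a proposal score as auto-merge, review, or discard."""
    if review > auto_merge:
        raise ValueError("review threshold must not exceed auto-merge threshold")
    if score >= auto_merge:
        return "auto-merge"
    if score >= review:
        return "review"  # queues for a person to approve or reject
    return "discard"


# Hypothetical scores from a batch of proposals.
scores = [0.99, 0.96, 0.88, 0.74, 0.69, 0.31]
outcomes = Counter(triage(s) for s in scores)
print(outcomes)
```

Counting outcomes over your own reviewed proposals is a quick way to see how a threshold change would shift work between auto-merge and the review queue.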
