Needlebase Data Integration: Semantic Deduplication and Data Cleansing
Multi-source data—whether from websites, feeds, or private uploads—is inevitably riddled with redundant, incomplete, or mutually contradictory information. Cleaning up those conflicts has typically required custom programming, which is expensive, brittle, and time-consuming. By contrast, Needlebase's semantic deduplication algorithms empower your content team to clean the data efficiently on their own, with no programming required. Needlebase helps by:
- automatically mapping data from all sources into one consistent data model;
- automatically merging data items that agree on key properties (e.g., restaurants that share the same name and address);
- automatically accumulating other properties (e.g., a restaurant's reviews) across all sources; and
- automatically identifying clusters of similar items and proposing them as candidates for manual merging.
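The merging and accumulating steps in the list above can be illustrated with a minimal sketch. It is not Needlebase's implementation. The field names, the `normalize` rule, and the `merge_records` helper are all hypothetical, chosen only to mirror the restaurant example.

```python
# Hypothetical data model: key fields identify an item, accumulated fields
# are collected across every source that mentions that item.
KEY_FIELDS = ("name", "address")
ACCUMULATED_FIELDS = ("reviews",)


def normalize(value):
    """Assumed normalization rule: lowercase and collapse whitespace."""
    return " ".join(value.lower().split())


def merge_records(records):
    """Merge records that agree on all key fields; accumulate the rest."""
    merged = {}
    for record in records:
        key = tuple(normalize(record[field]) for field in KEY_FIELDS)
        if key not in merged:
            # The first spelling seen is kept as the displayed value.
            merged[key] = {field: record[field] for field in KEY_FIELDS}
            for field in ACCUMULATED_FIELDS:
                merged[key][field] = []
        for field in ACCUMULATED_FIELDS:
            for value in record.get(field, []):
                if value not in merged[key][field]:
                    merged[key][field].append(value)
    return list(merged.values())


# Records as they might arrive from two different sources.
records = [
    {"name": "Luigi's Trattoria", "address": "12 Main St", "reviews": ["Great pasta"]},
    {"name": "luigi's  trattoria", "address": "12 main st", "reviews": ["Slow service"]},
    {"name": "Blue Door Cafe", "address": "4 Elm Ave", "reviews": []},
]
restaurants = merge_records(records)
```

In this sketch the two spellings of the first restaurant collapse into one item that carries both reviews, while the unrelated cafe stays separate.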
This diagram illustrates how normalization and semantic deduplication work together to reconcile data from multiple sources:

The screenshot below shows Needlebase's deduplication suggestions for a restaurant database. The deduplication workflow offers three levels of suggestions, each requiring a different degree of manual inspection before merging: "same names", "similar names", and "related names".
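One way to picture the three suggestion levels is a tiered name comparison. The rules below are assumptions made for illustration only, since the page does not define how Needlebase assigns each level. Here "same" means equal after normalization, "similar" means a high character-level similarity ratio, and "related" means at least one shared word. The `suggestion_level` function and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher

SIMILAR_THRESHOLD = 0.85  # assumed cutoff, not a documented Needlebase value


def normalize(name):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def suggestion_level(a, b):
    """Return the assumed suggestion level for two names, or None."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return "same names"
    if SequenceMatcher(None, na, nb).ratio() >= SIMILAR_THRESHOLD:
        return "similar names"
    if set(na.split()) & set(nb.split()):
        return "related names"
    return None


levels = [
    suggestion_level("Cafe Roma", "cafe  roma"),
    suggestion_level("Cafe Roma", "Cafe Romaa"),
    suggestion_level("Cafe Roma", "Roma Pizzeria and Grill"),
    suggestion_level("Cafe Roma", "Blue Door"),
]
```

The looser the match, the more inspection a human reviewer would want before merging, which is the point of presenting the levels separately.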

Needlebase also makes it easy to merge, edit, or delete incorrect data manually. Such edits are stored permanently in the Needlebase graphbase, linked directly to the affected data items, so that when the incorrect data is re-acquired from its original source, your corrections and deletions remain in effect.
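The idea of corrections that survive re-acquisition can be sketched as a log of edits kept separately from the acquired data and re-applied after every import. This is a simplified illustration, not the graphbase's actual storage scheme. The `CorrectionLog` class, the item identifiers, and the `crawl` stand-in are hypothetical.

```python
class CorrectionLog:
    """Stores manual edits and deletions keyed by a stable item identifier."""

    def __init__(self):
        self.overrides = {}   # item_id -> {field: corrected value}
        self.deleted = set()  # item_ids removed by an editor

    def edit(self, item_id, field, value):
        self.overrides.setdefault(item_id, {})[field] = value

    def delete(self, item_id):
        self.deleted.add(item_id)

    def apply(self, acquired):
        """Return freshly acquired items with stored corrections re-applied."""
        cleaned = {}
        for item_id, item in acquired.items():
            if item_id in self.deleted:
                continue  # a deleted item stays deleted after every import
            cleaned[item_id] = {**item, **self.overrides.get(item_id, {})}
        return cleaned


def crawl():
    # Stand-in for re-acquiring the same, still incorrect, source data.
    return {
        "r1": {"name": "Cafe Roma", "phone": "555-0199"},
        "r2": {"name": "Test Listing", "phone": ""},
    }


log = CorrectionLog()
log.edit("r1", "phone", "555-0100")  # fix a wrong phone number
log.delete("r2")                     # remove a bogus listing

first = log.apply(crawl())
second = log.apply(crawl())  # a later re-acquisition yields the same result
```

Because the corrections are attached to item identity rather than written over the imported values, a repeat crawl cannot undo them.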
Put together, Needlebase's suite of merging and cleansing tools results in dramatic productivity improvements. On one real-world dataset, Needlebase reduced the labor required for cleaning by a factor of more than 20 while improving the quality of the end result.
Needlebase's productivity improvements are made possible by its novel database architecture, the graphbase, which was built from the ground up for data reconciliation. For each database, a central data model governs the data layout, normalization rules, and automatic merging behavior of each data type. The data model is configured through a simple form UI (see below), and can be updated at any time.




