Needlebase Data Integration: Semantic Deduplication and Data Cleansing

Multi-source data, whether from websites, feeds, or private uploads, is inevitably riddled with redundant, incomplete, or mutually contradictory information. Cleaning up those conflicts has typically required custom programming, which is expensive, brittle, and time-consuming. By contrast, Needlebase's semantic deduplication algorithms let your content team clean the data efficiently on their own, with no programming required. Needlebase helps by:

  • automatically mapping data from all sources into one consistent data model;
  • automatically merging data items that agree on key properties (e.g., restaurants that share the same name and address);
  • automatically accumulating other properties (e.g., a restaurant's reviews) across all sources; and
  • automatically identifying clusters of similar items and proposing them as candidates for manual merging.
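
As a rough illustration of the second and third points, the following Python sketch merges records that agree on their key properties and accumulates a multi-valued property across sources. It is not Needlebase's implementation: the function names, the choice of `name` and `address` as keys, and the lowercase-and-collapse-whitespace normalization are all assumptions made for the example.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(text.lower().split())


def merge_records(records, key_fields=("name", "address"), multi_fields=("reviews",)):
    """Merge records that agree on every key field after normalization.

    Values of multi-valued fields are accumulated across all merged records,
    skipping exact duplicates. The first record seen supplies the display
    values of the key fields.
    """
    merged = {}
    for rec in records:
        key = tuple(normalize(rec[f]) for f in key_fields)
        item = merged.setdefault(key, {f: rec[f] for f in key_fields})
        for f in multi_fields:
            bucket = item.setdefault(f, [])
            for value in rec.get(f, []):
                if value not in bucket:
                    bucket.append(value)
    return list(merged.values())


# Hypothetical records from two sources describing the same restaurant.
records = [
    {"name": "Joe's Diner", "address": "12 Main St", "reviews": ["Great pie"]},
    {"name": "joe's  diner", "address": "12 main st", "reviews": ["Slow service"]},
    {"name": "Blue Cafe", "address": "4 Elm Ave", "reviews": []},
]
merged = merge_records(records)
```

Here the two spellings of the diner collapse into one item carrying both reviews, while the cafe remains a separate item.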

This diagram illustrates how normalization and semantic deduplication work together to reconcile data from multiple sources:

 

The screenshot below shows Needlebase's deduplication suggestions for a restaurant database. Needlebase provides a deduplication workflow with three levels of suggestions, requiring different degrees of manual inspection before merging: "same names", "similar names", and "related names".
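
One way to picture the three suggestion levels is as bands of string similarity. The sketch below uses Python's standard `difflib.SequenceMatcher` for scoring. The scoring method and the thresholds 0.85 and 0.6 are assumptions chosen for illustration, not Needlebase's actual matching rules.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing (assumed rule)."""
    return " ".join(name.lower().split())


def suggestion_tier(a, b, similar=0.85, related=0.6):
    """Classify a pair of names into one of three suggestion levels.

    Returns "same names", "similar names", "related names", or None when
    the pair is too different to suggest. Thresholds are illustrative.
    """
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return "same names"
    ratio = SequenceMatcher(None, na, nb).ratio()
    if ratio >= similar:
        return "similar names"
    if ratio >= related:
        return "related names"
    return None


# Hypothetical name pairs, from safest to riskiest merge candidates.
tier_same = suggestion_tier("Joe's Diner", "JOE'S  DINER")
tier_similar = suggestion_tier("Joe's Diner", "Joes Diner")
tier_related = suggestion_tier("Joe's Diner", "Joe's Diner and Bar")
tier_none = suggestion_tier("Joe's Diner", "Blue Cafe")
```

The lower the tier, the more manual inspection a pair deserves before merging, which matches the graded workflow described above.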

 

Needlebase also makes it easy to merge, edit, or delete incorrect data manually.  Such edits are stored permanently in the Needlebase graphbase, linked directly to the affected data items, so that when the incorrect data is re-acquired from its original source, your corrections and deletions remain in effect.
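
The idea of corrections that outlive a refresh can be sketched as an edit log kept separately from the acquired data and reapplied after every acquisition. The `EditLog` class and its method names below are hypothetical. The graphbase's real storage format is not described here.

```python
class EditLog:
    """Manual corrections and deletions, stored apart from source data."""

    def __init__(self):
        self.overrides = {}   # item id -> {field: corrected value}
        self.deleted = set()  # ids of items an editor deleted

    def correct(self, item_id, field, value):
        self.overrides.setdefault(item_id, {})[field] = value

    def delete(self, item_id):
        self.deleted.add(item_id)

    def undo_delete(self, item_id):
        self.deleted.discard(item_id)

    def apply(self, acquired):
        """Return a cleaned copy of freshly acquired data; the input is untouched."""
        cleaned = {}
        for item_id, props in acquired.items():
            if item_id in self.deleted:
                continue
            fixed = dict(props)
            fixed.update(self.overrides.get(item_id, {}))
            cleaned[item_id] = fixed
        return cleaned


log = EditLog()
log.correct("r1", "phone", "555-0100")
log.delete("r2")

# A hypothetical re-crawl that brings back the same wrong data.
crawl = {
    "r1": {"name": "Joe's Diner", "phone": "555-9999"},
    "r2": {"name": "Spam Listing"},
}
cleaned = log.apply(crawl)
```

Because the log is applied on top of whatever the source returns, re-acquiring the incorrect phone number or the deleted listing does not bring the mistake back, and removing an entry from the log undoes the edit.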

Put together, Needlebase's suite of merging and cleansing tools results in dramatic productivity improvements. Click here to learn how Needlebase reduced the labor required to clean one real-world dataset by a factor of more than 20, while improving the quality of the end result.

Needlebase's productivity improvements are made possible by its novel database architecture, the graphbase, which was built from the ground up for data reconciliation. For each database, a central data model governs the data layout, normalization rules, and automatic merging behavior of each data type. The data model is configured through a simple form UI (see below), and can be updated at any time.
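
To make the role of a central data model concrete, here is a minimal sketch of a per-type model that declares, for each property, how it is normalized and whether it takes part in automatic merging. The class names, fields, and normalization rule are assumptions for illustration and do not reflect Needlebase's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


def squash(text):
    """Assumed normalization rule: trim, lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


@dataclass
class PropertyRule:
    normalize: Callable[[str], str]
    is_key: bool = False        # key properties drive automatic merging
    multi_valued: bool = False  # multi-valued properties accumulate


@dataclass
class TypeModel:
    name: str
    properties: Dict[str, PropertyRule] = field(default_factory=dict)

    def merge_key(self, record):
        """Normalized values of the key properties, in declaration order."""
        return tuple(
            rule.normalize(record[prop])
            for prop, rule in self.properties.items()
            if rule.is_key
        )


# Hypothetical model for the restaurant example.
restaurant = TypeModel("restaurant", {
    "name": PropertyRule(squash, is_key=True),
    "address": PropertyRule(squash, is_key=True),
    "reviews": PropertyRule(squash, multi_valued=True),
})

key_a = restaurant.merge_key({"name": " Joe's  Diner", "address": "12 MAIN st"})
key_b = restaurant.merge_key({"name": "joe's diner", "address": "12 Main St"})
```

Two records with equal merge keys would be merged automatically. Changing a rule in the model changes the keys, which is one way a model update could alter merging behavior without any custom code.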

 

 

Data Merging and Cleansing Features

Merging
  • automatically merges semantically equivalent data items from different sources into one unified item
  • guides the user through an efficient process for accepting or rejecting groups of semantically similar data items
  • makes manual deduplication a breeze with "drag-and-drop" merging
  • allows past merges to be reviewed and undone as needed, with no loss of data
Metadata
  • tracks the source for each piece of data in the system; with one click, database editors can jump directly to a cached copy of the item's source web page or feed record
  • allows data sources to be removed from the collection, without losing data merged from other sources
Editing
  • provides a convenient UI for correcting or deleting data
  • preserves edits and deletions across re-acquisition, so mistakes don't reappear when a data source is refreshed
  • allows edits and deletions to be undone at any time
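
The metadata features above depend on remembering which source asserted each value. A minimal sketch of that bookkeeping follows, assuming a hypothetical `MergedItem` class that records a set of source names per value. Removing a source then only discards values that no other source supports.

```python
class MergedItem:
    """A merged data item that tracks the sources behind every value."""

    def __init__(self):
        # property -> value -> set of source names that asserted the value
        self.values = {}

    def add(self, prop, value, source):
        self.values.setdefault(prop, {}).setdefault(value, set()).add(source)

    def sources_of(self, prop, value):
        return set(self.values.get(prop, {}).get(value, set()))

    def remove_source(self, source):
        """Drop one source; keep any value still backed by another source."""
        for prop in list(self.values):
            for value in list(self.values[prop]):
                self.values[prop][value].discard(source)
                if not self.values[prop][value]:
                    del self.values[prop][value]
            if not self.values[prop]:
                del self.values[prop]

    def snapshot(self):
        return {prop: sorted(vals) for prop, vals in self.values.items()}


# Hypothetical item merged from a website and a feed.
item = MergedItem()
item.add("name", "Joe's Diner", "site_a")
item.add("name", "Joe's Diner", "feed_b")
item.add("reviews", "Great pie", "site_a")
item.add("reviews", "Slow service", "feed_b")

before = item.snapshot()
item.remove_source("site_a")
after = item.snapshot()
```

After `site_a` is removed, the name survives because `feed_b` also supplied it, while the review that came only from `site_a` is gone.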


 

 


Mass Technology Leadership Council - 2010 Finalist

Copyright © 2010-2011 ITA Software, Inc.