Needle Data Integration: Semantic Deduplication and Data Cleansing

Multi-source data, whether from websites, feeds, or private uploads, is inevitably riddled with redundant, incomplete, or mutually contradictory information. Cleaning up those conflicts has typically required custom programming, which is expensive, brittle, and time-consuming. By contrast, Needle's semantic deduplication algorithms let your content team clean the data efficiently on their own, with no programming required. Needle helps by:

  • automatically mapping data from all sources into one consistent data model
  • automatically merging data items that agree on key properties (e.g., restaurants that share the same name and address)
  • automatically accumulating other properties (e.g., a restaurant's reviews) across all sources
  • automatically identifying clusters of similar items and proposing them as candidates for manual merging.
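The automatic steps listed above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Needle's actual implementation: the record layout, the `normalize` helper, and the choice of (name, address) as the merge key are all assumptions made for the example.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())

def merge_records(records):
    """Merge records that agree on (name, address); accumulate reviews and sources.

    Hypothetical record layout: dicts with 'source', 'name', 'address', 'reviews'.
    """
    merged = {}
    for rec in records:
        key = (normalize(rec["name"]), normalize(rec["address"]))
        item = merged.setdefault(
            key,
            {"name": rec["name"], "address": rec["address"], "reviews": [], "sources": []},
        )
        # Accumulate non-key properties across every source that mentions the item.
        item["reviews"].extend(rec.get("reviews", []))
        if rec["source"] not in item["sources"]:
            item["sources"].append(rec["source"])
    return list(merged.values())

records = [
    {"source": "site_a", "name": "Luigi's Trattoria", "address": "12 Main St", "reviews": ["Great pasta"]},
    {"source": "feed_b", "name": "luigi's  trattoria", "address": "12 main st", "reviews": ["Slow service"]},
    {"source": "feed_b", "name": "Blue Door Cafe", "address": "4 Elm Ave", "reviews": []},
]
merged = merge_records(records)
```

Here the two Luigi's records collapse into one item that carries both reviews and both source names, while the cafe stays separate.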

This diagram illustrates how normalization and semantic deduplication work together to reconcile data from multiple sources:

 

The screenshot below shows Needle's deduplication suggestions for a restaurant database. Needle provides a deduplication workflow with three levels of suggestions, each requiring a different degree of manual inspection before merging: "same names", "similar names", and "related names".
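One way to picture the three suggestion levels is as progressively looser name comparisons. The sketch below is an assumption-laden illustration, not Needle's algorithm: the `suggestion_level` function, the 0.85 similarity threshold, and the shared-word rule for "related names" are all hypothetical choices, and the similarity score comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop apostrophes, and collapse whitespace."""
    return " ".join(name.lower().replace("'", "").split())

def suggestion_level(a, b, similar_threshold=0.85):
    """Classify a pair of names into one of three hypothetical suggestion levels."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return "same names"        # safe to merge with minimal inspection
    if SequenceMatcher(None, na, nb).ratio() >= similar_threshold:
        return "similar names"     # likely duplicates, worth a quick look
    if set(na.split()) & set(nb.split()):
        return "related names"     # share a word, needs careful inspection
    return None                    # no suggestion

level_same = suggestion_level("Luigi's Trattoria", "LUIGIS TRATTORIA")
level_similar = suggestion_level("Luigis Trattoria", "Luigi Trattoria")
level_related = suggestion_level("Luigi's Trattoria", "Trattoria Roma")
level_none = suggestion_level("Blue Door Cafe", "Trattoria Roma")
```

The looser the match, the more manual review a suggested merge would deserve before it is accepted.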

 

Needle also makes it easy to merge, edit, or delete incorrect data manually.  Such edits are stored permanently in the Needle graphbase, linked directly to the affected data items, so that when the incorrect data is re-acquired from its original source, your corrections and deletions remain in effect.
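The idea of corrections that outlive a refresh can be sketched as an overlay kept separately from the acquired data and reapplied after every acquisition. The `CorrectionStore` class and its methods below are hypothetical names invented for this example and say nothing about how the graphbase stores edits internally.

```python
class CorrectionStore:
    """Keeps manual edits and deletions apart from acquired data (illustrative only)."""

    def __init__(self):
        self._overrides = {}   # item_id -> {property: corrected value}
        self._deleted = set()  # item_ids removed by an editor

    def edit(self, item_id, prop, value):
        self._overrides.setdefault(item_id, {})[prop] = value

    def delete(self, item_id):
        self._deleted.add(item_id)

    def apply(self, acquired):
        """Return acquired data with all stored corrections applied on top."""
        result = {}
        for item_id, props in acquired.items():
            if item_id in self._deleted:
                continue
            result[item_id] = {**props, **self._overrides.get(item_id, {})}
        return result

store = CorrectionStore()
store.edit("r1", "phone", "555-0100")
store.delete("r2")

# The source still serves the wrong phone number and the unwanted item on every refresh.
refresh = {"r1": {"name": "Luigi's", "phone": "555-9999"}, "r2": {"name": "Spam Listing"}}
first = store.apply(refresh)
second = store.apply(refresh)  # a later re-acquisition yields the same corrected view
```

Because the corrections are keyed to the item rather than written into the acquired copy, a refresh from the source cannot overwrite them.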

Put together, Needle's suite of merging and cleansing tools results in dramatic productivity improvements. On one real-world dataset, Needle reduced the labor required for cleaning by a factor of more than 20, while improving the quality of the end result.

Needle's productivity improvements are made possible by its novel database architecture, the graphbase, which was built from the ground up for data reconciliation. For each domain, a central data model governs the data layout, normalization rules, and automatic merging behavior of each data type. The data model is configured through a simple form UI (see below), and can be updated at any time.

 

 

Data Merging and Cleansing Features

Merging
  • automatically merges semantically equivalent data items from different sources into one unified item
  • guides the user through an efficient process for accepting or rejecting groups of semantically similar data items
  • makes manual deduplication a breeze with "drag-and-drop" merging
  • allows past merges to be reviewed and undone as needed, with no loss of data
Metadata
  • tracks the source for each piece of data in the system; with one click, domain editors can jump directly to a cached copy of the item's source web page or feed record
  • allows data sources to be removed from the collection, without losing data merged from other sources
Editing
  • provides a convenient UI for correcting or deleting data
  • edits and deletions survive re-acquisition, so mistakes don't reappear when a data source is refreshed
  • edits and deletions can be undone at any time
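The metadata features above depend on remembering which source asserted each value. A minimal sketch of that bookkeeping, assuming a hypothetical `MergedItem` class (not part of any Needle API), might look like this:

```python
from collections import defaultdict

class MergedItem:
    """Tracks, for each property value, the set of sources that asserted it."""

    def __init__(self):
        # property -> value -> set of source names
        self.values = defaultdict(lambda: defaultdict(set))

    def add(self, source, prop, value):
        self.values[prop][value].add(source)

    def sources_for(self, prop, value):
        return sorted(self.values.get(prop, {}).get(value, ()))

    def remove_source(self, source):
        """Drop one source; values still backed by another source are kept."""
        for prop in list(self.values):
            for value in list(self.values[prop]):
                self.values[prop][value].discard(source)
                if not self.values[prop][value]:
                    del self.values[prop][value]
            if not self.values[prop]:
                del self.values[prop]

item = MergedItem()
item.add("site_a", "phone", "555-0100")
item.add("feed_b", "phone", "555-0100")
item.add("site_a", "hours", "9-5")
item.remove_source("site_a")
```

After `site_a` is removed, the phone number survives because `feed_b` also supplied it, while the hours, which only `site_a` provided, are dropped.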


Copyright © 2010 ITA Software, Inc. · Terms of Use