Needlebase Data Acquisition: Web-Scraping and Beyond
Acquiring structured data from rich, deeply linked websites and web data feeds has never been easier. Using Needlebase through an ordinary web browser, your data team can visually "tag" any website's contents with the fields of your desired data model (see illustration below). Given a few tagged examples, Needlebase learns both the navigational pattern and the data pattern of the website. Needlebase imports Excel spreadsheets, CSV files, and XML feeds the same way. All data is automatically normalized and merged into your database, which is securely hosted in ITA Software's datacenter. The result? Rich content aggregations and data mash-ups that can be either exported to your on-premises database or published directly from the cloud to support consumer web and mobile applications.

Acquisition Features
Trainable web-scraping
- Import data from complex websites via a simple data-tagging interface. Tag whole fields, parts of a field, or spans across fields: after a few examples, Needlebase learns your pattern. No knowledge of programming, scripting, HTML DOM structure, or regular expressions is required.
- Illuminate the "deep web": Needlebase fills in forms and follows paging and detail links according to your tagging pattern.
- Handles most dynamic AJAX, JavaScript, and Web 2.0 site designs.
- Schedules regular automatic updates for frequently updated sources.
- Detects, reports on, and allows quick recovery from website structure changes.
Direct feed import
- Imports data from XML, CSV, and Excel files.
- Extracts data elements from whole fields or from within fields.
- Supports bulk data upload via a secure API.
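To make the idea of importing different feed formats "the same way" concrete, here is a minimal sketch in Python. It is not Needlebase's implementation or API; the function names `records_from_csv` and `records_from_xml`, the `item_tag` parameter, and the sample field names are all hypothetical. It only illustrates how a CSV feed and an XML feed can be read into the same list-of-records shape before any further processing.

```python
import csv
import io
import xml.etree.ElementTree as ET


def records_from_csv(text):
    """Read CSV text with a header row into a list of field -> value dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_xml(text, item_tag):
    """Read XML text into a list of dicts, one per <item_tag> element.

    Each direct child element of an item becomes one field (hypothetical
    convention chosen for this sketch).
    """
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter(item_tag)
    ]


# Two hypothetical feeds describing the same product.
csv_feed = "name,price\nWidget,9.99\n"
xml_feed = "<items><item><name>Widget</name><price>9.99</price></item></items>"

csv_records = records_from_csv(csv_feed)
xml_records = records_from_xml(xml_feed, "item")
```

Once both feeds are in the same record shape, downstream steps such as normalization and merging no longer need to know which format a record came from.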
Data normalization
- Normalizes common data types including dates, times, names, titles, numbers, URLs, phone numbers, and prices.
- Geocodes addresses (U.S., Canada, Europe, Australia).
- Automatically restructures all extracted data into a consistent target data model, no matter how it was organized at the source.
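The normalization features listed above can be illustrated with a small Python sketch. This is an assumption-laden example, not Needlebase's actual logic: the helper names, the list of accepted date formats, the US/Canada-only phone rule, and the `field_map` used for restructuring are all hypothetical choices made for illustration.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical set of date layouts a source site might use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y")


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_phone(text):
    """Return a 10-digit US/Canada number as +1XXXXXXXXXX, or None."""
    digits = re.sub(r"\D", "", text)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return "+1" + digits if len(digits) == 10 else None


def normalize_price(text):
    """Return the amount as a Decimal, dropping symbols and thousands commas."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


def restructure(record, field_map):
    """Rename source fields to target-model fields; unmapped fields are dropped."""
    return {target: record[source]
            for source, target in field_map.items() if source in record}


# A hypothetical scraped record and the target model it should fit.
source = {"Listed": "March 5, 2010", "Tel": "(617) 555-0123", "Cost": "$1,299.00"}
renamed = restructure(source, {"Listed": "date", "Tel": "phone", "Cost": "price"})
clean = {
    "date": normalize_date(renamed["date"]),
    "phone": normalize_phone(renamed["phone"]),
    "price": normalize_price(renamed["price"]),
}
```

Returning `None` for values that match no known pattern, rather than guessing, keeps unrecognized data visible for review instead of silently storing a wrong value.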
Next: Needlebase Data Integration