BarCamp Demo: Scraping and Analyzing the Olympics

We were pleased to present Needle at last weekend's BarCamp Boston, right down the street at the MIT Stata Center. BarCamp is an unconference, focused on open formats and web technologies, and was a great place to show off Needle. We didn't have time to demonstrate everything that Needle can do, so we focused on our data acquisition features, along with some basic data analysis, in a presentation titled Scraping and Analyzing the Olympics. Let me walk you through how we pulled in this data.

We started with a simple question: 'What country has won the most total Winter Olympics medals in the history of the Games?' Although all of the historical Olympics data was freely available on vancouver2010.com, the site itself could not answer our question. Instead, we built a quick-and-dirty domain model and trained our system to pull in the historical medal data from vancouver2010.com. This involved walking through two levels of the website's menus, then drilling into a form submission and extracting data for the selected form option. Once we had generated all of the form submissions we were interested in, we wound up with 23 pages of data, one for each of the historical Winter Olympics. Our machine learning algorithms pulled in this data easily, and once the data was in our system, we could manipulate and analyze it!
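For readers who want to try the extraction step by hand, here is a minimal Python sketch. It is not how Needle works: Needle learns the extraction from training, while this is a hand-written parser built on the standard library. The markup in SAMPLE_PAGE is an assumption (the real vancouver2010.com page structure is not described here), and the country names and counts are placeholders, not real medal data.

```python
from html.parser import HTMLParser

# Hypothetical markup and placeholder counts, for illustration only.
SAMPLE_PAGE = """
<table>
  <tr><th>Country</th><th>Gold</th><th>Silver</th><th>Bronze</th></tr>
  <tr><td>Country A</td><td>2</td><td>1</td><td>0</td></tr>
  <tr><td>Country B</td><td>0</td><td>3</td><td>1</td></tr>
</table>
"""


class MedalTableParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so their row list is
            # empty and they are skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def parse_medal_page(html):
    """Turn one results page into a list of per-country medal records."""
    parser = MedalTableParser()
    parser.feed(html)
    return [
        {
            "country": country.strip(),
            "gold": int(gold),
            "silver": int(silver),
            "bronze": int(bronze),
        }
        for country, gold, silver, bronze in parser.rows
    ]


records = parse_medal_page(SAMPLE_PAGE)
```

In a real run, parse_medal_page would be called once per results page, and the records from all pages would be combined before any analysis.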

[Image: medalscrape]

Above: Our data acquisition tool, as we train it to recognize the interesting data.

At first glance, the answer is Norway, with 280 total medals. We found this answer with minimal analysis, by simply creating a new column in our visualizer that would total all historical awards.
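The "total" column is easy to reproduce outside our visualizer. The sketch below assumes one record per country per Games, with hypothetical field names (year, country, gold, silver, bronze), and uses placeholder numbers rather than the real medal data.

```python
from collections import Counter

# Placeholder records, one per (Games, country). Not the real medal data.
records = [
    {"year": 1, "country": "Country A", "gold": 2, "silver": 1, "bronze": 0},
    {"year": 1, "country": "Country B", "gold": 0, "silver": 3, "bronze": 1},
    {"year": 2, "country": "Country A", "gold": 1, "silver": 0, "bronze": 2},
    {"year": 2, "country": "Country B", "gold": 1, "silver": 0, "bronze": 0},
]


def total_medals(records):
    """Sum gold, silver, and bronze per country across all Games."""
    totals = Counter()
    for record in records:
        totals[record["country"]] += (
            record["gold"] + record["silver"] + record["bronze"]
        )
    return totals


totals = total_medals(records)
leader, leader_count = totals.most_common(1)[0]
```

With these placeholder rows, leader is "Country A" with 6 medals, ahead of "Country B" with 5.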

The truth is not that simple, however. If you look a little deeper at the data, you will see that Germany has won awards under four names (as listed on vancouver2010.com): Germany, German Dem Republic, Fed Repub Germany, and United Germany. Germany won medals from 1928 to 1936, in 1952, and from 1992 to 2006. United Germany won awards between 1956 and 1964. Fed Repub Germany and German Dem Republic won awards between 1968 and 1988, with one exception: Fed Repub Germany also won awards in 1952.

Using Needle's deduplication facilities, we could easily reconcile the four Germanys. Considered together, Germany has won the most Winter Olympics medals, with 328!
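The idea behind the merge can be sketched in a few lines of Python. This is a hand-written alias table, not Needle's deduplication facility. The four German names come from the listing above, while "Country X" and all of the counts are placeholders chosen only to show how a merge can change the leader.

```python
from collections import Counter

# Map each variant name to the canonical name it should be merged into.
ALIASES = {
    "German Dem Republic": "Germany",
    "Fed Repub Germany": "Germany",
    "United Germany": "Germany",
}


def merge_totals(totals, aliases):
    """Re-total medal counts after folding alias names into one entry."""
    merged = Counter()
    for name, count in totals.items():
        merged[aliases.get(name, name)] += count
    return merged


# Placeholder per-name totals. Not the real medal data.
per_name = {
    "Country X": 10,
    "Germany": 5,
    "German Dem Republic": 4,
    "Fed Repub Germany": 3,
    "United Germany": 2,
}

merged = merge_totals(per_name, ALIASES)
leader, leader_count = merged.most_common(1)[0]
```

Before the merge, "Country X" leads with 10. After it, the combined Germany entry leads with 14, which mirrors what happened with the real data.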

[Image: germany]

Above: Our view of the other objects that have been merged into our Germany object in our data visualizer.

We built this small demo domain quickly to show those at BarCamp how to use Needle. For richer data on this winter's Olympics, be sure to check our Olympics domain. I particularly like the Athlete Ages and Faces saved queries!

— Josh Ain, Needle developer

 

Copyright © 2010-2011 ITA Software, Inc.