Data In, Better Data Out

There are three ways to get data into Needle: type it in yourself, import it from some structured format (usually CSV or XML), or scrape it off the sort-of-structured web. For the Pazz & Jop we did some of all three. 

Structured Import

The main information, of course, was the ballot data itself, which the Voice's tech department shipped us in a single big CSV file. CSV is as straightforwardly structured as you can get, technically speaking, but this can be less helpful than you might imagine. The ballot database is a perfectly illustrative example of what happens when you try to force interconnected data into a flat database because the pain of maintaining a non-flat relational database is so great. It looks kind of like this:

  • Voter: glenn mcdonald
  • Voter ID: 685521
  • Ballot ID: 1563062
  • Ballot Year: 2009
  • Album 1: Eight Ways
  • Album 1 Artist: Madder Mortem
  • Album 1 Label: Peaceville
  • Album 1 Points: 15
  • Album 2: Privilegivm
  • Album 2 Artist: Secrets of the Moon
  • Album 2 Label: Prophecy Productions
  • Album 2 Points: 14
  • ...

This structure exactly matches the ballot input form, which is no doubt how it came to be. As a human, looking at it, you can pretty easily make sense of what it's saying.

Try to get the computer to do anything with it, though, and you're in for immediate and lasting pain. For merely the most obvious example, say you wanted to count the votes for Eight Ways. Different voters may have listed it in different slots, so you can't just ask how many ballots say "Eight Ways" in the "Album 1" slot, you have to ask how many say "Eight Ways" in the "Album 1" slot or say "Eight Ways" in the "Album 2" slot or say "Eight Ways" in the "Album 3" slot, etc. If there were more than 10 slots, you'd probably give up before you even began.

And it isn't even that simple, because some other band might also have an album called Eight Ways, so what you really have to ask is how many ballots say "Eight Ways" in the "Album 1" slot and "Madder Mortem" in the "Album 1 Artist" slot, or "Eight Ways" in the "Album 2" slot and "Madder Mortem" in the "Album 2 Artist" slot, or etc. Ugly, ugly, ugly.

It gets even worse if you try to ask questions about the relationships between things. Like, what label is Eight Ways on? This is about the simplest question you can ask about this data: one relationship between one album and one label. But in this flattened layout, albums don't actually have relationships to labels. The album has a relationship to the ballot, and the label has a relationship to the ballot. The fact that they are listed next to each other, when the fields are sorted a particular way, is readable to a person, but basically useless to the database. The closest we could come to getting this album's label from the flat data would be asking for all the "Album 1 Label" values from all the ballots that say "Eight Ways" in the "Album 1" slot (and "Madder Mortem" in the "Album 1 Artist" slot, of course), plus all the "Album 2 Label" values from all the ballots that say "Eight Ways" in the "Album 2" slot, etc. This is the kind of thing that makes you want to put a chair-leg through your computer screen.


In Needle, then, we try to model the data the way it actually relates. A voter has a ballot. A ballot has some album votes and some single votes. An album vote is a combination of a point-score and an album. An album has an artist and a label. This is how these real concepts relate to each other in the real world. And so this is how they relate to each other in Needle.

So yes, different ballots can have votes for the same album, listing it in different positions and giving it different points. But in Needle, all of them point to the same album. And the album points to its label, directly; that question should never need to involve ballots at all.

Thankfully, importing squashed-flat CSVs into networks of correct relationships in Needle is easy. The import UI lets you tag bits of example data with the structure you want, and the transformations happen automatically. Here, for example, is what this ballot data looks like during import: 

tagging

That's the data itself on the left side, and the green Voter ID, Voter, Album, Artist, etc. labels are the tags we've applied. The target structure is shown on the right, and the green counts show how many of each thing we've found. You only have to tag a little bit of a CSV file, so the numbers are showing you that this slice has 6 ballots, which between them have (in this flat format) 60 albums, 60 point-values, etc. After conversion, the data looks more like this:

imported data

So it turns out that only one voter voted for this album, after all. But at least now we know.

Web Scraping

The "Vote" list, above, shows only one vote, but the "_Source" list shows three sources. This brings us to web-scraping, the other big way to get lots of data into Needle at once. For publication the Voice wanted labels associated with as many albums as possible. Voters are asked to provide them, but often don't, and sometimes do and get them wrong. So to help with this process, we scraped a long list of 2009 releases from the music-news site Pause and Play. The scraping works much like the CSV importing, except that in addition to tagging the data you want, you also show Needle how to navigate through the site, clicking links or filling out forms or whatever. Here's a piece of the release list:

scraping

Again, the page itself is on the left, and the green tags are what we did to identify the pieces of data we wanted. After you tag a few examples you hit the purple "guess" button and Needle tries to find the rest of them.

The published poll-results don't have anything about genre, but for this list we were also interested to know which albums qualified as Metal of some sort. So we also scraped a 2009 Metal release-list from the metal blog Invisible Oranges, and merged both these release-lists together with the ballot data. And thus, with very little additional work, we now had a pretty good idea about labels and one genre, as well as a couple extra points of reference for resolving name variations.

These two bits of web data were fairly simple, but Needle can fill out forms, follow links, click through Javascript and AJAX sites, combined paged results, parse or assemble dates, geo-code addresses, and accumulate or update changing data without undoing the effects of the merging and correction you did on the old versions.

Editing

After we got all this data imported, integrated and tabulated, there were still some things that were just wrong. There is no band called "C Oati Mundi", for example. Perhaps there should be. The computer doesn't know about bands or South/Central-American mammals, and this one only got one vote, so it didn't matter for tabulation purposes, but still, there's no reason for data to be wrong if it doesn't have to be. Everything in Needle is editable, so we just fixed it to "Coati Mundi".

It's also easy to add new relationships in Needle. One of the geeky tabulating rules for the Pazz & Jop is that votes for remixes of songs are included in the totals for the original songs (as long as it isn't the same voter voting for both). But there was no way to represent this information in the flat database, so this rule has, up until now, always been applied by humans looking at the lists over and over trying to spot the remixes and see where the counts might matter.

In Needle we simply added an extra link from song to song, called "Remix", did a query for all the songs with "remix" in the title (or a few other variations), and hooked the remixes up to their corresponding originals (and thus, automatically, vice versa). This allowed us to do the remix accounting programmatically, for every song on the list, instead of by guessing or human scanning. Thus, although the Voice's published tallies only bother noting remix points (and, similarly, 2008 carryover votes, which have their own complicated rule) for the songs at the top of the list where an extra vote or two might mean making the top 10 or 20 or missing it, the Needle version of the song chart actually shows the complete scoring for the whole set.

Republishing

The Voice has an existing system for publishing Pazz & Jop results from their database, and it wasn't a goal to redo that. So after cleaning and checking and collating all the ballots in Needle, we had to get all this lovely data back into its original flat structure. Mercifully, this is no harder than getting it in. Needle's data-exploration environment lets you easily make tables with whatever columns you need, and lets you export any table you make as a CSV file. So we shipped the clean data back to the Voice in the format they needed, and they had their winners. And then we noticed one or two more arcane and borderline irrelevant errors, fixed them, and shipped even cleaner data back to them, just because we could.

But that was only the beginning of the real fun. You tend to think of a lot more questions when you can ask them straightforwardly. So in the Needle version of this data, we can see a lot more than just static lists of winners. Click your way down the Index on the left to see winners by points or votes, 2008 carryover statistics, enthusiasm scores, what happens if you retabulate the poll with various subsets of voters, the exact distributions of points each album got from each voter, voter-overlap among the top 10 albums, charts by artist or by label, and several other interesting slices and rotations. Every bit of data on the screen is live and connected, so if you see an album you can click to find out what other songs the same band did, who voted for those songs, what those voters liked last year, etc.

All these things in the Needle UI operate behind the scenes by constructing queries in Thread, the system's query language. Thread is also available directly as a programmatic API, so you can also take the views and analyses you do in Needle and use them as the bases for your own external processing. Thus All·Idols 2009, for example, which takes all this Needle data and runs some additional correlation and similarity analysis to produce even more statistics on voters and albums.

And lastly, just to make sure you don't misconstrue this effusion of music-poll introspection, keep in mind that Needle, the underlying software, doesn't know anything about music or music polls. Everything we did with this data was done by a couple people, in a few hours, starting from a completely general-purpose data system that would be just as happy to handle carbon-emissions data or nutritional information or sports statistics or social networks. What we did with music, we can do with your data.

Better yet, what we did with music, maybe you can do with your data.

 

with a Google account


Explore sample
Needlebase domains

 

 

Mass Technology Leadership Council - 2010 Finalist

badge150x50-finalist

Follow needlebase on Twitter

Careers at ITA Software

Copyright © 2010-2011 ITA Software, Inc. · Careers · Contact · Terms of Use · Privacy