Tabulating Without Misery

If you've ever tried to tabulate more than a page or two of human-entered data, you will have learned three things:

  1. Humans are erratic.
  2. Computers are pedantic.
  3. Database software seems almost universally designed to inflict #2 on #1.

Needle is an attempt to do a data system differently: to use computer efficiency and precision to magnify human efforts and quantify human insights, and actually alleviate human tedium and misery instead of vindictively cataloging it.

The Village Voice has been running their annual Pazz & Jop Critics' Poll for almost 40 years. This year 697 music critics voted in it. There's no nomination process and you can vote for anything, so the ballots just have blanks for typing: album, artist, label and points for albums; song, artist and label for songs.

This might not seem like that much data: it's less than 1MB if you just dump it out into a file. Easy to store, load or transfer. A modern computer can process this much data in a fraction of a second, humming happily, and tell us the winners. Unless, of course, you actually care about getting the winners right. 697 humans times 72 typing-blanks per ballot equals 50184 pieces of source data, which means 50184 opportunities for human error and variation. A computer will happily tell you which of those are the same, but to get the winners, somebody has to figure out which of those 50184 things were meant to be the same.

Take, for just one example, the song "Empire State of Mind" by Jay-Z Featuring Alicia Keys. The title is fairly simple and memorable, but even so, some voters capitalized "of" and some didn't, some put quotes around the title and some didn't, and one voter mistakenly referred to it as "New York State of Mind". The artist credit is a mess: do you write out "Featuring" or abbreviate it "Ft." or even "With"? Or do you just write "Jay-Z"? And, for that matter, does "Jay-Z" have a hyphen or not? 89 voters came up with 28 different ways to answer these questions, not even counting the people who put the song in the artist blank, and vice versa.
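To make the flavor of that variation concrete, here is a rough Python sketch of the kind of crude comparison key a computer could use to notice that many of those 28 spellings are probably the same thing. This is not Needle's code: the name match_key and the rules inside it are invented for illustration, and real data would need more of them.

```python
import re

# Hypothetical rule for illustration only: treat these words as the same credit.
FEATURING = re.compile(r"\b(?:featuring|feat|ft|with)\b\.?")

def match_key(text):
    """Reduce a typed-in title or artist credit to a crude comparison key."""
    key = text.casefold()               # "Of" and "of" become the same
    key = FEATURING.sub(" feat ", key)  # unify "Featuring", "Ft.", "With"
    return re.sub(r"[\W_]+", "", key)   # drop quotes, hyphens, spaces, punctuation

for typed in ["Jay-Z Featuring Alicia Keys", "Jay Z ft. Alicia Keys", "Jay-Z"]:
    print(typed, "->", match_key(typed))
```

A key like this collapses the capitalization, quoting, hyphen and "Featuring" variants, but it does nothing for the voter who typed just "Jay-Z", or "New York State of Mind". Those still need a person.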

For decades, the editors and interns at the Voice have fought against their data with heroic, excruciating human effort: scan the lists over and over again, trying to notice and fix every error and discrepancy, one vote at a time. The computers have stood by, bearing mute witness to their suffering. This is pathetic, unnecessary and inexcusable. Software does not have to be designed to act like an anonymous cabal of resentful career bureaucrats, waiting for the human users to err and then punishing them with meticulously obtuse literal-mindedness. Data correction like this is a particularly terrible human task because it asks us to behave like machines exactly against our own strengths and natures: Humans are masters of overlooking differences, of seeing straight through errors and variations to the common patterns.

In Needle, we are trying to turn this around. The whole point of computers, we think, is to do the things computers are better at than humans, while eliciting human guidance for the decisions where humans know better than computers, or care more. Needle attempts (among many other things) to do for data-cleanup what spell-check does for spelling: the computer is great at finding things that might be wrong, and people are great at looking through these suggestions and saying yes or no or figuring out what to do instead.
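The spell-check analogy can be sketched in a few lines of Python. This is an assumption-laden toy, not a description of how Needle finds its suggestions: suggest_merges is a made-up name, the 0.85 threshold is a guess, and the similarity measure is simply the one in Python's standard difflib module. The point is only the division of labor: the computer proposes pairs, a person accepts or rejects them.

```python
from difflib import SequenceMatcher
from itertools import combinations

def suggest_merges(names, threshold=0.85):
    """Return (score, a, b) for pairs of distinct names that look alike, best first."""
    suggestions = []
    for a, b in combinations(sorted(set(names)), 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
        if score >= threshold:
            suggestions.append((round(score, 2), a, b))
    return sorted(suggestions, reverse=True)

titles = ["Empire State of Mind", "Empire State Of Mind", "Bad Romance"]
for score, a, b in suggest_merges(titles):
    print(f"{score}: merge {a!r} and {b!r}?")   # a human answers yes or no
```

Set the threshold lower and you get more suggestions to reject; set it higher and more real duplicates slip past. Comparing every pair is also fine for a toy list but would need something smarter at scale.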

Finding the errors, of course, is only half the problem. In conventional database systems, fixing problems is almost as painful as finding them. If you fix the first 23 Jay-Z+Alicia references one way, and then realize when you get to the 24th one that you should have used that version instead, you have to go back and fix the first 23 again, and hope desperately that you don't miss any. But in Needle, fixing the first 23 references turns them all into one reference that merely appears in 23 places. If you change your mind, you can edit it once. And when you combine things, you say which one is right, so if you like the 24th reference better, you just merge 1-23 into 24 instead of vice versa. No typing necessary at all.
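Here is one way to picture "one reference that merely appears in 23 places", as a minimal Python sketch. The Entity class and merge function are hypothetical names for illustration, not Needle's actual data model: each vote points at an entity, and merging redirects the duplicates to whichever one you decide is right.

```python
class Entity:
    """One real-world thing (a song, an artist) that many votes can point to."""

    def __init__(self, name):
        self.name = name
        self.merged_into = None   # set when this entity is folded into another

    def resolve(self):
        """Follow merge links to the surviving entity."""
        entity = self
        while entity.merged_into is not None:
            entity = entity.merged_into
        return entity

def merge(duplicates, keeper):
    """Fold each duplicate into keeper; the keeper's name wins, nothing is retyped."""
    keeper = keeper.resolve()
    for dup in duplicates:
        dup = dup.resolve()
        if dup is not keeper:     # never link an entity to itself
            dup.merged_into = keeper

votes = [Entity("Jay Z ft. Alicia Keys"),
         Entity("Jay-Z feat Alicia Keys"),
         Entity("Jay-Z Featuring Alicia Keys")]
merge(votes[:2], keeper=votes[2])             # like merging 1-23 into 24
votes[2].name = "Jay-Z featuring Alicia Keys"  # change your mind: one edit
print({v.resolve().name for v in votes})
```

Because every vote resolves to the same surviving entity, a later correction is made in one place and shows up everywhere that entity appears.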


So here is the magnitude of this difference: in 2008, using a conventional database and a bunch of Excel spreadsheets, the Voice spent about three person-weeks just trying to get the album- and song-names right, resigning themselves to showing whatever artists and labels the voters typed in, because fixing those, too, required more effort than they could expend. And they got most of the important stuff right, at the top of the winners lists, but they missed dozens of variations farther down. On one hand, it's just a music poll, so what does it matter? But on the other hand, what's the point of counting things if you're not going to count them right? And if you're a computer, what's the point of existing if you can't count things right?

This year we slurped all the ballots into Needle, and one person reconciled albums, songs, artists and labels in about six working hours. That included matching up album-artists with song-artists, and 2009 albums/songs/artists/labels with those from 2008. Most of even that much time was spent in the human task of researching what the correct names and titles were supposed to be. Easy human work to make easy human decisions. In Needle terms this is a small data-set and a small project. In human terms, as Voice music editor Rob Harvilla put it, "this is nothing short of miraculous".


Needle is a domain-agnostic data system. It knows nothing special about music or music polls. What it did for this data, it can do for your data. If you are data, Needle is where you want to live. If you care about data, Needle is where you want it to be. If you remember how much your last data experience hurt, we want you to know that the next one doesn't have to.

 


Copyright © 2010 ITA Software, Inc. · Terms of Use