Check Your Lists

Data-modeling structures may sometimes seem like topics strictly for data-modeling geeks, but here's an example from the Oscars to explain why they matter.

Here is a very tiny amount of information from the history of the Academy Awards:

Best Art Direction Nominations

Year  Film              Nominees
1941  Sergeant York     John Hughes, Fred MacLean
1943  This Is the Army  George James Hopkins, John Hughes, Lt John Koenig

You and I, as people, look at this little table and understand it pretty much effortlessly. It's the kind of thing we'd type into a spreadsheet. Looking at it, a couple of questions probably occur to us immediately:

  • Yow, how many nominations did John Hughes get?!
  • That's not the same John Hughes who wrote Pretty in Pink, is it?

Assuming the spreadsheet has the rest of the Academy Awards data, too, you'd think it could answer the first question for us really easily. But probably it can't.

The problem is that the spreadsheet knows about rows, columns and cells, but that's it. It doesn't know about lists inside of cells. It doesn't even know they are lists. To the spreadsheet, "John Hughes, Fred MacLean" is basically the same as "Martin Luther King, Jr." The comma is just another symbol. And if we ask about "John Hughes", that's some other thing again. So from this data the spreadsheet will tell us that "John Hughes" has no nominations at all, because there are no rows whose "Nominees" cell says exactly "John Hughes". Thanks, spreadsheet. If we wanted a stupid answer we could have just made it up ourselves.
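To make that concrete, here is a minimal Python sketch of the two rows above stored the way a spreadsheet stores them, one plain string per cell. The function names are invented for this illustration, and the comma-splitting "fix" is a naive assumption of ours, not something any particular spreadsheet does.

```python
# The two rows from the table above, one plain string per cell.
rows = [
    (1941, "Sergeant York", "John Hughes, Fred MacLean"),
    (1943, "This Is the Army", "George James Hopkins, John Hughes, Lt John Koenig"),
]

def exact_match_count(rows, name):
    """Count rows whose Nominees cell is exactly `name`."""
    return sum(1 for _, _, nominees in rows if nominees == name)

def list_aware_count(rows, name):
    """Split each Nominees cell on commas, then count rows that list `name`."""
    return sum(
        1 for _, _, nominees in rows
        if name in [n.strip() for n in nominees.split(",")]
    )

naive = exact_match_count(rows, "John Hughes")   # 0: no cell is exactly "John Hughes"
aware = list_aware_count(rows, "John Hughes")    # 2: he appears in both lists

# The same comma-splitting trick mangles a single name that contains a comma.
split_king = [p.strip() for p in "Martin Luther King, Jr.".split(",")]
# split_king is ["Martin Luther King", "Jr."]
```

Splitting on commas gets the right answer here, but only because we guessed correctly that these particular commas separate people. The cell itself never said so.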

But pretend we fixed that, somehow, and assume the spreadsheet has a bunch of other movie data, too, and you'd think then it could answer our second question. But this one is even more hopeless. To the spreadsheet, the words "John Hughes" are just letters in a particular order. If we find those letters in that order in 8 places, we have no idea whether they represent 8 different people named "John Hughes", 1 person named "John Hughes" with a multi-disciplinary 57-year movie career, or some other number of people in between. So all the spreadsheet can tell us is "Yep, both those people are called 'John Hughes'!". Which is, of course, why we were asking the question in the first place.
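Here is a rough sketch of the difference between a name and a person. The `Person` class and the `person_id` values are hypothetical, made up for this example. The only facts taken from the data are that one John Hughes is the art director in the table above and the other wrote Pretty in Pink.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    person_id: str  # hypothetical identifier, invented for this sketch
    name: str
    note: str

art_director = Person("p1", "John Hughes", "art director nominated in 1941 and 1943")
screenwriter = Person("p2", "John Hughes", "wrote Pretty in Pink")

# Comparing names only tells us the letters match.
same_letters = art_director.name == screenwriter.name            # True

# Comparing identities answers the question we actually asked.
same_person = art_director.person_id == screenwriter.person_id   # False
```

A spreadsheet cell only ever holds the `name` part, so the first comparison is the only one it can make.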


These aren't hard problems in the science sense, of course. A relational database is technically capable of representing lists as lists, and of representing people as specific individuals instead of just plain strings of letters. The catch, however, is that the relational database structure required is significantly more complicated than a simple table, so you'll need a database programmer to even be able to enter data into it, much less ask it questions. And every place you need a list is another separate piece of required rework. Did you know that historically some Academy Awards were given to individual people for their work on multiple films? Give that database programmer a call again.
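For a sense of what "more complicated than a simple table" means, here is one possible relational layout for just the two rows above, sketched with Python's built-in `sqlite3` module. The table and column names are our own assumptions for illustration, not a schema anybody actually uses for Oscar data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE nomination (
        id   INTEGER PRIMARY KEY,
        year INTEGER NOT NULL,
        film TEXT NOT NULL
    );
    -- The "list in a cell" becomes a separate linking table.
    CREATE TABLE nomination_person (
        nomination_id INTEGER NOT NULL REFERENCES nomination(id),
        person_id     INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (nomination_id, person_id)
    );
""")

conn.executemany("INSERT INTO person (id, name) VALUES (?, ?)", [
    (1, "John Hughes"), (2, "Fred MacLean"),
    (3, "George James Hopkins"), (4, "Lt John Koenig"),
])
conn.executemany("INSERT INTO nomination (id, year, film) VALUES (?, ?, ?)", [
    (1, 1941, "Sergeant York"), (2, 1943, "This Is the Army"),
])
conn.executemany("INSERT INTO nomination_person VALUES (?, ?)", [
    (1, 1), (1, 2), (2, 3), (2, 1), (2, 4),
])

# How many of these nominations include person 1 (John Hughes)?
hughes_count = conn.execute(
    "SELECT COUNT(*) FROM nomination_person WHERE person_id = ?", (1,)
).fetchone()[0]

# Which films were they for?
hughes_films = [row[0] for row in conn.execute(
    """
    SELECT n.film
    FROM nomination AS n
    JOIN nomination_person AS np ON np.nomination_id = n.id
    WHERE np.person_id = ?
    ORDER BY n.year
    """,
    (1,),
)]
```

That is three tables and a join to represent what looked like one three-column table. In this sketch `film` is still a single value per nomination, so an award that spans multiple films would need yet another linking table.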

Except in the real world you probably can't afford to call the database programmer again, or else it sucked the last time you did, so you stick with your spreadsheet, and you hope some person somewhere already figured out the answers to your questions and wrote up the answers in Wikipedia or IMDb.

It shouldn't be like this. It should be just as easy to put 2 or 5 or 100 things in a "cell" as it is to put 1, and you should easily know whether you have 1 or 2 or 5 or 100, and nothing about your data-model, your UI or the form of your questions should have to change as a result of whichever it is.
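As a toy illustration of that idea (and only that: this is a sketch of the principle, not how Needle is implemented), suppose every field of every record is a list, whether it holds zero, one, or many values. The record layout and the `count_where` helper are assumptions made up for this example.

```python
# Every field is a list, even when it happens to hold a single value.
records = [
    {"year": [1941], "film": ["Sergeant York"],
     "nominees": ["John Hughes", "Fred MacLean"]},
    {"year": [1943], "film": ["This Is the Army"],
     "nominees": ["George James Hopkins", "John Hughes", "Lt John Koenig"]},
]

def count_where(records, field, value):
    """Count records whose `field` list contains `value`.

    A missing field is treated as an empty list, so the question has the
    same form whether the field holds 0, 1, or 100 things.
    """
    return sum(1 for record in records if value in record.get(field, []))

many = count_where(records, "nominees", "John Hughes")   # field with several values
one = count_where(records, "film", "Sergeant York")      # field with a single value
none = count_where(records, "composer", "John Hughes")   # field that is absent
```

The question is asked the same way all three times. Nothing about it changes depending on how many things happen to be in the "cell".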

Here's Needle's version of these two nominations:

[Image: Needle - Oscar History - John Hughes nominations]

The lists are lists. Most of the world is composed of lists, so basically everything in Needle is a list. Some of the lists have many things, some have one, and some are empty, but you don't have to worry which it is, or even know ahead of time. Click on John Hughes and you'll see everything this dataset has about him, including the fact that he had a total of three nominations from three films. Easy.

For the other question, here's our John Hughes comparison:

[Image: Needle - Oscar History - John Hugheses]

So no, not the same guy. Also: wartime art director I never heard of: 3 Oscar nominations; prolific and formative influence on my childhood: 0 Oscar nominations. Sad. Don't ask the questions if you don't want to know the answers...


And yes, this is only movies, no big deal. But Oscar History is just the most recent thing we happen to have added to Needle. The same issues apply to all data, and thus to everybody whose lives are touched by data, which is pretty much everybody. Replace Oscar nominations with your doctor visits, and "John Hughes" with something to which you're violently allergic, and it won't be "no big deal" anymore.

 

 


Copyright © 2010-2011 ITA Software, Inc.