What's In a (Standardized) Name

Down in DC a few weeks ago, a friend of mine had the gall to say, "you know, you're not only a politics geek, you're a real geek geek." The nerve of the guy. This post isn't going to lessen my geek rep one iota, but whatever. What I have to report is pure awesome and I don't care who knows it. This morning, I was reading a blog post by Clay Johnson, director of the Sunlight Foundation's Labs, about what's next for the Labs, and a throwaway mention gave me that prickly sense down the back of my neck that I get when I know I've stumbled across something powerfully good: innovations in name standardization that will streamline fundraising reports, regulatory records, and more. Gadzooks! Does it get more exciting?

To realize how neat a prospect this is, you have to know what problem it solves. Here's the Sunlight wiki where the idea is being hashed over:

Names of entities-- donors, members of congress, corporations, even governments are not called the same thing between documents or databases or even in the same document. For instance, in the case of the Federal Election Commission data files, donors can be called William Smith, Billy Smith, Billy Smith, JR. or a plethora of other names. Corporations go beyond this by having multiple names-- Lorne Michaels is not only the executive producer for Saturday Night Live, but the CEO of Broadway Video and an employee of NBC Studios, a subsidiary of General Electric.

The fact that Jim Jones and James Jones III are one and the same person is a challenge to transparency, because if we never come to know that they're both the same guy, the quality of the data that powers good government drops considerably. And so, the Labs are trying to whip up algorithms and filtering techniques that boil names down to their most basic and consistent form. Once they crack that nut, they can share that knowledge with the rest of us. In some cases, Sunlight has already solved some aspects of the problem. An API now publicly available pulls members of Congress's names from a central database, so that typing "Teddy Kennedy" into a Google spreadsheet, for example, automagically resolves to "Edward Kennedy." Think that's awesome? Me too, me too.
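For the curious, here is a minimal Python sketch of what "boiling a name down" could look like. It is not the Labs' actual algorithm, which isn't described here. The nickname table, the suffix list, and the function name normalize_person are all assumptions made up for illustration.

```python
import re

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {
    "billy": "william",
    "bill": "william",
    "jim": "james",
    "teddy": "edward",
    "ted": "edward",
}

# Assumed list of generational suffixes to ignore when comparing names.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def normalize_person(name: str) -> str:
    """Reduce a person's name to a lowercase, suffix-free comparison key."""
    # Lowercase, turn punctuation into spaces, and split into tokens.
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Drop suffixes such as "Jr." or "III".
    tokens = [t for t in tokens if t not in SUFFIXES]
    # Map a known nickname in the first position to its formal form.
    if tokens:
        tokens[0] = NICKNAMES.get(tokens[0], tokens[0])
    return " ".join(tokens)


print(normalize_person("Jim Jones") == normalize_person("James Jones III"))
```

The catch is that a key like this is only a guess: two different William Smiths collapse into one, so a real system would presumably weigh other evidence (employer, city, and so on) before merging records.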

My bedtime book of late has been doctor and New Yorker writer Atul Gawande's rather good Better: A Surgeon's Notes on Performance. I'm reminded here of his core argument: the most transformational changes in modern medicine are some of the simplest acts -- cutting down on hospital infection rates by getting doctors to wash their hands between patients, for example. It's often the most basic things that can be the most powerful.

*Note: Our Andrew Rasiej and Micah Sifry are senior advisors to the Sunlight Foundation, but that has little bearing on this post.

Comments

Guessing names is good, but...

having some definitive source to link to would be even better. If that could be standardized upon, then it would be even higher quality and easier to maintain. For example, if every reference to Ted Kennedy was followed by a definitive link about him (for example, his Wikipedia page at http://en.wikipedia.org/wiki/Ted_Kennedy or perhaps some other source), then the relationships would be much stronger. Obviously this would be harder to pull off in all the systems, but we could dream. When I first read the title of the blog post, I was thinking that the suggestion would be to standardize the usage of names themselves. Of course, that would be tough too. It sounds like the API you mention will help with the current situation of name usage, but I think some true standardization would help.

Definitive names

Well, for the members of Congress, there would be a definitive source -- the way that they're listed in the Congressional Directory. Teddy Kennedy's "official work name," for example, is "Edward M. Kennedy." But beyond MOCs, part of the naming project involves coming up with different approaches for different subjects, like always dropping the "Inc" part of a corporate name. Standardizing the way people use names is a hopeless mission, I'm afraid. The goal for a project like this is to create a very tolerant system that scoops up all sorts of references but is smart enough to know they're referencing the same real-world person or thing. It's a work in progress, to be sure, but all that's detailed on the wiki.
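To make the "dropping the Inc" idea concrete, here is a rough Python sketch of a tolerant matcher for corporate names. It is an illustration only, not the project's code. The designator list and the helper names org_key and group_mentions are assumptions.

```python
import re
from collections import defaultdict

# Assumed list of trailing corporate designators to ignore.
DESIGNATORS = {"inc", "incorporated", "corp", "corporation", "co", "llc", "ltd"}


def org_key(name: str) -> str:
    """Build a comparison key: lowercase, no punctuation, no trailing 'Inc' etc."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Strip designators from the end only, so words inside the name survive.
    while tokens and tokens[-1] in DESIGNATORS:
        tokens.pop()
    return " ".join(tokens)


def group_mentions(mentions):
    """Group raw mentions that share a key, keeping the original spellings."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[org_key(mention)].append(mention)
    return dict(groups)


groups = group_mentions(["Broadway Video, Inc.", "BROADWAY VIDEO INC", "NBC Studios"])
for key, spellings in groups.items():
    print(key, "<-", spellings)
```

Keeping the original spellings next to each key means nothing is thrown away: the system stays tolerant about what it accepts while still treating the variants as one entity. It does nothing for the harder parent and subsidiary case, which would need a separate mapping.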