Our friends at Wolfram|Alpha were kind enough to invite us to their inaugural Wolfram Data Summit, an invite-only data love fest in Washington, D.C. It was a veritable who’s who of the big data world, including heavyweights from big companies (e.g., Microsoft, D&B), government agencies (e.g., NASA, the Federal Reserve), and research institutions (e.g., Stanford).
The conference kicked off with a keynote from my colleague and host, Stephen Wolfram. His ambition with Wolfram|Alpha is to make all of the world’s data computable. Ours at Infochimps is to make all of the world’s data accessible. You think there might be some synergies there? The most interesting part of Stephen’s speech was his announcement of a new file format, CDF (Computable Document Format), which allows data to be computed (read: interacted with) on a web page. Stephen got some laughs when he told the audience he was a data enthusiast, as evidenced by his having logged every keystroke for the last 20 years. Another interesting bit was that Wolfram|Alpha aggregates source information at the bottom of every report it generates instead of detailing it, because almost every computation is made across multiple data sets.
In the next breakout session I learned about the openlibrary.org project (scanning the world’s books) and the Borgmann project (compiling a list of all words in the English language). Incredibly, the most difficult aspect is not the technology but defining the parameters of the projects. What is a book (vs. another form of publication)? What is a word (vs. another verbal expression)? Erin McKean, CEO of our partner Wordnik, thinks a word is anything that can be played in Scrabble and is unlikely to be challenged. We have a list of 350,000 words on Infochimps, but the Borgmann project will yield millions. I imagine we will get it on our site.
I wasn’t the only Austinite in attendance. Byron Reese, Chief Innovation Officer of Demand Media (and the guy who recruited my friend David Yehaskel to the company), spoke about the difference between data, knowledge, and wisdom. According to Byron, data are observable and measurable facts, knowledge is the interpretation of data, and wisdom is the application of values to knowledge. Infochimps is making data accessible so the world can interpret it and become more knowledgeable.
I spent most of lunch chatting with Derek Willis from the New York Times. He manages their APIs and joked that he prefers interviewing data as opposed to people because data doesn’t lie to his face. He got some laughs, but I’m not sure I agree. When the attendees at the conference were asked how many of them read product reviews online, everyone raised a hand. But when we were asked who writes product reviews online, only a handful of hands went up. Data is a product of how it’s collected, and that is the problem with crowdsourcing.
US News talked about how searching data needs to get simpler, especially for their customers who are making once-in-a-lifetime decisions (such as where to attend college). There was some debate as to how they calculated rankings. Some attendees believed US News should make the raw data available and build a widget that lets users create their own rankings based on their own weightings of the various inputs.
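To make the attendees’ suggestion concrete, here is a minimal sketch of what such a widget might compute under the hood: a weighted average of each school’s inputs, using weights the user picks. Everything in it is an assumption for illustration. The function name, the metric names, and the two schools with their numbers are made up, and this is not how US News calculates its rankings.

```python
def rank_schools(schools, weights):
    """Rank schools by a user-weighted average of their metrics.

    schools: {school_name: {metric_name: value scaled to 0..1}}
    weights: {metric_name: non-negative weight chosen by the user}
    Returns a list of (school_name, score) pairs, best score first.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    scored = []
    for name, metrics in schools.items():
        # Weighted average: each metric counts in proportion to its weight.
        score = sum(weights[m] * metrics[m] for m in weights) / total_weight
        scored.append((name, score))

    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up inputs, already scaled to the 0..1 range (illustrative only).
SCHOOLS = {
    "School A": {"graduation_rate": 0.9, "affordability": 0.3, "class_size": 0.5},
    "School B": {"graduation_rate": 0.7, "affordability": 0.8, "class_size": 0.6},
}

# One user cares mostly about cost, another mostly about graduation rate.
cost_focused = rank_schools(
    SCHOOLS, {"graduation_rate": 1, "affordability": 3, "class_size": 1}
)
outcome_focused = rank_schools(
    SCHOOLS, {"graduation_rate": 4, "affordability": 0, "class_size": 1}
)
print(cost_focused[0][0], outcome_focused[0][0])
```

The point of the sketch is that the same raw data produces a different winner depending on who is choosing the weights, which is exactly why some attendees wanted the raw data opened up.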
The BBC expounded on using ontologies, as opposed to taxonomies, to organize linked data in structured ways. The speaker recommended the following as a guide: lexical analysis -> classification -> disambiguation -> relationship extraction. This has allowed the BBC to build dynamic web pages that don’t require human content managers. The key: keep your ontology simple.
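For readers who want to see the shape of that four-step guide, here is a toy sketch of the stages chained together. It is not the BBC’s implementation. The gazetteer, the context hints, the entity IDs, and the "mentioned_with" relation are all hypothetical stand-ins, and each stage is deliberately naive.

```python
import re
from itertools import combinations

# Hypothetical gazetteer: surface term -> candidate (entity_id, entity_type).
GAZETTEER = {
    "jaguar": [("animal/jaguar", "Species"), ("company/jaguar", "Company")],
    "brazil": [("place/brazil", "Place")],
}

# Hypothetical context words that hint at each entity type.
CONTEXT_HINTS = {
    "Species": {"forest", "habitat", "wild"},
    "Company": {"car", "factory", "sales"},
}


def lexical_analysis(text):
    """Step 1: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(tokens):
    """Step 2: keep tokens the gazetteer knows, with their candidate types."""
    return {t: GAZETTEER[t] for t in tokens if t in GAZETTEER}


def disambiguate(candidates, tokens):
    """Step 3: per term, pick the candidate whose hints best match the text."""
    context = set(tokens)
    resolved = []
    for options in candidates.values():
        best = max(
            options,
            key=lambda opt: len(CONTEXT_HINTS.get(opt[1], set()) & context),
        )
        resolved.append(best)
    return resolved


def extract_relationships(entities):
    """Step 4: naively link every pair of entities found in the same text."""
    return [(a[0], "mentioned_with", b[0]) for a, b in combinations(entities, 2)]


def run_pipeline(text):
    tokens = lexical_analysis(text)
    candidates = classify(tokens)
    entities = disambiguate(candidates, tokens)
    return extract_relationships(entities)


print(run_pipeline("The jaguar lives in the forest of Brazil"))
print(run_pipeline("Jaguar opened a car factory in Brazil"))
```

Note how small the set of types is here. That is in the spirit of the speaker’s advice to keep the ontology simple, since every extra type adds more cases for the disambiguation step to get wrong.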
Ed. note: I will add hyperlinks and my thoughts on day 2 soon.