Illuminating the Dark Geoweb
These are notes from the WhereCamp Portland morning session on dark content and the geoweb. It was led by Paul Bissett, CEO of WeoGeo. About 15 people were at the session, and brought up some very interesting points.
What is the state of the Geoweb? One of the major problems is that relevant information is locked away from being indexed by search engines. We call this dark content because it is unlit and unsearchable.
How much dark content is out there?
- ~800 terabytes of data is currently searchable online.
- ~91,000 terabytes is inaccessible: non-searched, non-indexed digital content.
What does that mean? It means that less than one percent of the digital knowledge we’ve collected and stored online is available for our use.
That means there’s no indexing, no searching, and no synergistic use of that content. Accessing, verifying, and collecting data from the limited sources that are available becomes an enormous productivity sink for everyone involved, and only relatively ill-equipped, uninformed decisions can be made. What can we do?
Things need to be indexed
Say we want to buy a house, but we want to make sure we are purchasing in a safe area. You can look at a map of earthquake zones and another of tsunami zones, and you can even overlay those maps to see where the data intersects. Those are information layers.
But imagine doing that for every decision you make: having layer maps for everything you do and every choice you make. This probably doesn’t matter much if you’re deciding whether to go to Starbucks, but when you’re deciding where to put a water purification plant or a park and recreation system, it becomes very important. How do we do this?
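The overlay idea above can be sketched in a few lines. This is a minimal illustration, not a real GIS workflow: it assumes each hazard zone has been rasterized into a set of (row, col) grid cells, and the layer names and cells are invented for the example.

```python
# Hypothetical hazard layers, each a set of occupied grid cells.
earthquake_zone = {(0, 0), (0, 1), (1, 1)}
tsunami_zone = {(1, 1), (2, 1), (2, 2)}

layers = {"earthquake": earthquake_zone, "tsunami": tsunami_zone}

def hazards_at(cell, layers):
    """Return the names of every hazard layer covering this cell."""
    return sorted(name for name, zone in layers.items() if cell in zone)

# A cell inside both zones is flagged by both layers; a clear cell by none.
print(hazards_at((1, 1), layers))  # ['earthquake', 'tsunami']
print(hazards_at((3, 3), layers))  # []
```

Real tools do the same intersection test against polygon geometry rather than grid cells, but the decision logic — query every layer at one location — is the same.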
The good thing is that ever since Google Earth launched, geography has become cool.
But it’s one thing to use the data, and another to contribute to it and make it richer and more usable. The other problem is that most of these data sets are not text-based. They require a series of information-unwrapping steps to dissect them into usable content.
You need the tools to be able to do this. You can find a file, but then you must also be able to get into it. There are also decision processes surrounding that data. Each file is different, which is why a lot of it stays in the dark. When data is so separate and stuck in silos, the processing systems become as important as the data itself. There is no metadata standard (or set of standards) that would at least allow cross-indexing of different data and content, and such a standard is essential for sharing the processing of data.
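The cross-indexing problem can be made concrete with a small sketch. The assumption here is that each provider describes its data with different field names; the two records, their field names, and the minimal common schema are all invented for illustration.

```python
# Hypothetical metadata records from two different providers.
records = [
    {"title": "Portland flood zones", "fmt": "shapefile"},
    {"name": "Oregon DEM tiles", "format": "GeoTIFF"},
]

def normalize(rec):
    """Map provider-specific field names onto one minimal common schema."""
    return {
        "title": rec.get("title") or rec.get("name", ""),
        "format": rec.get("fmt") or rec.get("format", ""),
    }

# Build a keyword index over the normalized records: this is the
# cross-indexing that a shared metadata standard would make trivial.
index = {}
for rec in map(normalize, records):
    for word in rec["title"].lower().split():
        index.setdefault(word, []).append(rec["title"])

print(index["portland"])  # ['Portland flood zones']
```

With a shared standard the `normalize` step disappears; without one, every indexer has to write a translation like it for every provider.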
Existing metadata standards are cumbersome, and there is limited motivation to use or decipher them. There is also the problem of scalability: large data sets are difficult to break down into usable chunks.
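One common way to break a large data set into usable chunks is fixed-size tiling, as map servers do. A minimal sketch, assuming the data set is a width × height raster (the dimensions and tile size below are invented):

```python
def tile_bounds(width, height, tile):
    """Yield (x0, y0, x1, y1) pixel bounds for each tile, clipped at the edges."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(x + tile, width), min(y + tile, height))

# A 1000x600 raster cut into 512-pixel tiles yields a 2x2 grid of tiles,
# with the edge tiles smaller than the rest.
tiles = list(tile_bounds(width=1000, height=600, tile=512))
print(len(tiles))   # 4
print(tiles[-1])    # (512, 512, 1000, 600)
```

Each tile can then be fetched, indexed, or processed independently, which is what makes an otherwise unwieldy data set searchable piece by piece.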
The openness of data depends on the culture around it. Government data has a different culture than MySpace. A single company may hold the rights to a data set it created, and getting access to it can be very expensive.
All data should be sharable — so that people can build upon each other’s work.
Also see: WhereCamp