Everyone is enamored of Google's search prowess. One trillion pages indexed, and all that.
The web pages that Google indexes are just the tip of the iceberg. Google helps you find information, but it doesn't help you search databases. You can't search Google for "the lowest airfare from Portland to Boston"; I mean, you could, but it would be a fruitless search.
Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.
It is that stream of data that the next evolution of search engines is working to reach. In a manner of speaking, they are trying to solve what Google claims is the final 10% of the search equation.
“Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”
That haystack is infinitely large. With millions of databases connected to the Web, and endless possible permutations of search terms, there is simply no way for any search engine — no matter how powerful — to sift through every possible combination of data on the fly.
To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. For example, if a user types in “Rembrandt,” the search engine needs to know which databases are most likely to contain information about art ( say, museum catalogs or auction houses), and what kinds of queries those databases will accept.
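The brokering idea above can be sketched in a few lines. This is a toy illustration, not Google's actual system: the topic keywords and database names are invented, and a real broker would use learned models rather than hand-written keyword sets.

```python
# Toy query broker: route a search term to the databases most likely
# to answer it. Topic labels and database names are invented examples.

TOPIC_KEYWORDS = {
    "art": {"rembrandt", "picasso", "vermeer", "painting"},
    "travel": {"airfare", "flight", "portland", "boston"},
}

DATABASES_BY_TOPIC = {
    "art": ["museum_catalog", "auction_house"],
    "travel": ["flight_schedules"],
}

def broker(query: str) -> list[str]:
    """Return the databases a query should be forwarded to."""
    terms = set(query.lower().split())
    targets = []
    for topic, keywords in TOPIC_KEYWORDS.items():
        if terms & keywords:  # any overlap between query and topic words
            targets.extend(DATABASES_BY_TOPIC[topic])
    return targets
```

So a query for "Rembrandt" would be forwarded to the art-related databases, while "lowest airfare from Portland to Boston" would land on the flight-schedule side.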
That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.
“This is the most interesting data integration problem imaginable,” says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.
One search engine striving to advance web search is Kosmix. Has anyone used it, or is anyone familiar with it? Having taken only a cursory spin through the site, I like that it integrates information from Twitter, catalogues, news sites, blogs, etc.
But Google's solution to the problem suggests that we're moving toward a semantic web. Google is spidering databases via search queries and then building database models from the results.
Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
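The guess-and-check probing described above can be sketched as follows. This is a minimal mock-up under stated assumptions: `query_fn` stands in for an actual HTTP form submission, and the resulting dictionary is only a crude stand-in for the predictive model the article describes.

```python
# Minimal sketch of form probing: submit candidate terms to a database's
# search form and record which ones return results. `query_fn` is a
# stand-in for a real form submission; here it is any callable that
# takes a term and returns a list of matches.

CANDIDATE_TERMS = ["Rembrandt", "Picasso", "Vermeer", "Monet"]

def probe(query_fn, terms):
    """Return a map of term -> hit count for terms that matched."""
    hits = {}
    for term in terms:
        results = query_fn(term)
        if results:
            hits[term] = len(results)
    return hits  # a crude picture of what the database contains
```

Run against a fake art database that only knows two painters, the probe would discover those two and ignore the rest, which is essentially how the crawler narrows down a database's subject matter.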
That, of course, would make the Ubiquity Firefox extension that much more useful, and it points to the true future of web interfaces.