PrimeurLive! from the Heidelberg Supercomputer Conference, ISC2002, June 2002

The Mannheim Supercomputer Seminar is the main HPCN event in Europe. This year we publish two live issues from the event:

Contents of PrimeurLive!:

Advertisement

Advertisement
Scali

Advertisement
Scali

Advertisement
Sun

Advertisement
Platform




Fast and easy Web searching for surfers takes smart parallel supercomputing in the wings

Heidelberg 20 jun 2002 Monika Henziger from Google Inc. was invited to give the first key note presentation at ISC2002. She addressed the issues which have to be considered allowing users to efficiently and quickly find on the Web what they are searching for. To dig up relevant user information out of the two to ten billion pages on the Web, that are doubling every eight months, Google relies on smart algorithms and parallelism. Sounds like a typical job for powerful supercomputing.

Out of the 260 million surfers on the web each month, some 80 percent are querying specific search requests through powerful engines such as Google. Monika Henziger explained how her company applies distributed supercomputing to provide web users with exactly those hyperlinks which are relevant to their information needs.

To achieve this, you need to perform two basical tasks which are processing the collection of documents on the web as well as the incoming queries. In classical information retrieval, as the speaker pointed out, you can rank the data collection according to query term frequency. This is no big deal since queries tend to be long and well specified and the documents in the collection are coherent and well authored with a relatively small vocabulary.

On the web however, there is a little bit more to it. There you are confronted with a bulk of over two billion multi-lingual pages of variable quality and large diversity and a vocabulary between ten to hundred million of words. Lots of pages are duplicates or consist of non-running texts such as home pages or bookmarks. How can user needs be met in such an unstable environment?

The challenge faced by Google is enormous given the fact that there are more than 93,5 million unique web users searching for diverse topics of interest. Research shows that 85 percent of users only look at top 10 results and 78 percent of queries are not modified, as Mrs. Henziger stated.

Fortunately, the web offers some quite efficient tools to the user in order to tackle the search queries. On the side of the data collection managers, instruments are available to generate hyperlinks, analyse redundancy, produce statistics, and create interactivity with the user. Yet, there is also a big disadvantage, according to the speaker, since ranking through query term frequency does not function all that well due to the specific web pages characteristics in terms of vocabulary, spam, and huge variety. User queries are short, and therefore less precise.

The Google approach to solve these difficulties is as follows. The quality of a page is related to its in-degree and the quality of the pages linking to it. This means that a link from page A to page B is a recommendation of page B by the author of A, the speaker explained. In this way, PageRanks can be built. The PageRank of a page p can be defined as the fraction of steps which the surfer spends at p in the limit. Google has transformed the definition in a Markov algorithm.

Working with PageRanks is extremely useful because they are query-independent and highly spam-resistant. They give an overview of the general web opinion of the page relevance and are patented, as Mrs. Henziger noted.

The PageRank algorithm being established, now comes in the process of parallelism. There are three components in web indexing, the speaker summed up. The "Crawler" part of the search engine collects the documents, the "indexer" processes and represents the information, and the "query handler" processes all user queries. The parallelisation process involves some risks to be avoided such as re-crawling and the tracking of infinite spaces as well as server or connection overload. The latter can be solved with virtual hosting and re-crawling with session-ids. Google also has 23 different types of content to distribute the data.

Indexing the collection of data involves a huge parallel sorting of the words in an ordered array of their positions in the concatenation of one huge document. In a next step, the data are redirected, provided with anchor texts and incremental updates.

As far as the more than 150 million daily Google user queries are concerned, a sub-second response time can be achieved thanks to the use of over 10.000 Linux-based systems representing a supercomputer power of more than 10 teraflops, Mrs. Henziger noted.

Future challenges to meet for Google are raising the scalability - since web data and traffic are continually growing - and enhancing the fault tolerance. The scalability can be optimised by distributing the index across multiple machines and by replicating everything. Replication also comes in hand to make the parallel system of thousands of PCs more reliable.

In conclusion, Mrs. Henziger cited a few current projects in which Google is involved such as the folding@home distributed computing project at Stanford for protein analysis; the SOAP interface for searches, spelling requests and cached pages; and the open developer programme.


Leslie Versweyveld

[News on Advanced IT][Calendar][Analysis][IT in Medicine]