Thursday, September 28, 2006

Random Sampling from a Search Engine's Index

Still to see/read about this. But from the inital impression seems to be a method to estimate a search engine's size.

Google TechTalks

August 17, 2006

Ziv Bar-Yossef joined Google from the Technion - Israel Institute of Technology in Haifa, Israel. He received his PhD from UC Berkeley in 2002, and was a Research Staff Member at the IBM Almaden Research Center prior to joining Technion.

ABSTRACT
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface?

In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents.

0 Comments:

Post a Comment

<< Home