Saturday, September 30, 2006

Biasing Web results for topic familiarity

This post is based on a research paper written by Omid Madani and Rosie Jones of Yahoo! Research and Giridhar Kumaran of the University of Massachusetts. Its premise is that, depending on the user's familiarity with the search topic, it may be appropriate to give him either introductory or advanced search results.

The findings are based on a four-fold procedure. Firstly, definitions of introductory and advanced web pages are given.

An introductory web page is defined as a page that does not presuppose any background knowledge of the topic and, to some extent, introduces or defines the topic's key terms.

An advanced web page is one that assumes sufficient background knowledge of the topic and builds upon it.

It is then shown that the definitions above hold, via inter-labeler agreement: three annotators are asked to label randomized sets of results for particular queries and are found to agree about 70% of the time. Based on their labeling, it is also found that search engines show roughly equal bias towards introductory and advanced web pages. The precision for an introductory page at position 1 is slightly above 0.5, showing that search engines generally make the top result an introductory one. The work tries to improve the precision for introductory documents at positions 1 to 10.

An experiment scored the introductory and the advanced documents with the Fog, Flesch and Kincaid readability indices. All of them marked the documents as hard to read and were unable to distinguish the introductory from the advanced, showing that reading-level measures alone are not enough to tell the two apart. Another experiment expanded queries with introductory trigger words, but this did not bring about a significant improvement in the rankings of the introductory documents.
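
For reference, the three readability indices are simple functions of word, sentence and syllable counts. Here is a rough sketch of how they could be computed (my own illustration, with a naive syllable counter, not the paper's code):

import re

def count_syllables(word):
    # Very rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    wps = n_words / sentences      # words per sentence
    spw = syllables / n_words      # syllables per word

    flesch = 206.835 - 1.015 * wps - 84.6 * spw          # Flesch reading ease
    kincaid = 0.39 * wps + 11.8 * spw - 15.59            # Flesch-Kincaid grade level
    fog = 0.4 * (wps + 100.0 * complex_words / n_words)  # Gunning Fog index
    return {"flesch": flesch, "kincaid": kincaid, "fog": fog}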

Thus a familiarity classifier was developed using reading-level measures, the distribution of stop words in the text, and non-text features such as the average line length. Once trained, the classifier can label documents as introductory or advanced. It can be used to increase the precision at the top ranks by placing more results of the desired type there. Relevance cannot be increased this way; however, documents can be classified at crawl time, which addresses this problem too.
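
A minimal sketch of such a classifier, assuming a hand-labelled training set and a generic off-the-shelf learner (the paper's actual features and learning algorithm may differ):

import re
from sklearn.linear_model import LogisticRegression

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "it"}

def familiarity_features(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    lines = [l for l in text.splitlines() if l.strip()]
    n_words = max(1, len(words))
    stop_fraction = sum(1 for w in words if w in STOP_WORDS) / n_words  # stop-word distribution proxy
    avg_line_len = sum(len(l) for l in lines) / max(1, len(lines))      # non-text feature
    avg_word_len = sum(len(w) for w in words) / n_words                 # crude reading-level proxy
    return [stop_fraction, avg_line_len, avg_word_len]

def train_familiarity_classifier(docs, labels):
    # docs: list of page texts; labels: 1 = introductory, 0 = advanced (hand-labelled)
    X = [familiarity_features(d) for d in docs]
    return LogisticRegression().fit(X, labels)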

The study was able to re-rank documents, producing a statistically significantly higher proportion of introductory documents at 5 documents retrieved and at 10 documents retrieved, over baseline search-engine retrieval. This kind of topic-independent, user-independent classifier is empowering for personalized search: with a single change to the retrieval re-ranking, any user can specify whether they want introductory or advanced documents for any query.
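
The metric itself is just the proportion of introductory documents among the top k results; as a quick sketch:

def precision_at_k(ranked_labels, k):
    # ranked_labels[i] is True if the i-th ranked document is introductory
    return sum(ranked_labels[:k]) / float(k)

# e.g. precision_at_k([True, False, True, True, False], 5) -> 0.6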

Further work in this area would be to integrate a user profile so that the user's knowledge level is inferred automatically, and the user does not have to state explicitly whether he wants advanced or introductory results. In such a scheme, the majority of the top 10 results would match the profile information; if the user clicks on the minority results, it is clear that he wants the opposite kind of information. The classifier could also include more features to better separate advanced documents from introductory ones.

Personalized Web Search Using Query Expansion

This post is about a paper written by Dr. Pabitra Mitra of IIT Kharagpur. It describes how to present personalized search results to users by expanding queries, using probabilistic expansion and collaborative filtering to make ambiguous queries more meaningful.

For example, given the search keyword "apple", a user might be looking for the fruit or might be searching for Apple computers. A typical search engine returns the same results regardless of who submitted the query. Hence the need arises for a personalized web search system that ranks results relevant to the particular user highly.

Query expansion removes the ambiguity in the user's query by adding words that capture the user's intention.


The proposed personalization framework consists of the following steps (a rough sketch of this loop follows the list):

  • The user gives a query to the search engine.
  • Query expansion (personalization) is done based on the user's profile. This helps disambiguate the meaning of the given query in order to provide relevant results to the user.
  • The expanded query is submitted to the search engine.
  • The search engine's results for the expanded query are presented to the user.
  • The documents the user retrieves for the given query are marked, to update the user profile.
  • The user profile is updated periodically in an offline process.
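
A minimal sketch of this loop, with the search and click-collection steps passed in as callables (all names here are hypothetical placeholders, not from the paper):

from collections import Counter

class UserProfile:
    # Toy profile: a bag of terms from documents the user has marked.
    def __init__(self):
        self.terms = Counter()
        self.pending = []                  # marked docs awaiting the offline update

    def top_terms(self, k=2):
        return [t for t, _ in self.terms.most_common(k)]

def personalized_search(query, profile, search, get_clicks):
    # search(query) -> ranked docs; get_clicks(docs) -> docs the user marks
    expanded = " ".join([query] + profile.top_terms())   # step 2: expand with profile terms
    results = search(expanded)                           # steps 3-4: run query, present results
    profile.pending.extend(get_clicks(results))          # step 5: mark documents
    return results

def offline_update(profiles):
    # step 6: periodically fold the marked documents into each profile
    for p in profiles:
        for doc in p.pending:
            p.terms.update(doc.lower().split())
        p.pending.clear()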

The Algorithm consists of an offline phase and an online phase.

The offline phase deals with building up the user profile probabilistically. Initially, query sessions for a user are formed by expanding the user's existing document space with documents that are relevant to the particular query. Similar users are found by taking the cosine similarity between the document spaces of different users. This phase also performs pseudo-query selection and updates the profile to adapt to the user's interests.
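
For example, if each user's document space is reduced to a bag of terms, the similarity between two users can be computed as below (a standard cosine-similarity sketch, not necessarily the paper's exact formulation):

import math
from collections import Counter

def user_similarity(terms_a, terms_b):
    # terms_a / terms_b: lists of terms from each user's document space
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0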

The online phase deals with real-time query expansion. The query is expanded by 'n' document terms whose cohesion weight for the given query is above a certain threshold.
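
A sketch of the online step, assuming cohesion is approximated by how often a candidate term co-occurs with the query terms in the user's profile documents (the paper's actual cohesion weighting may differ):

def expand_query(query_terms, profile_docs, n=3, threshold=0.2):
    # profile_docs: list of term lists from the user's document space
    candidates = {t for doc in profile_docs for t in doc} - set(query_terms)
    total = max(1, len(profile_docs))

    def cohesion(term):
        # fraction of profile documents where the term co-occurs with a query term
        return sum(1 for doc in profile_docs
                   if term in doc and any(q in doc for q in query_terms)) / total

    scored = sorted(((cohesion(t), t) for t in candidates), reverse=True)
    return list(query_terms) + [t for w, t in scored[:n] if w >= threshold]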

Experimental evaluation is done using precision and relative recall as performance metrics. Generally, a user is interested in the top results and does not go beyond the first result page (i.e. the top 10 results). So the user is asked to submit a query and to judge the top 10 documents retrieved for the original query and for the personalized query, marking each document as relevant or irrelevant. Here, precision is computed as the ratio of the number of relevant documents marked by the user to the number of documents presented to the user, and relative recall is computed as the ratio of the number of relevant documents marked by the user to the total number of distinct relevant documents retrieved for the original and expanded queries.
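
In code, the two metrics are a direct transcription of the definitions above (a small sketch):

def precision(relevant_marked, presented):
    return len(relevant_marked) / float(len(presented))

def relative_recall(relevant_marked, relevant_original, relevant_expanded):
    distinct_relevant = set(relevant_original) | set(relevant_expanded)
    return len(set(relevant_marked)) / float(len(distinct_relevant))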

Overall, there is a significant increase in the average precision values without degradation in the average relative recall. The cases where the system performed poorly or gave no improvement were:

  • the user's frequent change of interests,
  • very specific queries by the user, and
  • no correlation between the queries and the user profile learnt by the system.

Mainly, the system was able to handle short queries and expand them correctly, increasing the information content of the query so that it reflects the user's intention.


The major advantages of the system are that it does not require explicit feedback, and that it performs collaborative filtering, user clustering, and pseudo-query term selection for better query expansion, even from a sparse browsing history.

Large Scale Query Analysis for Personalization Opportunities

This is a research paper by Omid Madani of Yahoo! Research, which presents a large-scale analysis of query logs to find opportunities for personalization. The query logs cover 1.35 million browser cookies over a period of 6 months. The paper tries to show the following three results:

  • There is significantly more history available for the user behind a random query than for a random user.
  • Users exhibit consistent topical interests that vary between users.
  • Users' clicks can reveal their special interests.

The paper starts with

“Interacting with search engines has traditionally been an impersonal affair, with the returned results a function only of the query entered”

which is exactly the problem. Search engines have to learn to show different results to different people, depending on the users' interests. For this, the paper proposes a two-fold model:

“We define contextualization as integrating a user’s history into the results ranking. We break contextualization into two types: 1) personalization is considering a user’s long-term interests, and 2) adjustment is reacting to the user’s short-term action history. Personalization and adjustment are complementary approaches to integrating user history”.

“With adjustment, a search engine could quickly react to a user’s actions”

“integrating longterm interests could address the “cold-start” situation when a user searches after a period of inactivity, or begins a new search need”


They measure user activity as follows:

A graph is plotted of the number of searches N against the number of users performing that many searches. Another graph is plotted of N against the percentage of queries coming from users performing at least N searches. The second graph amounts to taking a random query from the sample and finding the probability that the query's user performed at least N searches. This metric seemed more promising than the first.
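
The difference between the two views can be made concrete over a log of (user, query) events; the counting below is my own illustration, not the paper's code:

from collections import Counter

def history_metrics(log, n):
    # log: list of (user, query) events; n: minimum number of searches
    per_user = Counter(user for user, _ in log)
    heavy_users = {u for u, c in per_user.items() if c >= n}
    frac_users = len(heavy_users) / float(len(per_user))          # view 1: random user
    frac_queries = sum(1 for u, _ in log
                       if u in heavy_users) / float(len(log))     # view 2: random query
    return frac_users, frac_queries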


The second section of the paper focuses on finding the areas of interest of the user.

“ The idea of personalization is grounded on two related assumptions about user behavior. Our first assumption is that users have reasonably consistent interests. A user’s history will only be useful if her previous actions help us predict her future interests. This will be true if her interests are consistent over time. Our second assumption is that users have different interests from one another. After all, if everyone has the same interests there is no need for personalization!”

“Note that knowing the range of topics that a user is primarily interested can have a number of personalization oriented applications, such as reordering returned pages as well as displaying ads of interest”

A query categorizer is employed to determine the topics for a query, and based on those topics the interest distribution of each user is estimated. It is found that a user's interest distribution shows a positive convergence over time and is different from those of other users.
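
One way to picture this (my own sketch; the paper uses its own categorizer and convergence measure) is to build a topic histogram per user and compare the distributions estimated from the first and second halves of the user's query history:

from collections import Counter

def topic_distribution(topics):
    counts = Counter(topics)
    total = float(sum(counts.values())) or 1.0
    return {t: c / total for t, c in counts.items()}

def convergence(user_topics):
    # user_topics: chronological list of topic labels for one user's queries.
    # Returns the overlap between the two halves of the history
    # (1.0 = identical distributions, 0.0 = disjoint interests).
    half = len(user_topics) // 2
    first = topic_distribution(user_topics[:half])
    second = topic_distribution(user_topics[half:])
    return sum(min(first.get(t, 0.0), second.get(t, 0.0))
               for t in set(first) | set(second))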


The third section focuses on the information that the user's clicks contain. It is noteworthy that a click on portal.acm.com reveals more information than a click on Yahoo!. Also, if a user returns to a URL they clicked earlier, it indicates the URL contained the right information for them, rather than being of no use. Combining both of the above features we obtain rare and sticky domains: rare in the sense that very few users click on them, and sticky in the sense that users return to them. Many of these domains fall into a category that may be called special interest: they are not generally popular, but people have a tendency to return to them. This click information could be applied immediately for re-ranking results. For example, a single click on CiteSeer could dramatically improve results for the rest of your search, even without any prior user history.
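
A sketch of how rare and sticky domains could be extracted from a click log of (user, domain) events (the thresholds and exact definitions here are hypothetical, not the paper's):

from collections import defaultdict

def rare_sticky_domains(clicks, max_users=100, min_return_rate=0.3):
    # rare: clicked by few distinct users; sticky: a large share of its users return
    users_per_domain = defaultdict(set)
    clicks_per_domain_user = defaultdict(int)
    for user, domain in clicks:
        users_per_domain[domain].add(user)
        clicks_per_domain_user[(domain, user)] += 1

    special = []
    for domain, users in users_per_domain.items():
        returners = sum(1 for u in users if clicks_per_domain_user[(domain, u)] > 1)
        if len(users) <= max_users and returners / float(len(users)) >= min_return_rate:
            special.append(domain)
    return special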

Clearly, the interest-convergence analysis supports the long-term personalization approach, while the click information is better suited to short-term adjustment.

Thursday, September 28, 2006

As it's holiday time here at Kgp, I am exploring many possibilities. At the moment I am mostly interested in personalized search. Here is some material that I am looking forward to reading:

Personalized Web Search using Probabilistic Query Expansion
This is my guide Dr. Pabitra Mitra's paper. It's not available online.
New Directions in Search
A Large-Scale Analysis of Query Logs for Assessing Personalization Opportunities
Recommender Systems Research at Yahoo! Research Labs
Biasing web search results for topic familiarity

I am also planning on reading a few articles like this, this, this, this and this by Greg Linden.

Also, there is a conference on Next Stage Recommender Systems.

Also there's Yahoo Mindset.


Also there are some papers on Spam analysis that might be interesting:
Spam, Damn Spam, and Statistics
Detecting Spam Web Pages through Content Analysis

Thanks Chirayu for the links.

Random Sampling from a Search Engine's Index

Still to see/read about this, but from the initial impression it seems to be a method to estimate a search engine's size.

Google TechTalks

August 17, 2006

Ziv Bar-Yossef joined Google from the Technion - Israel Institute of Technology in Haifa, Israel. He received his PhD from UC Berkeley in 2002, and was a Research Staff Member at the IBM Almaden Research Center prior to joining Technion.

ABSTRACT
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface?

In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability of this document to be selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents.
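
To get a feel for the weighting step, here is a minimal rejection-sampling sketch (my own illustration, not the paper's algorithm): if each sampled document's selection probability is known, keeping it with probability proportional to the inverse of that weight yields near-uniform samples.

import random

def rejection_resample(samples, weights, w_min):
    # samples[i] was drawn with (approximate) selection probability weights[i];
    # w_min must be a lower bound on those probabilities.  Each sample is kept
    # with probability w_min / weights[i], so accepted documents are near-uniform.
    return [doc for doc, w in zip(samples, weights) if random.random() < w_min / w]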

SIMS 141 - Search, Google, and Life: Sergey Brin - Google

Search Engines: Technology, Society, and Business. The World Wide Web brings much of the world's knowledge into the reach of nearly everyone with a computer and an internet connection. The availability of huge quantities of information at our fingertips is transforming government, business, and many other aspects of society. Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities. The Instructor, Dr. Marti Hearst, is an associate professor in the School of Information at UC Berkeley, with an affiliate appointment in the Computer Science Division. The UC Berkeley School of Information was created in 1994 to address one of society’s most compelling challenges: enabling people to create, find, manipulate, share, store, and use information in myriad forms. [courses] [is141] [fall2005]

This appears to be a general talk by Sergey Brin. He doesn't reveal much information about Google, but it's nice to see that there are courses like Search Engines: Technology, Society and Business offered at Berkeley. Just have a look at the lecture schedule for this course; I am sure you will find something interesting.